An Academic Budget Inspired Raquel Urtasun to Design Affordable Solutions for Self-Driving

LDV Capital is focused on investing in people building visual technology businesses. Our LDV Vision Summit explores how visual technologies leveraging computer vision, machine learning, and artificial intelligence are revolutionizing how humans communicate and do business.


Raquel Urtasun is a recipient of the NVIDIA Pioneers of AI Award, three Google Faculty Research Awards, and several other honors. She lectures at the University of Toronto and the Vector Institute and is the head of Uber ATG Toronto. At our LDV Vision Summit 2017, she spoke about how autonomous vehicles with human-level perception will make our cities smarter and better to live in.

It's my pleasure to be here today, and I wanted to introduce who I am just in case you guys don't know.

So I have three jobs, which keeps me quite busy. I am still an academic; one day a week I am at the University of Toronto and the Vector Institute, which I co-founded with the whole bunch of people that you see in the picture, including Geoff Hinton. And the latest, greatest news, I guess: as of May 1st, 2017, I'm also heading a new lab of Uber ATG in Toronto, so self-driving cars are in Canada now, and that's really, really exciting.

Today, I'm going to talk about what led to the Uber acquisition [of the University of Toronto team]. Perhaps you have already seen another discussion about why we need self-driving cars, but what is very important for me is that we need to lower the risk of accidents, we need to provide mobility for the many people that right now cannot go to the places they want to go, and we need to think of the future of public transportation and ride-sharing. In particular, we need to share resources. Ninety-five percent of the time the car is parked, so we are just using up our planet without a real reason.

© Robert Wright/LDV Vision Summit 2017


If we look at what is typically going on in self-driving car companies, we find they're pretty good at localization, path planning, and obstacle avoidance, but there are two things that they do which actually make them not super scalable. The first thing is LiDAR; the prices are dropping, but it is still quite expensive to buy a decent LiDAR. And the other thing, which is the skeleton in the closet, is actually mapping.

What I have been working on for the past seven years is how to make solutions that are scalable, meaning cheap sensors and trying to drive without maps, or with as little prior knowledge as possible.

Now, if you want to do something of this form, we need to think about many different things at once. The first thing that was difficult for us as academics was data, and so we created, many years ago, what is still the only benchmark for self-driving: KITTI. And to my despair, this is still the only benchmark, which I don't understand.

If we want to get rid of the LiDAR and get rid of the maps, one of the things that we need is robust, good, and fast stereo 3D reconstruction.

The other thing that is important is learning. Right, one can't just handcraft everything, because we need to be robust to scenarios that we have never seen before. We need holistic models that reason about many things at once. At the end of the day, we have a fixed computation budget across many tasks, and we need to think of the hardware at the same time.

If we want to get rid of the LiDAR and get rid of the maps, one of the things that we need to do is apply deep learning to get robust, good, and fast stereo 3D reconstruction. This can run in real time and, up to forty meters, can basically almost replace the LiDAR.
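To see why stereo can stand in for LiDAR at short range but degrades beyond it, here is a minimal sketch of the standard disparity-to-depth conversion; this is background geometry, not the actual model from the talk, and the focal length and baseline defaults are placeholders that only roughly match a KITTI-style rig.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px=721.5, baseline_m=0.54):
    """Convert a stereo disparity map (in pixels) to metric depth (meters).

    Standard pinhole-stereo relation: depth = focal_length * baseline / disparity.
    The default values are placeholders, roughly a KITTI-style camera rig.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0  # zero disparity means no match / point at infinity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Depth error grows quadratically with distance for a fixed disparity error,
# which is why stereo is LiDAR-like near the car but weaker far away.
print(disparity_to_depth(np.array([9.74, 4.87])))  # ~[40 m, 80 m]
```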

Other things that you need to do involve perception. We spent the past year and a half obsessed with instance segmentation. This is where you're segmenting the image: the idea is that you have a single image and you are interested in labeling every pixel, not just with the category, car or road, but also estimating that this is one car, this is another car, etc. And this is a particularly difficult problem for deep learning, because the loss function has to be agnostic to the permutation of the instances. So we've built some interesting technology lately based on the watershed transform. It scales really well: it's independent of the number of objects, so you can run in real time for anything. And this shows generalization: it's trained on one set of cities and tested on another set of cities. You see the prediction in the middle and the ground truth on the right. Okay so, even with crowded scenes, [the model] can actually do pretty well.
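The watershed idea can be sketched without the network: if a model predicts an "energy" map that is high inside objects and low near boundaries, instances fall out of a threshold plus connected components, at a cost independent of how many objects are present. A toy sketch of that post-processing step, assuming such an energy map exists (this uses scipy and is not the actual pipeline from the talk):

```python
import numpy as np
from scipy import ndimage

def instances_from_energy(energy, level=0.5):
    """Cut a watershed-style energy map at a fixed level and label
    each connected basin as one instance.

    `energy` is assumed high inside objects and low at boundaries;
    this step's cost does not grow with the number of objects.
    """
    labels, num = ndimage.label(energy > level)
    return labels, num

# Toy energy map: two "cars" separated by a low-energy boundary column.
energy = np.zeros((5, 8))
energy[1:4, 1:3] = 0.9
energy[1:4, 5:7] = 0.9
labels, num = instances_from_energy(energy)
print(num)  # 2 instances
```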

Now, if you want to do self-driving, labeling pixels is not going to get you there. Right, so you need to really estimate what's happening everywhere in the scene. These are our latest, greatest results on detection and tracking. This is actually very interesting technically: you can backpropagate through solvers. And here, you see the results of what we have as well.
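One way to get intuition for "backpropagating through a solver" is to replace a hard, non-differentiable step with a smooth surrogate. The sketch below does this for detection-to-track matching, using a row-wise softmax in place of a hard assignment; it is a toy illustration of the general idea, not the formulation from the talk.

```python
import numpy as np

def soft_assignment(cost, temperature=0.1):
    """Differentiable stand-in for hard detection-to-track matching:
    a row-wise softmax over negative matching costs.

    Because every step is smooth, gradients of a tracking loss can
    flow back into whatever network produced the costs -- the sense
    in which one can backpropagate through the solver.
    """
    logits = -cost / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

# Two detections vs. two tracks; a low cost means a good match.
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2]])
print(soft_assignment(cost).round(3))  # close to a permutation matrix
```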

In general, what you want to do is estimate everything that is in the scene. So here, we have some results that we had even a couple of years ago, with a single camera mounted on top of the car. The car is driving in intersections it has never seen before and is able to estimate the local map of the intersection; it is creating the map on the fly. It is estimating where your car is, doing localization, as well as estimating where every car is in the scene, and the traffic situation that you see on the bottom left, even though it doesn't see the traffic signs or things like that. The cars are color-coded with their intentions: basically, here we are estimating where everybody is going in the next couple of seconds. And this is, as I said, [with a] single camera [and] new scenarios that we haven't trained on.

Another thing that you need to do is localization. Localization is an interesting problem, because typically it is done the same way as mapping: you go around and you collect how the world looks, and that's really expensive, meaning that basically you need to know the appearance of the world [the cars] are in at every point in time.

It takes thirty-five seconds of driving to actually localize with a precision of 2 meters

We look at a cartographic map and the motion of the vehicle to estimate really quickly where the vehicle is in the global coordinate system. Okay, so you see here, you have a probability distribution over the graph of the road. As the vehicle drives, you have a few modes of the distribution, and very quickly we know exactly where this vehicle is.

This is a Manhattan-like scenario, so there are two modes of the distribution, but again, soon we converge to a single location. And this is for the whole city of Karlsruhe, which is two thousand kilometers of road. It takes thirty-five seconds of driving to actually localize with a precision of 2 meters, which is the precision of the maps that we use. These maps are available for free online for sixty percent of the world, so you can just download them; you don't need to capture anything; it's free.
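One way to picture this kind of map-based localization is as a Bayes (histogram) filter over road segments: start from a uniform belief and repeatedly reweight each segment by how well its geometry matches the vehicle's observed motion. A toy sketch under those assumptions follows; the real system, matching driving against freely available map data, is far richer.

```python
import numpy as np

def bayes_filter_step(belief, likelihood):
    """One update of a histogram filter over road segments.

    `belief` is P(vehicle is on segment i); `likelihood` scores how well
    the observed turns/odometry match each segment's geometry. Segments
    that keep matching accumulate probability and the rest decay, which
    is why a few seconds of driving can collapse the posterior to a
    single location.
    """
    posterior = belief * likelihood
    return posterior / posterior.sum()

# Four candidate segments; the vehicle's motion matches segment 2 best.
belief = np.full(4, 0.25)
likelihood = np.array([0.1, 0.2, 0.9, 0.15])
for _ in range(5):  # a few observations in a row
    belief = bayes_filter_step(belief, likelihood)
print(belief.round(3))  # mass concentrates on segment 2
```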

Now, in terms of mapping: why do car companies, or self-driving players, use maps? You can think of a map as a sensor which basically tells you the static part of the scene. It gives you robustness and it allows you to only look at the dynamic objects.

The problem with the way the mapping is done is that you have, say, one of these cars with these expensive sensors, and basically you drive around the world, you have your data, and then there is some labeling process where you basically say where the roads are, where the lanes are, where the possible places you can park are, etc. Okay, that gives you very small coverage, because this is at the vehicle level and it is very expensive. As an academic, I asked, "Can we actually do this by spending zero dollars?"

With that in mind, we figured you can use aerial images or satellite images. Satellites pass around the earth twice a day, so you have this up-to-date view of the world. And we created methods that can automatically extract HD maps of the form that you see on the top, where you have lanes, parking spots, sidewalks, etc. It takes only 3 seconds on a single computer to estimate this, per kilometer of road, automatically. Basically, with a very small cluster of computers, you can run the whole world and have up-to-date estimates.
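Conceptually, such a pipeline can be thought of as per-pixel scoring of an aerial tile followed by vectorization into map geometry. Below is a hedged toy sketch of only the vectorization step, with a plain array standing in for the learned model's output; the function name and threshold are illustrative, not the actual method.

```python
import numpy as np
from skimage import measure

def tile_to_lane_polylines(tile_scores, threshold=0.5):
    """Vectorize a per-pixel 'road/lane' score map into polylines.

    In a real pipeline `tile_scores` would come from a model run on
    satellite imagery; find_contours traces iso-lines at the given
    threshold, yielding vector geometry storable as an HD-map layer.
    """
    return measure.find_contours(tile_scores, threshold)

# Toy score map with one road-shaped blob.
scores = np.zeros((20, 20))
scores[8:12, 2:18] = 1.0
polys = tile_to_lane_polylines(scores)
print(len(polys), "polyline(s); first has", len(polys[0]), "points")
```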

© Robert Wright/LDV Vision Summit 2017


Five and a half years ago, I created KITTI. And one thing that's bugged me about mapping is that it is only the players, the companies, that are actually working on this. So, I created TorontoCity. This is about to go online soon. The Greater Toronto Area is twenty percent of the population of Canada; it's huge, and we have all these different views: panoramas, LiDAR, cameras from the aerial views, drones, etc.

Now, as an academic, I cannot pay labelers to label [the images]. Just the aerial images would cost between twenty and thirty million dollars to label. What I did was go to the government and pull together all this information from maps that the government has captured, 3D maps of the city, every single building, etc. And then, basically, we developed algorithms that can align the sources of information, including all the different sources of imagery as well as the maps, and automatically create ground truth. And here you see the quality of the ground truth is really, really good. Now we have ground truth for the whole Greater Toronto Area, and we're going to put the benchmark online, where these are the tasks that you can participate in, for instance, semantic segmentation.

One more thing that we have built since then is ways to extract these maps automatically. You can do this from aerial images, and one of the interesting findings is that from the panoramas you can actually get centimeter-accurate maps automatically. That was actually quite interesting. Alright, to conclude: for the last seven years, my group has been working on ways to make affordable self-driving cars that scale, with advances in sensing and perception, localization, and mapping. Thank you.