If we look at what is typically going on in self-driving car companies, we find that they're pretty good at localization, path planning, and obstacle avoidance, but there are two things they do that actually make them not very scalable. The first is LIDAR: prices are dropping, but it is still quite expensive to buy a decent LIDAR. And the other thing, which is the skeleton in the closet, is actually mapping.
What I have been working on for the past seven years is how to make solutions that are scalable, meaning cheap sensors and trying to drive without maps, or with as little prior knowledge as possible.
Now, if we want to do something of this form, we need to think about many different things at once. The first thing that was difficult for us as academics was data, and so many years ago we created KITTI, which I guess is still the only benchmark for self-driving. And to my despair, this is still the only benchmark, which I don't understand.
The other thing that is important is learning. Right, one can't just handcraft everything, because we need to be robust to scenarios that we have never seen before. We need holistic models that reason about many things. At the end of the day, we have fixed computation for many things, many tasks, and we need to think about hardware at the same time.
If we want to get rid of the LIDAR and get rid of the maps, one of the things that we need to do is apply deep learning to get robust, good, and fast stereo 3D reconstruction. This can run in real time and, after forty meters, can basically almost replace the LIDAR.
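To make the link between stereo and range sensing concrete, here is a minimal sketch of the standard rectified-stereo relation, where depth is focal length times baseline divided by disparity. It is not the deep-learning method from the talk; the function names and the camera values (700 px focal length, 0.5 m baseline) are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters from the rectified-stereo relation Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth error for a disparity error: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical rig (illustrative values, not from the talk).
FOCAL_PX = 700.0
BASELINE_M = 0.5

for disparity in (70.0, 35.0, 8.75):
    z = depth_from_disparity(disparity, FOCAL_PX, BASELINE_M)
    err = depth_error(z, FOCAL_PX, BASELINE_M)
    print(f"disparity {disparity:6.2f} px -> depth {z:5.1f} m "
          f"(about {err:.2f} m per pixel of disparity error)")
```

Because the error term grows with the square of the depth, the quality of the disparity estimate matters most for distant objects, which is one reason a learned, accurate matcher is attractive here.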
Another thing that you need to do is work on perception. We spent the past year and a half obsessed with instance segmentation. The idea is that you have a single image and you are interested in labeling every pixel, not just with a category such as car or road, but you also want to estimate that this is one car, this is another car, and so on. This is a particularly difficult problem for deep learning because the loss function has to be agnostic to permutation. So we've built some interesting technology lately based on the watershed transform. It scales really well. It's independent of the number of objects, so you can run in real time for anything. And this is generalization: it's trained on one set of cities and tested on another set of cities. You see the prediction in the middle and the ground truth on the right. Okay, so even with crowded scenes [the model] can actually do pretty well.
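As a rough illustration of the watershed idea, the sketch below assumes that some model has already produced a per-pixel "energy" that is high inside an object and low near its boundary. Cutting that energy at one level and labeling the connected pieces then yields instances. The energy map, the `label_instances` helper, and the cut level are all hypothetical stand-ins, not the system described in the talk.

```python
from collections import deque


def label_instances(energy, category_mask, cut_level):
    """Label connected regions of in-category pixels with energy >= cut_level.

    energy: 2D list of floats (higher means deeper inside an object).
    category_mask: 2D list of bools (True where the pixel is, say, 'car').
    Returns (labels, count): labels is 0 for background and 1..count otherwise.
    """
    height, width = len(energy), len(energy[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if labels[r][c] or not category_mask[r][c] or energy[r][c] < cut_level:
                continue
            # New seed pixel: flood-fill its 4-connected region.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not labels[ny][nx]
                            and category_mask[ny][nx]
                            and energy[ny][nx] >= cut_level):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy example: two touching cars; the energy dips to 0 on their shared boundary.
energy = [[1, 2, 2, 0, 2, 2, 1],
          [1, 2, 2, 0, 2, 2, 1]]
mask = [[True] * 7 for _ in range(2)]
labels, count = label_instances(energy, mask, cut_level=1)
print(count)
for row in labels:
    print(row)
```

Each pixel is visited a bounded number of times, so the cost of this step depends on the image size and not on how many objects are present, which is consistent with the scaling property mentioned above.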
Now, if you want to do self-driving, labeling pixels is not going to get you there. Right, so you really need to estimate what's happening everywhere in the scene. These are our latest, greatest results in detection and tracking. This is actually very technically interesting: you can backpropagate through solvers. And here, you see the results of what we have as well.
In general, what you want to do is estimate everything that is in the scene. So here, we have some results that we had even a couple of years ago, with a single camera mounted on top of the car. The car is driving through intersections it has never seen before and is able to estimate the local map of the intersection. It is creating the map on the fly. It is estimating where your car is, doing localization, as well as estimating where every car is in the scene. It also estimates the traffic situation that you see on the bottom left, even though it doesn't see traffic lights or things like that. The cars are color-coded by their intentions. Basically, here we are estimating where everybody is going in the next couple of seconds. And this is, as I said, [with a] single camera [and] new scenarios that we haven't trained on.
Another thing that you need to do is localization. Localization is an interesting problem, because typically the way it's done is the same way as with mapping. You go around and you collect how the world looks, and that's really expensive, meaning that basically you need to know the appearance of the world that [the cars] are in at every point in time.
We look at a cartographic map of the environment and the motion of the vehicle to estimate really quickly where the vehicle is in the global coordinate system. Okay, so you see here that you have a probability distribution over the graph of the road. The vehicle is driving, you have a few modes of the distribution, and very quickly we know exactly where this vehicle is.
This is a Manhattan-like scenario; there are two modes of the distribution, but again, soon we end up with only a single location. And this is for the whole city of Karlsruhe, which has two thousand kilometers of road. It takes thirty-five seconds of driving to actually localize with a precision of 2 meters, which is the precision of the maps that we use. These maps are available for free online for sixty percent of the world. So you can just download them; you don't need to capture anything. It's free.
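The way a distribution over the road graph collapses from several modes to one can be sketched with a small discrete Bayes filter. This is a simplified illustration, not the method from the talk: the toy graph, the `localize` function, the turn labels, and the match and mismatch probabilities are all assumptions made up for the example.

```python
def localize(road_graph, observed_turns, p_match=0.9, p_mismatch=0.05):
    """Discrete Bayes filter over road segments.

    road_graph: dict mapping segment -> list of (next_segment, turn) edges,
                where turn is 'straight', 'left', or 'right'.
    observed_turns: turns measured from the vehicle's own motion, in order.
    Returns a dict mapping segment -> probability after all observations.
    """
    belief = {node: 1.0 / len(road_graph) for node in road_graph}  # uniform prior
    for turn in observed_turns:
        updated = {node: 0.0 for node in road_graph}
        for node, edges in road_graph.items():
            for next_node, edge_turn in edges:
                likelihood = p_match if edge_turn == turn else p_mismatch
                # Uniform transition prior over outgoing edges, weighted by
                # how well the edge explains the observed motion.
                updated[next_node] += belief[node] * likelihood / len(edges)
        total = sum(updated.values())
        belief = {node: mass / total for node, mass in updated.items()}
    return belief


# Hypothetical toy road graph with repeated structure (Manhattan-like ambiguity).
road_graph = {
    "A": [("B", "straight")],
    "B": [("C", "left"), ("F", "right")],
    "C": [("D", "straight")],
    "D": [("E", "right")],
    "E": [("A", "left")],
    "F": [("G", "straight")],
    "G": [("A", "left")],
}

# The vehicle actually drives A -> B -> C -> D -> E -> A.
turns = ["straight", "left", "straight", "right", "left"]
for steps in range(1, len(turns) + 1):
    belief = localize(road_graph, turns[:steps])
    top = sorted(belief.items(), key=lambda item: -item[1])[:3]
    print(steps, [(node, round(p, 2)) for node, p in top])
```

With this toy graph, the first few observations leave more than one plausible segment, and the later turns rule out the alternatives, which mirrors the multi-modal behavior described above on a much smaller scale.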
Now, in terms of mapping, right, why do car companies, or self-driving car players, use maps? You can think of a map as a sensor, which basically tells you the static part of the scene. It gives you robustness and it allows you to only look at the dynamic objects.
The problem with the way the mapping is done is that you have, say, one of these cars with these expensive sensors, and basically you drive around the world, you have your data, and then there is some labeling process where you basically say where the roads are, where the lanes are, where the possible places you can park are, etc. Okay, that gives you very small coverage, because this is at the vehicle level and is very expensive. As an academic, I ask, "Can we actually do this by spending zero dollars?"
In those terms, we figured you can use aerial images or satellite images. Satellites pass around the earth twice a day, so you have this up-to-date view of the world. And we created methods that can automatically extract HD maps of the form that you see on the top, where you have lanes, parking spots, sidewalks, etc. Yes, automatically: it takes only 3 seconds on a single computer to estimate this per kilometer of road. Basically, with a very small cluster of computers, you can run the whole world and have up-to-date estimates.