Viorica Patraucean is a Research Scientist at Google DeepMind. At our Vision Summit 2018 she enlightened us with her recent work on massively parallel video nets and how it’s especially relevant for real-world low-latency/low-power applications. Previously she worked on 3D shape processing in the Machine Intelligence group of the Engineering Department in Cambridge, after completing a PhD in image processing at ENSEEIHT–INP Toulouse. Irrespective of the modality - image, 3D shape, or video - her goal has always been the same: design a system that comes closer to human perception capabilities.
Like most of you here, I'm interested in making machines see the world the way we see it. When I say machines, I'm thinking of autonomous cars or robots or systems for augmented reality. These are all very different applications, of course, but, in many cases, they have one thing in common: they require low-latency processing of the visual input. In our work, we use deep artificial neural networks, which consist of a series of layers. We feed in an image, this image is processed by each layer of the network, and then we obtain a prediction. Assume that this is an object detector and there's a cat there. We care about cats and all.
Just to make it clear what I mean by latency – I mean the time that passes between the moment when we feed in the image and the moment when we get the prediction. Here, obviously, the latency is just the sum of the computational times of all the layers in the network.
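As a minimal sketch of that definition, the snippet below adds up per-layer compute times to get the latency of one frame. The layer times are hypothetical values chosen only for illustration; they are not measurements from the talk.

```python
# Hypothetical per-layer compute times in milliseconds (illustrative only).
layer_times_ms = [40.0, 50.0, 60.0, 50.0]

def sequential_latency_ms(times):
    """Latency when a frame must pass through every layer in turn."""
    return sum(times)

def frames_per_second(latency_ms):
    """Throughput if a new frame can only start once the previous one is done."""
    return 1000.0 / latency_ms

latency = sequential_latency_ms(layer_times_ms)
print(f"latency: {latency} ms, rate: {frames_per_second(latency)} fps")
```

With these assumed numbers, every extra layer adds its own compute time directly to the total.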
Now, it is common practice that, if we are not quite happy with the accuracy of the system, we can make the system deeper by adding more layers. Because this increases the capacity of the network, the expressivity of the network, we get better accuracy. But this doesn't come for free, of course. This will lead to increasing the processing time and the overall latency of the system. Current object detectors run at around five frames per second, which is great, of course, but what does five frames per second mean in the real world?
I hope you can see the difference between the two videos here. On the left, you see the normal video at 25 frames per second and, on the right, you see the five frames per second video obtained by keeping every fifth frame in the video. I can tell you, on the right, the tennis ball appears in two frames, so, if your detector is not perfect, it might fail to detect it. Then you're left to play tennis without seeing the ball, which is probably not ideal.
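The subsampling described here can be sketched in a few lines. The frame indices at which the ball is visible are an assumption made up for the example, picked so that the ball survives in two of the kept frames as in the video.

```python
# One second of video at 25 frames per second, represented by frame indices.
frames = list(range(25))

# Keep every fifth frame to simulate a detector running at 5 frames per second.
kept = frames[::5]

# Hypothetical: the ball is visible in ten consecutive frames (indices 3 to 12).
ball_visible = set(range(3, 13))
ball_kept = [f for f in kept if f in ball_visible]

print(kept)       # frames the slow detector actually sees
print(ball_kept)  # kept frames in which the ball is still visible
```

Under these assumptions the detector gets only two chances to find the ball, instead of ten.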
The question then becomes, how can we do autonomous driving at five frames per second, for example? One answer could be like this, turtle power. We all move at turtle speed, but probably that's not what we are after, so then we need to get some speed from somewhere.
One option, of course, is to rely on hardware. Hardware has been getting faster and faster in the past decade. However, faster hardware normally comes with higher energy consumption and, without a doubt, on embedded devices, this is a critical constraint. So, what would be more sustainable alternatives to make our models faster?
Let's look at what the brain does to process a visual input. There are lots of numbers there. Don't worry. I'll walk you through them. I'm just giving a list of comparisons between a generic artificial neural network and the human brain. Let's start by looking at the number of basic units, which in the brain are called neurons, and their connections, which are called synapses.
Here, the brain is clearly superior to any model that we have so far by several orders of magnitude, and this could explain, for example, the fact that the brain is able to process so many things in parallel and to achieve such high accuracy. However, when we look at speed of basic operation, here we can see that actually our electronic devices are much faster than the brain, and the same goes for precision of computation. Here, again, the electronic devices are much more precise.
However, as I said, speed and precision of computation normally come with high power consumption, to the point where a current GPU will consume about 10 times more than the entire human brain. Yet, with all these advantages on the side of the electronic devices, we are still running at five frames per second when the human brain can actually run at more. The human brain can actually process more than 100 frames per second, so this points to the fact that our current systems behave suboptimally.
I'm going to argue here that the reason for this suboptimal behavior comes from the fact that current systems consider the video as a collection of independent frames. And, actually, this is no wonder since the current video systems were initially designed as image models and then we just run them repeatedly on the frames of a video. Running them in this way means that the processing is completely sequential. Except for the processing that happens on the GPU, where we can parallelize things, overall it still remains sequential. And then all the layers in the network work at the same pace, and this is the opposite of what the brain does.
There is strong evidence that the brain actually exhibits a massively parallel processing mode and also that the neurons fire at different rates. All this because the brain rightfully considers the visual stream as a continuous stream that exhibits high correlations and redundancy across time.
Just to go back to the initial sketch, this is how our current systems work. You get an image. This goes through every layer in the network. You get a prediction. The next image comes in. It goes again through every layer and so on. What you should observe is that, at any point in time, only one of these layers is working and all the others are just waiting around for their turn to come.
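A toy schedule makes the idle time concrete. This is only a sketch under a simplifying assumption that every layer takes exactly one time step; the function name and the frame and layer counts are made up for the example.

```python
def sequential_schedule(num_frames, num_layers):
    """For each time step, list the (frame, layer) pairs being computed.

    Assumes every layer takes exactly one time step (a simplification).
    """
    schedule = []
    for frame in range(num_frames):
        for layer in range(num_layers):
            # Only one layer is active; all the others wait for their turn.
            schedule.append([(frame, layer)])
    return schedule

num_layers = 4
schedule = sequential_schedule(num_frames=3, num_layers=num_layers)
busy_per_step = [len(step) for step in schedule]

# Fraction of layer time slots that are actually used.
utilization = sum(busy_per_step) / (len(schedule) * num_layers)
print(max(busy_per_step), utilization)
```

With four layers, each layer is busy for only a quarter of the time in this simplified model.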
This is clearly not useful. It's just wasting resources, and the other thing is that everybody works at the same pace, and, again, this is not needed if we take into account, for example, the slowness principle, and I'm just trying to depict here what that means. This principle informally states that fast-varying observations are explained by slow-varying factors.
If you look at the top of the figure on the left - those are the frames of a video depicting a monkey. If you look at the pixel values in the pixel space, you will see high variations because of some light changes or the camera moves a bit or maybe the monkey moves a bit. However, if we look at more abstract features of the scene, for example, the identity of the object or the position of the object, these will change much more slowly.
Now, how is this relevant for artificial neural networks? It is quite well understood now that deeper layers in an artificial neural network extract more and more abstract features, so, if we agree with the slowness principle, then it means that the deeper layers can work at a slower pace than the layers close to the input of the network.
Now, if we put all these observations together, we obtain something like this. We obtain like a Christmas tree, as shown, where all the layers work all the time, but they work at different rates, so we are pipelining operations, and this generates more parallel processing. We can now update our layers at different rates.
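The idea of layers that all run in parallel but update at different rates can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the architecture from the talk: the function `run_pipelined`, the three stand-in "layers", and the update periods are all hypothetical, and each layer simply reads whatever its predecessor produced on the previous tick.

```python
def run_pipelined(frames, layers, periods):
    """Toy pipelined network with per-layer update periods.

    On every tick all layers step "in parallel": layer i reads the output
    that layer i-1 held at the end of the previous tick. Layer i recomputes
    only every periods[i] ticks and otherwise keeps its last output.
    """
    outputs = [None] * len(layers)  # latest output held by each layer
    predictions = []
    for tick, frame in enumerate(frames):
        previous = list(outputs)  # snapshot of the state after the last tick
        for i, layer in enumerate(layers):
            layer_input = frame if i == 0 else previous[i - 1]
            if layer_input is not None and tick % periods[i] == 0:
                outputs[i] = layer(layer_input)
        predictions.append(outputs[-1])
    return predictions

# Stand-in layers: simple arithmetic instead of real network layers.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

# The deepest layer updates half as often as the first two.
predictions = run_pipelined(frames=list(range(6)), layers=layers, periods=[1, 1, 2])
print(predictions)
```

In this toy version the first prediction appears only after the pipeline has filled, and the deepest layer refreshes its output every other tick while the earlier layers see every frame.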
Initially, I said that the latency of a network is given by the sum of the computation times of all the layers in the network. Now, very importantly, with our design, the latency is given by the slowest of the layers in the network. In practice, we obtain up to four times faster response. I know it's not the 10 times, but four is actually enough because, in perception, once you are past 16 frames per second, then you are quite fine, I think.
We obtain this faster response with 50% less computation, so I think this is not negligible and, again, very important, we can now make our networks even deeper to improve their accuracy without affecting the latency of the network.
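The two latency rules, sum of layer times versus slowest layer, can be compared directly. This is a back-of-the-envelope sketch: the layer times are hypothetical, and it follows the talk's simplified statement that the pipelined latency equals the slowest layer's time.

```python
def sequential_latency_ms(times):
    """Frame-by-frame processing: latency is the sum of all layer times."""
    return sum(times)

def pipelined_latency_ms(times):
    """Pipelined processing as described above: the slowest layer sets the pace."""
    return max(times)

# Hypothetical per-layer compute times in milliseconds.
base = [40.0, 50.0, 60.0, 50.0]
# A deeper network: two extra hypothetical layers, neither of them the slowest.
deeper = base + [45.0, 55.0]

speedup = sequential_latency_ms(base) / pipelined_latency_ms(base)

for name, times in [("base", base), ("deeper", deeper)]:
    print(name, sequential_latency_ms(times), pipelined_latency_ms(times))
print(f"speedup on base network: {speedup:.2f}x")
```

Under these assumptions, adding layers raises the sequential latency but leaves the pipelined latency unchanged, as long as no new layer is slower than the current slowest one.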
I hope I convinced you that this is a more sustainable way of creating low-latency video models, and I'm looking forward to the day when our models will be able to process everything that the camera can provide. I'm just showing here a beautiful video captured at 1,000 frames per second. I think this is the future.
Watch Viorica Patraucean’s keynote at our LDV Vision Summit 2018 below and check out other keynotes on our videos page.