Computers Still Have a Long Way to Go on Visual Reasoning According to Larry Zitnick of Facebook

Larry Zitnick, AI Research, Research Lead, Facebook

Join us at the next annual LDV Vision Summit. This is a transcript of the keynote by Larry Zitnick, AI Research Lead at Facebook, “A Visual Stepping Stone to Artificial Intelligence” from our 2016 LDV Vision Summit.

Larry Zitnick got his PhD at CMU and after that he went to Microsoft Research, where he established an excellent track record in object recognition and other parts of computer vision. Now he's at Facebook and, again, he's doing world-class research. He's the leader of a very influential project called COCO, which stands for Common Objects in Context, and he also works at the intersection of images and language - which is an exciting area involving things like visual question answering.

At Facebook AI Research, what we try to do is advance the state of the art in AI in collaboration with the rest of the community. Given that my background is in computer vision, I find myself thinking a lot, what is the role of computer vision in AI? That's what I want to talk about today.

Imagine that you could go back to 1984 and you could find yourself a graduate student and you said “here, read this paper. It's got a really cool title, it's called Neocognitron and if you want to solve recognition, all you need to do is three things. You need to figure out how to learn weights, you need to collect a huge amount of data, and you need to find a really fast computer and you would solve recognition.”

Now graduate students being graduate students, they would go in and they'd look at the weights part and they'd look at algorithms and they'd say “I would want to solve the algorithm.” That's exactly what they did.

They went and solved the algorithm. They developed Backprop. Now, graduate students being graduate students, say, “now all we have to do is collect more data.”

That took a lot longer unfortunately. That took maybe another 30 years to finally collect enough data to then do the learning.

Now we're in 2016 and we find ourselves asking the question, how are we going to solve AI? Which direction do we need to go in to solve AI? Well, the answer is obvious. All we need is more data, more compute and apply it to Backprop. This is exactly what we've done the last few years. We took a problem which is seemingly AI complete, image captioning, and we basically took tons of data, tons of compute and we ran it on these images and we got some really amazing results.

A man riding a wave on a surfboard in the water. Great image, great caption. A giraffe standing in the grass next to a tree. Again, fantastic caption for this image and I think a lot of people were really excited about this. Then after a little bit of introspection, we began to realize, this doesn't work if you don't have similar images in the dataset. If the images are too unique, suddenly the algorithm starts falling apart.

How many people have read the paper, Building Machines that Learn and Think Like People? It is a paper from NYU, MIT, and Harvard. It's a great read. If you haven't read it yet, please read it. They took this state of the art image caption generator and they ran these images through it and they got “a man riding a motorcycle on the beach.” Yeah, kind of correct, but kind of missed the point altogether. You see this over and over again. If the test image is from the head, you nail it. If it's from the tail, it's a little bit unique and it completely falls apart. More data is not the solution.

Then as computer vision people you might think to yourself, “we want to solve AI, which direction should we push in?” Let's just make our recognition problems harder. What's something really hard that we can try to recognize? Mirrors. You have a mirror like in this image here, we nail it, we can do a really good job. Right? What about this image? Can you detect the mirror in this image? In order to do this, you have to have a much deeper understanding about the world. You need to understand how selfies are actually taken. Some of the older people here might not get it.

Unfortunately finding really difficult images like this is really hard. It's already hard enough to create datasets, so I don't think this is the right direction either. If we want to solve AI, which direction should we go in? That's what I want to talk about. There's two things we need to do.

The first thing is learning. There's many different types of learning. There's the very friendly nice type of learning, which is supervised. Where you get data, it's complete. It doesn't have any noise in it, it's fantastic. It's our favorite friend.

You have semi-supervised learning which is a little lazy, let's say, where you don't always get the data that you want. You have reinforcement learning which is always trying to give you money or give you rewards for doing the right thing. Then you have unsupervised learning which is really, really annoying. There is a huge amount of unsupervised learning, we have a huge amount of data but we don't have any labels for it.

Supervised learning. This is our bread and butter. This is why we've had the advances we've had so far, because of huge supervised datasets like ImageNet and more recently COCO. But creating these datasets is incredibly difficult and frustrating. Just ask anybody here who's tried to do this. Ask the graduate students. It's really hard to get graduate students to work on problems like this.

Semi-supervised learning. Let me give you an example of semi-supervised learning and why this is tough. If you want to learn a concept such as yellow and you have a bunch of image captions, you can identify images which have yellow in them and the caption actually mentions yellow. But there are other times where almost the entire image is yellow yet the caption doesn't actually mention the fact that there is any yellow in the image.

Now you can learn really cool things using data like this. You can learn whether people actually say a certain concept is present in an image or not. We can learn a classifier which says ‘hey, when there's a fence in the background of a soccer game, nobody ever mentions a fence.’ Whereas if there's a fence that is blocking a bear from eating you, somebody is going to be mentioning that fence.

Reinforcement learning. Now reinforcement learning in computer vision is kind of a weird mismatch right now because in reinforcement learning, generally, you have some sort of interactive environment and it's hard to do that with the static datasets that we're used to. What you find is a lot of reinforcement learning is being done with gaming type interfaces or with interactive interfaces. I think it's still a really exciting area to be looking into.

Then you have unsupervised learning. This is kind of the elephant in the room because there's a huge amount of data. If we can figure out a way to learn features using this unsupervised data, we could do amazing things. People have been trying to propose all sorts of different tasks they can do. And they see what type of visual features they can learn from doing these sort of tasks. It has worked kind of okay but still not as good as supervision.

The next thing I want to talk about is reasoning, and specifically visual reasoning but not about the task of reasoning itself. What I want to give you is a sense for how difficult this problem is, and where we are as a community in solving reasoning.

Very recently, there's a paper that proposed the following task. You're given three statements. Mary went into the hallway. John moved to the bathroom. Mary traveled to the kitchen. You have to answer a very simple question, where is Mary? Computers have a really hard time answering that question. Let's let that sink in.

This is trivial. Really trivial, yet computers cannot do it because they can't understand what these statements are actually saying - and people are worried about AI taking over the world.
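To make the task above concrete, here is a minimal hand-written sketch in Python. It is not a learned model and not the method from the paper mentioned in the talk: it only answers the question because the sentence pattern and the list of movement verbs are hard-coded by hand, which is exactly the understanding a learning system would have to acquire from examples. The names `MOVE_PATTERN` and `track_locations` are hypothetical and chosen only for illustration.

```python
import re

# Hand-picked pattern covering only the three example sentences.
# A learned system would have to discover verbs and structure on its own.
MOVE_PATTERN = re.compile(
    r"^(\w+) (?:went|moved|traveled) (?:into|to) the (\w+)\.?$"
)


def track_locations(statements):
    """Return a dict mapping each person to the last place they moved to."""
    locations = {}
    for statement in statements:
        match = MOVE_PATTERN.match(statement.strip())
        if match:
            person, place = match.groups()
            # Later statements overwrite earlier ones for the same person.
            locations[person] = place
    return locations


story = [
    "Mary went into the hallway.",
    "John moved to the bathroom.",
    "Mary traveled to the kitchen.",
]

locations = track_locations(story)
print("Where is Mary?", locations["Mary"])
```

Rephrase a statement even slightly, for example "Mary headed for the kitchen", and this sketch silently ignores it. That brittleness is the gap between a rule written for one story and a system that understands what the statements are saying.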

If you're going to do reasoning, you need to be able to predict the world. You need to be able to see how things are going to progress into the future. Where are we right now? Right now, when we think about prediction, we're dealing with these sort of baby tasks where, for instance, we have a bunch of blocks that are stacked up on top of each other. All you have to do is predict, are those blocks going to fall over or not? It's incredibly simple. If they do fall over, which way are they going to fall? This is something that can be done by a baby. Yet this is the state of the art in the research right now. You think about more complex prediction tasks where we model human behavior, where we have to simulate driving down roads and that sort of thing. We still have a long way to go.
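For a sense of what the block-stacking question above is asking, here is a small Python sketch that answers it when the physics is handed over directly. It assumes an idealized 2D world of rigid blocks described by a horizontal center, a width, and a mass, each resting only on the block below; the function name `stack_outcome` and the block format are hypothetical. The research described in the talk has to make the same prediction from images, without being given these numbers, which is what makes it hard.

```python
def stack_outcome(blocks):
    """Predict whether an idealized 2D stack stays up.

    blocks: list of (center_x, width, mass) tuples, ordered bottom to top.
    Returns "stable", "falls left", or "falls right".
    """
    for i in range(len(blocks) - 1):
        lower_x, lower_w, _ = blocks[i]
        upper_x, upper_w, _ = blocks[i + 1]

        # Horizontal extent of the contact between block i and the block on it.
        left = max(lower_x - lower_w / 2, upper_x - upper_w / 2)
        right = min(lower_x + lower_w / 2, upper_x + upper_w / 2)

        # Center of mass of everything resting above block i.
        above = blocks[i + 1:]
        total_mass = sum(mass for _, _, mass in above)
        center_of_mass = sum(x * mass for x, _, mass in above) / total_mass

        # The stack tips if that center of mass is not over the contact area.
        if center_of_mass < left:
            return "falls left"
        if center_of_mass > right:
            return "falls right"
    return "stable"


print(stack_outcome([(0.0, 2.0, 1.0), (0.5, 2.0, 1.0), (1.0, 2.0, 1.0)]))
print(stack_outcome([(0.0, 2.0, 1.0), (0.8, 2.0, 1.0), (1.6, 2.0, 1.0)]))
```

The check walks up the stack and reports the first level at which the blocks above are no longer supported, so the answer covers both questions from the talk: whether the stack falls, and which way.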

Data. This is something that's interesting because when you think about these AI tasks that we're looking at, a lot of them are dealing with a higher level of reasoning. They're not looking at pixel level things. It doesn't matter if you start with real data or you start with more abstract data. There's been a lot more work in looking at abstract scenes. Cartoon scenes. Looking at Atari games. Looking at Minecraft. These are other areas where we can isolate the reasoning problem that we want to explore without having to worry about the messiness that is the recognition problem in the real world.

Finally, even if we've solved reasoning, how would we know that we solved it? We all know the problems with the Turing Test and how incredibly frustrating it is to measure intelligence based upon the Turing Test because there are all sorts of different ways of gaming it. One of the more recent things that we've proposed is to use visual question answering as a sort of Turing Test for reasoning and vision. What you do is you have an image, you have a question, and it combines the visual recognition - that is now beginning to work better - with reasoning. If you can do both of them well then you can do well on the VQA task. So far, what we've seen is that progress in this task hasn't been moving that quickly, and I think a lot of it is due to the fact that reasoning, unlike recognition, is not progressing that quickly.

Looking forward. Up until 2016 we've made incredible strides in recognition. I said before that recognition is solved, but recognition is not solved, there is still a lot more work to be done. Only compared to 1984, is it essentially solved. Now, if we actually want to solve AI, we need to turn. We can't just keep pushing on recognition. We can't keep thinking that AI is recognition. We need to start thinking of AI as AI and start solving these problems that have been ignored over the last thirty years.

If you look at reasoning in particular, we're just at the beginning stages of this and for me this is what's so exciting. There's still so much work to be done. High level, what's interesting is there's not a clear road map. We don't know how reasoning is going to be solved. We don't know how learning is going to be solved. We don't know how we're going to crack the unsupervised learning problem and because of this it's hard to give a time frame.

One thing I can guarantee is, as we explore AI, computer vision is going to be our playground for research in this area. Thank you.

The annual LDV Vision Summit will be occurring on May 24-25, 2017 at the SVA Theatre in New York, NY.