Computer Vision & Machine Learning Trends in 2025: from Autoregressive Token Prediction to Agentic AI, Physics Engines and More

With new and ever larger multimodal models being announced seemingly every week, it can be hard to keep up with the latest developments in computer vision and machine learning. Dr. Serge Belongie, a professor of Computer Science at the University of Copenhagen, the head of the Danish Pioneer Centre for Artificial Intelligence and LDV Capital Expert in Residence, provided a brief overview of some of the key concepts behind these advances at our 11th Annual LDV Vision Summit earlier this year.

Serge Belongie earned a B.S. in EE from Caltech (1995) and a Ph.D. in EECS from Berkeley (2000). He was a professor of Computer Science at UC San Diego (2001–2013) and co-founded Digital Persona (the first mass-market fingerprint ID device), Anchovi Labs (acquired by Dropbox), and Orpix (an image recognition framework). He was named to MIT Technology Review’s Innovators Under 35 in 2004 and has received multiple honors, including the Marr Prize Honorable Mention (2007), ICCV Helmholtz Prize (2015), NSF CAREER Award and Alfred P. Sloan Research Fellowship. Check out our interview, “The Difference Is – Now The Whole World Is Paying Attention To AI.”

Serge has played a key role in organizing our flagship event, the Annual LDV Vision Summit, since 2014.

Watch the video recording or read our lightly edited transcript below.


A quick look at what’s going on with state-of-the-art generative AI: what works and what doesn’t

Let’s start with the table stakes – what you need to get into the game with large models, whether they’re large language models or large vision models. One thing you’ll see across many applications is language models trained using a technique called masking. You’ve probably heard of BERT. In that case, what you're working with are tokens – typically words – and you train large neural networks by erasing some of those words and having the model learn to fill them back in.

The model’s "intelligence," such as it is, is based on its ability to predict missing tokens. In the context of images, instead of erasing words, you erase patches of pixels. You then train the model to reconstruct the missing information. The masking could involve scrambling, removing color, or deleting patches entirely – and the neural nets learn to complete the image.
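The masking procedure itself is simple to illustrate. Below is a minimal, hypothetical sketch of the corruption step used in masked language modeling: a fraction of tokens is replaced with a mask symbol, and the positions that were erased become the prediction targets. A real model like BERT does this over subword tokens and trains a large network to fill the blanks; here we only show the data side of the objective.

```python
import random

random.seed(0)

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace a fraction of tokens with a mask symbol,
    returning the corrupted sequence and the positions to predict."""
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # the model is trained to recover these
        else:
            corrupted.append(tok)
    return corrupted, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3)
print(corrupted)
print(targets)
```

For images the idea is the same, except the unit being erased is a patch of pixels rather than a token, and the reconstruction loss is computed over the hidden patches.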

This is the foundation of the modern generative AI revolution, including tools like ChatGPT. These models develop powerful internal representations that allow them to predict the next token in a sequence. When we type prompts into chatbots, we’re relying on these learned representations to generate human-like, coherent text.
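Next-token prediction can be sketched with a deliberately tiny stand-in for the learned representation: a table of bigram counts. This toy example is an assumption-heavy illustration, not how a transformer works internally, but the loop is the same: given the sequence so far, sample the most plausible continuation, append it, and repeat.

```python
import random

random.seed(1)

# Toy next-token predictor: bigram counts stand in for the learned
# representations a large language model would use.
corpus = "the cat sat on the mat the cat ran".split()
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def next_token(prev):
    """Sample the next token in proportion to how often it followed `prev`."""
    options = counts.get(prev)
    if not options:
        return None  # no continuation seen in the corpus
    toks, weights = zip(*options.items())
    return random.choices(toks, weights=weights)[0]

seq = ["the"]
for _ in range(4):
    tok = next_token(seq[-1])
    if tok is None:
        break
    seq.append(tok)
print(" ".join(seq))
```

Every generated pair of adjacent words is one the "model" has seen before: fluent locally, but with no guarantee the whole sequence is true or sensible, which is exactly where hallucination creeps in.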

But these systems are prone to hallucination: they can generate fluent output that is confidently wrong.

Another major trend involves diffusion models. If language models are focused on next-token prediction, diffusion models are focused on next-noise prediction.

Diffusion models use a different kind of masking. Instead of removing information, they add noise and then learn to reverse that process. Essentially, they figure out how to denoise an input to arrive at a desired output, such as an image or a scene, guided by text or image embeddings.

This approach has given rise to the photorealistic generative techniques we've seen recently. And in fact, many modern systems combine both next-token prediction and next-noise prediction to get the best of both worlds. These are powerful tools – but again, they're still prone to hallucinations or confabulations. And when that happens, we need ways to steer and control them.
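The forward half of that process, gradually adding noise, has a simple closed form. The sketch below is a minimal illustration with a made-up noise schedule: each element of a clean signal is blended with Gaussian noise, with the signal fraction shrinking as the step index grows. Training a real diffusion model means asking a network to predict the noise from the noisy sample; the reverse (denoising) network is omitted here.

```python
import math
import random

random.seed(0)

# Forward (noising) process of a diffusion model, per element:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, 1)
# A trained network predicts eps from (x_t, t); here we only show how the
# signal gives way to noise as t grows.

T = 10
betas = [0.02 * (t + 1) for t in range(T)]  # toy noise schedule (assumption)
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= (1.0 - b)
    alpha_bar.append(prod)

def noise(x0, t):
    """Sample x_t given the clean signal x_0 (a list of floats)."""
    a = alpha_bar[t]
    return [math.sqrt(a) * v + math.sqrt(1 - a) * random.gauss(0, 1) for v in x0]

x0 = [1.0, -1.0, 0.5, 0.0]
print("t=0:", noise(x0, 0))
print("t=9:", noise(x0, 9))
```

Because `alpha_bar` shrinks monotonically, early steps are mostly signal and late steps mostly noise; generation runs this process in reverse, denoising step by step under the guidance of a text or image embedding.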

That’s where retrieval augmentation comes in.

You’ve likely heard of this as RAG – Retrieval-Augmented Generation. In the context of text, this means having a large set of documents on hand – maybe saved locally or sourced live – and using those documents to influence the generated responses. For example, this enables references to real journal articles or accurate stylistic transfers in images, all through retrieval augmentation.

In this setup, the large language or vision model provides the fluency, while the retrieval layer provides grounding – anchoring outputs to real, verifiable information.
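The retrieval layer can be sketched in a few lines. The example below is a toy illustration with invented document snippets: documents are scored against the query with bag-of-words cosine similarity, and the best matches are prepended to the prompt so the generator answers from real text rather than from memory alone. Production systems use dense vector embeddings and approximate nearest-neighbor search instead of word counts, but the shape of the pipeline is the same.

```python
import math
from collections import Counter

# Toy retrieval step of RAG. The document contents are made up
# for illustration.
docs = [
    "diffusion models generate images by iterative denoising",
    "retrieval augmented generation grounds answers in documents",
    "the marr prize is awarded at iccv",
]

def bow(text):
    """Bag-of-words vector as a word-count dictionary."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]

query = "how does retrieval augmented generation work"
context = retrieve(query, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
print(prompt)
```

The assembled prompt is then handed to the language model, which supplies the fluency while the retrieved context supplies the grounding.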

This brings us to the current evolution: agentic AI.

This is where things get complex. When you combine a language model with retrieval augmentation, you begin to handle the variety of tasks needed in enterprise settings or research labs. But what becomes clear is that there’s no single solution – no “AGI” that does it all.

Instead, we rely on traditional software engineering tools – essentially building an app store of capabilities, each tailored to a specific need.
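The "app store of capabilities" idea can be sketched as a registry of specialized tools plus a dispatcher. In this hypothetical example the router matches keywords; in real agentic frameworks the language model itself decides which tool to invoke, but the software-engineering structure, many narrow tools behind one interface, is the point.

```python
# Minimal sketch of an agentic tool registry. The tools and keywords
# are invented for illustration; a real system would let the language
# model choose the tool and supply its arguments.

def search_documents(task):
    return f"[retrieval tool handling: {task}]"

def run_calculation(task):
    return f"[calculator tool handling: {task}]"

def generate_image(task):
    return f"[image tool handling: {task}]"

TOOLS = {
    ("find", "lookup", "search"): search_documents,
    ("compute", "sum", "calculate"): run_calculation,
    ("draw", "render", "image"): generate_image,
}

def route(task):
    """Dispatch a task to the first tool whose keywords it mentions."""
    words = task.lower().split()
    for keywords, tool in TOOLS.items():
        if any(w in keywords for w in words):
            return tool(task)
    return f"[no tool matched: {task}]"

print(route("search the archive for vision papers"))
print(route("calculate the quarterly totals"))
```

No single tool does everything; coverage comes from the breadth of the registry, which is exactly the argument against expecting one monolithic "AGI" model to handle every enterprise task.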

And finally, we arrive at embodied AI.

If we’re serious about planning and reasoning, then text and passive video won’t cut it. That’s where physics simulations come in – like those from the Genesis team, or the robotic simulations from NVIDIA that Jan discussed. These systems need to interact with environments; tokens, text and passive observation just aren’t enough. For real planning and real reasoning, we need highly realistic, physically accurate simulations.
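At the core of any such simulator is a step loop that integrates equations of motion. The bouncing-ball sketch below uses semi-implicit Euler integration on a single point mass; it is a deliberately minimal stand-in, while engines like Genesis or NVIDIA's simulators add rigid bodies, contact solvers, friction and far more on top of the same basic loop.

```python
# Point-mass bouncing ball with semi-implicit Euler integration.
# All constants are illustrative choices, not values from any engine.

GRAVITY = -9.81    # m/s^2
DT = 0.01          # timestep in seconds
RESTITUTION = 0.8  # fraction of speed kept after a bounce

def step(y, vy):
    """Advance the ball one timestep; bounce when it hits the ground."""
    vy += GRAVITY * DT   # update velocity first (semi-implicit Euler)
    y += vy * DT         # then position, using the new velocity
    if y < 0.0:          # ground contact
        y = 0.0
        vy = -vy * RESTITUTION
    return y, vy

y, vy = 1.0, 0.0         # drop from 1 m at rest
for _ in range(500):     # simulate 5 seconds
    y, vy = step(y, vy)
print(f"height after 5 s: {y:.3f} m")
```

An embodied agent trains against millions of such steps, observing states and choosing actions, which is why simulation fidelity, not just model scale, becomes the bottleneck for real planning and reasoning.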

Large multimodal models – language and vision – are incredibly powerful. They give these systems fluency and the appearance of intelligence, but they still require careful steering. That steering can come through retrieval augmentation, or through structured, agentic software frameworks. If we want to push toward robotic implementations, we’ll need to keep advancing the realism of our simulations.


LDV Capital is a thesis-driven early-stage venture fund investing in people building businesses powered by visual technology and artificial intelligence. We thrive on collaborating with deep tech teams that leverage computer vision, machine learning and artificial intelligence to analyze visual data. We have been investing in pre-inc, pre-seed and seed-stage teams across North America and Europe since 2012.

Explore our portfolio, including recent additions: Fieldstone Bio (synthetic biology), Dannce AI (neuroscience), and ResiQuant (insurance).

One vertical we’re particularly excited about is materials science, a field historically slowed by trial-and-error discovery. Today, AI and machine learning are accelerating breakthroughs, enabling faster, more creative development of molecules, chemicals, and materials. Generative AI now considers not just composition, but processing, form and real-world constraints. At our 11th Annual LDV Vision Summit, experts from Duke University, IBM, Mitra Chem and Argonne National Lab explored how visual tech and AI are reshaping materials discovery.

Let us know if you are either thinking of building or have already started building a startup leveraging visual technologies and AI in the materials discovery & development space.