LLM vs. LMM (Acronyms Are Fun)

Document processing with GPT-4V. The model’s mistake is highlighted in red. From https://huyenchip.com/2023/10/10/multimodal.html?utm_source=tldrai

I just ran across a use of “multimodal” that has nothing to do with fingers, faces, or irises. But it has everything to do with generative AI.

Earlier this week, Chip Huyen published "Multimodality and Large Multimodal Models (LMMs)" on her website, huyenchip.com. She starts as follows:

For a long time, each ML (machine learning) model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition).

However, natural intelligence is not limited to just a single modality. Humans can read and write text. We can see images and watch videos. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world.

From https://huyenchip.com/2023/10/10/multimodal.html?utm_source=tldrai

As you can see from the title, Huyen uses the acronym "LMM," which looks a lot like the more familiar generative AI acronym "LLM" (large language model).

So what’s the difference?

Not all multimodal systems are LMMs. For example, text-to-image models like Midjourney, Stable Diffusion, and Dall-E are multimodal but don’t have a language model component.

From https://huyenchip.com/2023/10/10/multimodal.html?utm_source=tldrai
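To make that distinction a little more concrete, here is a rough Python sketch of the two kinds of calls, using the OpenAI Python client as an assumed example; the model names, the receipt question, and the image URL are placeholders of my own, not anything from Huyen's post. The first call is the LMM case (image plus text in, text out, with a language model writing the answer); the second is multimodal but not an LMM (text in, image out, no language model component producing the result).

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# LMM-style call: the prompt mixes text and an image, and a language
# model generates the textual answer.
lmm_response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the total on this receipt?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/receipt.png"}},
        ],
    }],
)
print(lmm_response.choices[0].message.content)

# Multimodal but not an LMM: text in, image out. No language model
# component is involved in generating the output.
image_response = client.images.generate(
    model="dall-e-3",  # placeholder text-to-image model
    prompt="A watercolor painting of a lighthouse at dusk",
)
print(image_response.data[0].url)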

If you’re interested in delving into the topic, Huyen’s long three-part post covers the context for multimodality, the fundamentals of a multimodal system, and active research areas.

You can find the post at https://huyenchip.com/2023/10/10/multimodal.html?utm_source=tldrai. And I guess you can figure out how I came across it.