I use both text generators (sparingly) and image generators (less sparingly). But I've run into one image challenge that you've probably encountered also: bizarre misspellings.
This post includes an example, generated in Google Gemini from the following prompt:
Create a square image of a library bookshelf devoted to the works authored by Dave Barry.
Now in the ideal world, Gemini would research Barry's published titles, and the resulting image would include them (such as Dave Barry Slept Here, one of the greatest history books of all time, maybe or maybe not).
In the mediocre world, at least the book spines would include the words “Dave Barry.”
Gemini gave me nothing of the sort.

The bookshelf may as well contain books by Charles Dikkens, the well-known Dutch author.
Why can’t your image generator spell words properly?
It always mystified me that AI-generated images had so many weird words, to the point where I wondered whether the AI was specifically programmed to misspell.
It wasn’t…but it wasn’t programmed to spell either.
TechCrunch recently published an article whose title was so good you don't have to read the article itself: "Why is AI so bad at spelling? Because image generators aren't actually reading text."
It reminded me of something I had pretty much forgotten:
- When I use an AI-powered text generator, it has been trained to respond to my textual prompts and create text.
- When I use an AI-powered image generator, it has been trained to respond to my textual prompts and create images.
Two very different tasks, as noted by Asmelash Teka Hadgu, co-founder of Lesan and a fellow at the DAIR Institute.
“The diffusion models, the latest kind of algorithms used for image generation, are reconstructing a given input,” Hadgu told TechCrunch. “We can assume writings on an image are a very, very tiny part, so the image generator learns the patterns that cover more of these pixels.”
The algorithms are incentivized to recreate something that looks like their training data, but they don't natively know the rules that we take for granted — that “hello” is not spelled “heeelllooo,” and that human hands usually have five fingers.
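To see why lettering gets so little attention, here's a toy sketch (not a real diffusion model — the arrays, mask dimensions, and pixel-wise loss are all illustrative assumptions). A diffusion-style training objective scores every pixel equally, so the thin strip of pixels that forms a book-spine title contributes almost nothing to the overall loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a training image and a model reconstruction.
image = rng.random((512, 512))
reconstruction = rng.random((512, 512))

# Hypothetical mask marking the pixels occupied by book-spine lettering:
# a thin 10x160 strip out of a 512x512 image.
text_mask = np.zeros((512, 512), dtype=bool)
text_mask[200:210, 100:260] = True

# A pixel-wise squared-error loss weights every pixel the same.
per_pixel_loss = (image - reconstruction) ** 2

text_share = per_pixel_loss[text_mask].sum() / per_pixel_loss.sum()
print(f"text pixels: {text_mask.mean():.2%} of the image")
print(f"their share of the loss: {text_share:.2%}")
```

Under these made-up numbers, the lettering is well under 1% of the pixels, so getting it wrong barely moves the loss — which matches Hadgu's point that the model learns the patterns covering most of the pixels instead.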
So what’s the solution?
We need LMM image-text generators
The solution is something I’ve talked about before: large multimodal models. Permit me to repeat myself (it’s called repurposing) and quote from Chip Huyen again.
For a long time, each ML (machine learning) model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition).
However, natural intelligence is not limited to just a single modality. Humans can read and write text. We can see images and watch videos. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world.
So if we asked a multimodal generator to create an image of a library bookshelf with Dave Barry works, it could actually display book spines with Barry's real titles.
So why doesn’t my Google Gemini already provide this capability? It has a text generator and it has an image generator: why not provide both simultaneously?
Because that’s EXPENSIVE.
I don’t know whether Google’s Vertex AI provides the multimodal capabilities I seek, where text in images is spelled correctly.
And even with $300 in credits, I’m not going to spend the money to find out. See Vertex AI’s generative AI pricing here.