Picture This: Images Created Using Natural Language Processing

Stephen DeAngelis

January 12, 2021

The phrase “a picture is worth a thousand words” sounds like something William Shakespeare might have written; however, according to Gary Martin (@aphraseaweek), the idiom can be traced to more modern sources.[1] He writes:

“This phrase emerged in the USA in the early part of the 20th century. Its introduction is widely attributed to Frederick R. Barnard, who published a piece commending the effectiveness of graphics in advertising with the title ‘One look is worth a thousand words’, in Printer’s Ink, December 1921. Barnard claimed the phrase’s source to be oriental by adding ‘so said a famous Japanese philosopher, and he was right’. Printer’s Ink printed another form of the phrase in March 1927, this time suggesting a Chinese origin: ‘Chinese proverb. One picture is worth ten thousand words.’ The arbitrary escalation from ‘one thousand’ to ‘ten thousand’ and the switching from Japan to China as the source leads us to smell a rat with this derivation. In fact, Barnard didn’t introduce the phrase – his only contribution was the incorrect suggestion that the country of origin was Japan or China. This has led to another popular belief about the phrase, that is, that it was coined by Confucius. It might fit the Chinese-sounding ‘Confucius he say’ style, but the Chinese derivation was pure invention. … Who it was that married ‘worth ten thousand words’ with ‘picture’ isn’t known, but we do know that the phrase is American in origin. It began to be used quite frequently in the US press from around the 1920s onward. The earliest example I can find is from the text of an instructional talk given by the newspaper editor Arthur Brisbane to the Syracuse Advertising Men’s Club, in March 1911: ‘Use a picture. It’s worth a thousand words.’”


The origin of the phrase is not as important as the message it conveys. Wikipedia notes, “‘A picture is worth a thousand words’ is an English language adage meaning that complex and sometimes multiple ideas can be conveyed by a single still image, which conveys its meaning or essence more effectively than a mere verbal description.”[2] A new program introduced by OpenAI, called DALL·E, takes the adage to heart and creates images from words.




Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, and Scott Gray, from OpenAI, write, “DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.”[3] If you are unfamiliar with GPT-3, Wikipedia describes it this way:


“Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that uses deep learning to produce human-like text. It is the third-generation language prediction model in the GPT-n series (and the successor to GPT-2) created by OpenAI, a San Francisco-based artificial intelligence research laboratory. GPT-3’s full version has a capacity of 175 billion machine learning parameters. GPT-3, which was introduced in May 2020, is part of a trend in natural language processing (NLP) systems of pre-trained language representations. Before the release of GPT-3, the largest language model was Microsoft’s Turing NLG, introduced in February 2020, with a capacity of 17 billion parameters or less than 10 percent compared to GPT-3. The quality of the text generated by GPT-3 is so high that it is difficult to distinguish from that written by a human, which has both benefits and risks. Thirty-one OpenAI researchers and engineers presented the original May 28, 2020 paper introducing GPT-3. In their paper, they warned of GPT-3’s potential dangers and called for research to mitigate risk. David Chalmers, an Australian philosopher, described GPT-3 as ‘one of the most interesting and important AI systems ever produced.’”[4]


Ramesh and his colleagues explain, “We decided to name our model using a portmanteau of the artist Salvador Dalí and Pixar’s WALL·E.”


What can DALL·E do?


Ramesh and his colleagues observe, “Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another.” They go on to explain, “A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.” OpenAI pairs DALL·E with another of its programs, CLIP (Contrastive Language-Image Pre-training), which it uses to rank the generated images by how well they match the caption. Ramesh and his colleagues explain, “DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language.”
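To make the token arithmetic concrete, here is a minimal Python sketch of how a caption and an image could be packed into one stream. The four sizes come from the OpenAI description quoted above. Everything else is an assumption made for illustration: the function name build_stream is hypothetical, and shifting image token ids past the text vocabulary is just one simple way to keep the two vocabularies from colliding, not necessarily what OpenAI does.

```python
# Sizes quoted in the OpenAI description of DALL·E.
MAX_TEXT_TOKENS = 256      # caption: at most 256 BPE tokens
TEXT_VOCAB_SIZE = 16384    # text vocabulary size
IMAGE_TOKENS = 1024        # image: exactly 1024 tokens
IMAGE_VOCAB_SIZE = 8192    # image vocabulary size

# The "up to 1280 tokens" figure is simply the sum of the two parts.
MAX_STREAM_TOKENS = MAX_TEXT_TOKENS + IMAGE_TOKENS


def build_stream(text_ids, image_ids):
    """Concatenate caption and image tokens into a single sequence.

    Hypothetical helper: image ids are offset by TEXT_VOCAB_SIZE so
    that both kinds of token share one id space (an assumption).
    """
    if len(text_ids) > MAX_TEXT_TOKENS:
        raise ValueError("caption is longer than 256 tokens")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("image must be exactly 1024 tokens")
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text token id outside the text vocabulary")
    if any(not 0 <= t < IMAGE_VOCAB_SIZE for t in image_ids):
        raise ValueError("image token id outside the image vocabulary")
    return list(text_ids) + [TEXT_VOCAB_SIZE + t for t in image_ids]


# Made-up token ids, for illustration only.
caption = [17, 4242, 901]
image = [i % IMAGE_VOCAB_SIZE for i in range(IMAGE_TOKENS)]
stream = build_stream(caption, image)
print(len(stream), "of at most", MAX_STREAM_TOKENS, "tokens")
```

A model trained “to generate all of the tokens, one after another” would then learn to predict each position of such a stream from the positions before it, which is why a caption alone can be used as a prompt for the image tokens that follow.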


Will Douglas Heaven (@strwbilly), senior editor for AI at MIT Technology Review, writes, “For all GPT-3’s flair, its output can feel untethered from reality, as if it doesn’t know what it’s talking about. That’s because it doesn’t. By grounding text in images, researchers at OpenAI and elsewhere are trying to give language models a better grasp of the everyday concepts that humans use to make sense of things. DALL·E and CLIP come at this problem from different directions. At first glance, CLIP (Contrastive Language-Image Pre-training) is yet another image recognition system. Except that it has learned to recognize images not from labeled examples in curated data sets, as most existing models do, but from images and their captions taken from the internet. It learns what’s in an image from a description rather than a one-word label such as ‘cat’ or ‘banana.’ … The results are striking, though still a mixed bag.”[5] Examples of the images created by DALL·E can be found in the OpenAI article.[3]


Heaven was particularly enthralled with the images DALL·E created from the caption “an armchair in the shape of an avocado.” He notes, “The armchairs in particular all look like chairs and avocados.” Ramesh told him, “The thing that surprised me the most is that the model can take two unrelated concepts and put them together in a way that results in something kind of functional.” As amazing as this new program is, Heaven notes, “DALL·E already shows signs of strain. Including too many objects in a caption stretches its ability to keep track of what to draw. And rephrasing a caption with words that mean the same thing sometimes yields different results. There are also signs that DALL·E is mimicking images it has encountered online rather than generating novel ones.”


Concluding thoughts


Bryan Walsh (@bryanrwalsh), the Future Correspondent for Axios, insists DALL·E is a big step forward in the field of artificial intelligence. He writes, “The new models are the latest steps in ongoing efforts to create machine learning systems that exhibit elements of general intelligence, while performing tasks that are actually useful in the real world — without breaking the bank on computing power. … Like GPT-3, the new models are far from perfect, with DALL·E, in particular, dependent on exactly how the text prompt is phrased if it’s to be able to generate a coherent image. The bottom line: Artificial general intelligence may be getting closer, one doodle at a time.”[6] Even though DALL·E may have limitations, it’s still pretty remarkable that it can create pictures from words, and those pictures are worth more than a thousand words. In the case of this article, over 1300 of them.


[1] Gary Martin, “The meaning and origin of the expression: A picture is worth a thousand words,” The Phrase Finder.
[2] “A picture is worth a thousand words,” Wikipedia.
[3] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, and Scott Gray, “DALL·E: Creating Images from Text,” OpenAI, 5 January 2021.
[4] “GPT-3,” Wikipedia.
[5] Will Douglas Heaven, “This avocado armchair could be the future of AI,” MIT Technology Review, 5 January 2021.
[6] Bryan Walsh, “A new AI model draws images from text,” Axios, 5 January 2021.