Big Data and Language

Stephen DeAngelis

April 20, 2012

Since Enterra Solutions® uses an ontology in most of its solutions, the topic of language is of interest to me both personally and professionally. That’s why two recent articles caught my attention. The first article discusses how Big Data is being used to discover how the use of words has changed over time. The second article talks about how some executives are taking courses aimed at making them more literate in the language of IT.

In the first article, Christopher Shea asks, “Can physicists produce insights about language that have eluded linguists and English professors?” [“The New Science of the Birth and Death of Words,” Wall Street Journal, 16 March 2012] To answer that question, a team of physicists used Big Data analytics to search for insights from “Google’s massive collection of scanned books.” The result: “They claim to have identified universal laws governing the birth, life course and death of words.” The team reported its findings in an article published in the journal Scientific Reports. Shea continues:

“The paper marks an advance in a new field dubbed ‘Culturomics’: the application of data-crunching to subjects typically considered part of the humanities. Last year a group of social scientists and evolutionary theorists, plus the Google Books team, showed off the kinds of things that could be done with Google’s data, which include the contents of five-million-plus books, dating back to 1800.”

Whether or not you are interested in linguistics, this effort demonstrates how powerful Big Data techniques can be for producing new insights. According to Shea, the team’s research “gave the best-yet estimate of the true number of words in English—a million, far more than any dictionary has recorded (the 2002 Webster’s Third New International Dictionary has 348,000).” Shea continues:

“More than half of the language, the authors wrote, is ‘dark matter’ that has evaded standard dictionaries. The paper also tracked word usage through time (each year, for instance, 1% of the world’s English-speaking population switches from ‘sneaked’ to ‘snuck’). It also showed that we seem to be putting history behind us more quickly, judging by the speed with which terms fall out of use. References to the year ‘1880’ dropped by half in the 32 years after that date, while the half-life of ‘1973’ was a mere decade.”
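To make the half-life idea in that passage concrete, here is a minimal Python sketch of how such a figure could be computed from a yearly frequency series. The function name `half_life` and the data are my own assumptions for illustration; the series is synthetic, built to decay smoothly, and is not taken from the Google Books data or from the researchers' method.

```python
def half_life(freq_by_year, start_year):
    """Return the number of years after start_year until the frequency
    first falls to half (or less) of its start_year value, or None if
    it never does within the series."""
    baseline = freq_by_year[start_year]
    for year in sorted(y for y in freq_by_year if y > start_year):
        if freq_by_year[year] <= baseline / 2:
            return year - start_year
    return None


# Synthetic, invented series (NOT real data): mentions of a term that
# decay exponentially, halving every 32 years.
toy_series = {1880 + i: 100.0 * 0.5 ** (i / 32) for i in range(41)}

print(half_life(toy_series, 1880))
```

With real counts, the series would first need to be normalized by the total number of words published each year, since the volume of printed text grows over time.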

The shrinking half-life of those year references demonstrates the increasing velocity of new knowledge as well as the importance of storing old knowledge. I’m a fan of history, and Big Data techniques may eventually help us paint a truer, less biased history of the world. I’m also a fan of the future, and I believe that Big Data techniques will help us make that future better. Shea continues:

“In the new paper, Alexander Petersen, Joel Tenenbaum and their co-authors looked at the ebb and flow of word usage across various fields. ‘All these different words are battling it out against synonyms, variant spellings and related words,’ says Mr. Tenenbaum. ‘It’s an inherently competitive, evolutionary environment.’”

I’m reminded of President Andrew Jackson’s quote, “It’s a damn poor mind that can think of only one way to spell a word!” He was joined in that sentiment by Mark Twain, who wrote, “I don’t give a damn for a man that can only spell a word one way.” I suspect those sentiments are also shared by former U.S. Vice President Dan Quayle, who once famously “corrected” elementary student William Figueroa’s spelling of “potato” to the incorrect “potatoe” at a spelling bee. Shea continues:

“When the scientists analyzed the data, they found striking patterns not just in English but also in Spanish and Hebrew. There has been, the authors say, a ‘dramatic shift in the birth rate and death rates of words’: Deaths have increased and births have slowed. English continues to grow—the 2011 Culturomics paper suggested a rate of 8,500 new words a year. The new paper, however, says that the growth rate is slowing. Partly because the language is already so rich, the ‘marginal utility’ of new words is declining: Existing things are already well described. This led them to a related finding: The words that manage to be born now become more popular than new words used to get, possibly because they describe something genuinely new (think ‘iPod,’ ‘Internet,’ ‘Twitter’).”

Although the scientists claim that “higher death rates for words … are largely a matter of homogenization,” I wonder whether it is also a consequence of education becoming more specialized and less general. Shea continues:

“The explorer William Clark (of Lewis & Clark) spelled ‘Sioux’ 27 different ways in his journals (‘Sieoux,’ ‘Seaux,’ ‘Souixx,’ etc.), and several of those variants would have made it into 19th-century books. Today spell-checking programs and vigilant copy editors choke off such chaotic variety much more quickly, in effect speeding up the natural selection of words.”

Of course, spell checkers aren’t perfect. An anonymous poet penned the following poem to make that point:

I have a spelling checker
It came with my PC
It plainly marks for my revue
Mistakes I cannot sea
I’ve run this poem threw it
I’m sure your pleased to no,
It’s letter perfect in it’s weigh
My checker tolled me sew.
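The poem works because a simple spell checker only asks whether each word exists, not whether it is the right word in context. Here is a minimal Python sketch of that kind of lookup-based check. The names `WORD_LIST` and `misspelled` are hypothetical, and the tiny word list is a stand-in I assembled for illustration; a real checker would load a full dictionary.

```python
import re

# Hypothetical stand-in for a real dictionary: every entry is a valid
# English word (or contraction), which is all a lookup-based checker tests.
WORD_LIST = set("""
a came cannot checker for have i i'm i've in it it's letter marks me
mistakes my no pc perfect plainly pleased poem revue run sea sew
spelling sure this threw to tolled weigh with your
""".split())

POEM = """I have a spelling checker
It came with my PC
It plainly marks for my revue
Mistakes I cannot sea
I've run this poem threw it
I'm sure your pleased to no,
It's letter perfect in it's weigh
My checker tolled me sew."""


def misspelled(text, dictionary):
    """Return the words in text that are absent from the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in dictionary]


# Every homophone in the poem is a real word, so nothing is flagged.
print(misspelled(POEM, WORD_LIST))
# A genuine non-word is caught.
print(misspelled("I have a speling checker", WORD_LIST))
```

Catching "revue" for "review" requires looking at the surrounding words, which is a much harder problem than a dictionary lookup.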

Shea reports that the database analyzed by the scientists “does not include the world of text- and Twitter-speak, so some of the verbal chaos may just have shifted online.” He continues:

“Synonyms also fight Darwinian battles. In one chart, the authors document that ‘Roentgenogram’ was by far the most popular term for ‘X-ray’ (or ‘radiogram,’ another contender) for much of the 20th century, but it began a steep decline in 1960 and is now dead. (‘Death,’ in language, is not as final as with humans: It refers to extreme rarity.) ‘Loanmoneys’ died circa 1950, killed off by ‘loans.’ ‘Persistency’ today is breathing its last, defeated in the race for survival by ‘persistence.’ The authors even identified a universal ‘tipping point’ in the life cycle of new words: Roughly 30 to 50 years after their birth, they either enter the long-term lexicon or tumble off a cliff into disuse. The authors suggest that this may be because that stretch of decades marks the point when dictionary makers approve or disapprove new candidates for inclusion. Or perhaps it’s generational turnover: Children accept or reject their parents’ coinages.”

What I found interesting was that the scientists discovered a “similar trajectory of word birth and death across time in three languages.” Even so, they concluded that the field “is still too new to evaluate fully.” As is normally the case, not everyone agrees with the conclusions reached by the team. Academics love arguing amongst themselves. Shea reports:

“Among the questions raised by critics: Since older books are harder to scan, how much of the word ‘death’ is simply the disappearance of words garbled by the Google process itself? In the end, words and sentences aren’t atoms and molecules, even if they can be fodder for the same formulas.”

In our work at Enterra, we understand that every discipline develops its own special lexicon. That’s why we work hard to ensure that our ontology understands words in various settings. The IT world is no different when it comes to creating a specialized language that can sound foreign to the technology-challenged. Jonathan Moules reports that some executives are taking courses to help them understand this specialized lexicon. [“Coding as a second language,” Financial Times, 28 March 2012] He reports:

“Alliott Cole sees a large number of tech start-ups in his work as principal in the early-stage investment team of private equity firm Octopus. The trouble is that he often struggles to comprehend what those writing the software that underpins those companies are talking about. ‘For several years I have worked hard to understand how new infrastructure, products and applications work together to disrupt markets,’ he says, explaining why he recently decided to take a course that claims to be able to teach even the most IT-illiterate person how to create a software application, or app, in just a day. ‘While [I am] conversant in many of the trends and the – often confusing – array of terminology, it troubled me that I remained an observant passenger rather than an active driver, particularly in the realms of computer programming.’ Mr Cole is not alone.”

The course taken by Cole was created by three former advertising executives – Steve Henry, Kathryn Parsons and Richard Peters – and Alasdair Blackwell, an award-winning web designer and developer, because they felt there was “a widely felt, but rarely discussed, problem. Tech talk is increasingly commonplace in business and life … but most people, including senior executives, find the language used by software engineers, social media professionals and the ‘digital natives’ … baffling.” Moules reports that modern technology is changing many industries, so even well-educated people need an occasional refresher to “revisit the basics of how technology functions.” After spending a day taking the course with a handful of executives, Moules reports that they all were “happy to leave [programming] to the experts – but now, at least, they feel more confident of being able to talk the same language.”

As the business world enters the age of Big Data, more specialized words are likely to be invented to describe technologies that cannot adequately be described using the current lexicon. There will also be words made up by marketing departments that will catch on. Only Big Data techniques, especially rule-based ontological analysis, are capable of making the connections and providing the insights that will help us make better decisions in the future, perhaps even decisions about the words we use.
