Big Data and Semantics

Stephen DeAngelis

January 17, 2013

“Perhaps when it comes to natural language processing and related fields,” write Alon Halevy, Peter Norvig, and Fernando Pereira, “we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.” [“The Unreasonable Effectiveness of Data,” Expert Opinion, March/April 2009] Why, you might ask, is the effectiveness of data considered unreasonable? The answer involves the fact that much available data is ambiguous (e.g., video content), misleading (e.g., social media rants), duplicated, or simply wrong (e.g., incorrect entries in a database). Even with all of the challenges associated with big data, because data sets are so “big,” errors can be minimized and effective insights gained. Halevy, Norvig, and Pereira focus particularly on the challenges associated with natural language processing (NLP). They continue:

“The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need. … In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties.”

Because tagged data is scarce, the researchers recommend using big data sets to establish useful semantic relationships. They claim, “For many tasks, words and word combinations provide all the representational machinery we need to learn from text. … Human language has evolved over millennia to have words for the important concepts; let’s use them.” Many of the services we offer at Enterra Solutions® take advantage of semantic relationships. We use a Sense, Think, Act, and Learn® system to find exactly the kind of useful relationships and insights discussed by Halevy, Norvig, and Pereira. The authors, all of whom worked for Google when they wrote the article, are interested in semantic relationships as they pertain to the World Wide Web. They make a distinction between the Semantic Web and “semantic interpretation.” They write:

“The Semantic Web is a convention for formal representation languages that lets software services interact with each other ‘without needing artificial intelligence.’ A software service that enables us to make a hotel reservation is transformed into a Semantic Web service by agreeing to use one of several standards for representing dates, prices, and locations. The service can then interoperate with other services that use either the same standard or a different one with a known translation into the chosen standard. As Tim Berners-Lee, James Hendler, and Ora Lassila write, ‘The Semantic Web will enable machines to comprehend semantic documents and data, not human speech and writings.’”

From what I read, most people believe the Semantic Web is going to rely heavily on artificial intelligence and will do much more than permit the exchange of standardized data. Nevertheless, Halevy, Norvig, and Pereira rue the fact that the word “semantic” is associated with both web services and more robust interpretation processes. They write:

“The problem of understanding human speech and writing — the semantic interpretation problem — is quite different from the problem of software service interoperability. Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making data precise enough that the programs operating on the data will function effectively. Unfortunately, the fact that the word ‘semantic’ appears in both ‘Semantic Web’ and ‘semantic interpretation’ means that the two problems have often been conflated, causing needless and endless consternation and confusion. The ‘semantics’ in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation. The ‘semantics’ in semantic interpretation of natural languages is instead embodied in human cognitive and cultural processes whereby linguistic expression elicits expected responses and expected changes in cognitive state. Because of a huge shared cognitive and cultural context, linguistic expression can be highly ambiguous and still often be understood correctly.”

As Shakespeare might write, “There’s the rub.” As the authors note, “The challenges for achieving accurate semantic interpretation are different.” Before discussing the challenges of semantic interpretation, they list the problems that have already been solved. They continue:

“We’ve already solved the sociological problem of building a network infrastructure that has encouraged hundreds of millions of authors to share a trillion pages of content. We’ve solved the technological problem of aggregating and indexing all this content. But we’re left with a scientific problem of interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it. The semantic interpretation problem remains regardless of whether or not we’re using a Semantic Web framework.”

They then go into some detail about remaining challenges that need to be addressed, particularly a way to “infer relationships between column headers or mentions of entities in the world.” They explain:

“These inferences may be incorrect at times, but if they’re done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data. Interestingly, here too Web-scale data might be an important part of the solution. The Web contains hundreds of millions of independently created tables and possibly a similar number of lists that can be transformed into tables. These tables represent structured data in myriad domains. They also represent how different people organize data — the choices they make for which columns to include and the names given to the columns. The tables also provide a rich collection of column values, and values that they decided belong in the same column of a table. We’ve never before had such a vast collection of tables (and their schemata) at our disposal to help us resolve semantic heterogeneity. Using such a corpus, we hope to be able to accomplish tasks such as deciding when ‘Company’ and ‘Company Name’ are synonyms, deciding when ‘HP’ means Helmerich & Payne or Hewlett-Packard, and determining that an object with attributes ‘passengers’ and ‘cruising altitude’ is probably an aircraft.”
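One way to picture the “Company” versus “Company Name” task is to compare the values that appear under each header across many tables. The following is only a minimal sketch under stated assumptions, not the authors’ method: the two tiny tables, the function names, and the 0.5 threshold are all hypothetical, and a real system would draw on millions of Web tables and far richer evidence.

```python
from collections import defaultdict


def column_values_by_header(tables):
    """Collect the set of normalized cell values seen under each normalized header."""
    values = defaultdict(set)
    for table in tables:
        for header, cells in table.items():
            values[header.strip().lower()].update(c.strip().lower() for c in cells)
    return values


def jaccard(a, b):
    """Share of values two columns have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_synonyms(tables, threshold=0.5):
    """Return header pairs whose value overlap meets the (hypothetical) threshold."""
    values = column_values_by_header(tables)
    headers = sorted(values)
    pairs = []
    for i, first in enumerate(headers):
        for second in headers[i + 1:]:
            score = jaccard(values[first], values[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


# Hypothetical toy tables standing in for independently created Web tables.
TABLES = [
    {"Company": ["Hewlett-Packard", "IBM", "Oracle"],
     "Founded": ["1939", "1911", "1977"]},
    {"Company Name": ["IBM", "Oracle", "Hewlett-Packard"],
     "City": ["Armonk", "Austin", "Palo Alto"]},
]

if __name__ == "__main__":
    print(likely_synonyms(TABLES))
```

Here the two headers are paired because the same company names appear under both. With only two tables the result means little; the authors’ point is that the signal becomes useful at Web scale.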

As difficult as those challenges are (and they do remain hard), they involve structured data. Structured data is generally easier to work with than unstructured data. Given that vast amounts of unstructured data are being created each day, you can understand the magnitude of the challenge that lies ahead. Even if all of the technological problems are sorted out (which, I believe, they will be), the one challenge that will never be met is deceit. The authors explain:

“We know how to build sound inference mechanisms that take true premises and infer true conclusions. But we don’t have an established methodology to deal with mistaken premises or with actors who lie, cheat, or otherwise deceive. Some work in reputation management and trust exists, but for the time being we can expect Semantic Web technology to work best where an honest, self-correcting group of cooperative users exists and not as well where competition and deception exist.”

Language (written or oral) is not just a matter of words and relationships. Pauses and punctuation can also make a difference. I once saw a sign that read:

Let’s eat Grandma.
Let’s eat, Grandma.
Commas save lives!

Although the wolf in the story of Little Red Riding Hood might state the first line, most of us would understand the second line to be correct. So would a good inference engine. But that’s a blog for another day. Halevy, Norvig, and Pereira conclude:

“Follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.”
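The advice to keep the detail in the data, rather than summarize it, can be illustrated with a simple table of word-pair counts built from unlabeled text. This is a rough sketch under stated assumptions, not anything the authors published: the three-sentence corpus and the function names are hypothetical, and a bigram count table is just one small example of a model that grows with the data.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the raw text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; no labels or annotations are needed.
CORPUS = [
    "follow the data",
    "gather the data",
    "trust the words",
]

if __name__ == "__main__":
    model = train_bigrams(CORPUS)
    print(most_likely_next(model, "the"))
```

Nothing here is annotated by hand. The model is the counts themselves, so adding more text adds more detail rather than being squeezed into a fixed set of parameters.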

My prediction is that a lot can be done!