Interest in natural language processing (NLP) began in earnest in 1950 when Alan Turing published his paper entitled “Computing Machinery and Intelligence,” from which the so-called Turing Test emerged. Turing asserted, in essence, that a computer could be considered intelligent if it could carry on a conversation with a human being without the human realizing they were talking to a machine. The goal of natural language processing is to allow that kind of interaction so that non-programmers can obtain useful information from computing systems. This kind of interaction was popularized in the 1968 movie “2001: A Space Odyssey” and in the Star Trek television series. Natural language processing also includes the ability to draw insights from data contained in emails, videos, and other unstructured material. “In the future,” writes Marc Maxson, “the most useful data will be the kind that was too unstructured to be used in the past.” [“The future of big data is quasi-unstructured,” Chewy Chunks, 23 March 2013] Maxson believes, “The future of Big Data is neither structured nor unstructured. Big Data will be structured by intuitive methods (i.e., ‘genetic algorithms’), or using inherent patterns that emerge from the data itself and not from rules imposed on data sets by humans.”
Alissa Lorentz agrees with Maxson that the amazing conglomeration of data now being collected is mostly of the unstructured variety. “The expanding smorgasbord of data collection points [is] turning increasingly portable and personal, including mobile phones and wearable sensors,” she writes, “resulting in a data mining gold rush that will soon have companies and organizations accruing Yottabytes (10^24) of data.” [“With Big Data, Context is a Big Issue,” Wired Innovation Insights, 23 April 2013] She continues:
“To put things into perspective, 1 Exabyte (10^18) of data is created on the internet daily, amounting to roughly the equivalent of data in 250 million DVDs. Humankind produces in two days the same amount of data it took from the dawn of civilization until 2003 to generate, and as the Internet of Things become[s] a reality and more physical objects become connected to the internet, we will enter the Brontobyte (10^27) Era. So it’s all dandy that we’re getting better and better at sucking up every trace of potential information out there, but what do we do with these mountains of data? Move over Age of Content, enter the Age of Context.”
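The DVD comparison holds up to a rough check: at roughly 4.7 gigabytes per single-layer disc, one Exabyte works out to 10^18 / (4.7 × 10^9) ≈ 213 million DVDs, which is the same order of magnitude as the figure Lorentz cites.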
Lorentz provides a quick and understandable example of why context matters:
“When looking at unstructured data, for instance, we may encounter the number ‘31’ and have no idea what that number means, whether it is the number of days in the month, the amount of dollars a stock increased over the past week, or the number of items sold today. Naked number ‘31’ could mean anything, without the layers of context that explain who stated the data, what type of data [it is], when and where it was stated, what else was going on in the world when this data was stated, and so forth. Clearly, data and knowledge are not the same thing.”
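To make Lorentz’s point concrete, consider a minimal sketch (in Python, with hypothetical field names of our own invention) of how the same naked value only becomes interpretable once those contextual layers are attached:

```python
from dataclasses import dataclass
from datetime import date

# A naked value: nothing tells us what "31" measures.
raw_value = 31

# The same value wrapped in the contextual layers Lorentz lists:
# who stated it, what it is, and when/where it was stated.
@dataclass
class ContextualizedDatum:
    value: float
    measure: str      # what type of data it is
    unit: str
    source: str       # who stated it
    stated_on: date   # when it was stated
    location: str     # where it was stated

datum = ContextualizedDatum(
    value=31,
    measure="items_sold",
    unit="count",
    source="point-of-sale system, store #12",
    stated_on=date(2013, 4, 23),
    location="Chicago, IL",
)

# Only now can downstream code interpret the number safely.
print(f"{datum.value} {datum.unit} of {datum.measure} "
      f"reported by {datum.source} on {datum.stated_on}")
```

The wrapper adds no new information to the number itself; it simply records the context that turns data into knowledge.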
Maurizio Lenzerini agrees with Lorentz. Even if the data is structured, he notes, integrating and relating that data can be an IT nightmare. As he puts it, “The problem is even more severe if one considers that information systems in the real world use different (often many) heterogeneous data sources, both internal and external to the organization.” [“Ontology-based Data Management,” ACM SIGMOD Blog, 14 May 2013] He adds, “If we add to the picture the (inevitable) need of dealing with big data, and consider in particular the two v’s of ‘volume’ and ‘velocity’, we can easily understand why effectively accessing, integrating and managing data in complex organizations is still one of the main issues faced by IT industry nowadays.” Although he didn’t specifically mention the third “V,” variety, it is implicit in his reference to heterogeneous data sources. “When talking about data variety,” writes Ling Zhang, “most often people talk about multiple or diverse data sources, variant data types, structures and formats, say, structured, semi or non-structured data like text, images and videos.” [“Data Variety: What It’s All About,” SmartData Collective, 14 May 2013] She goes on to explain that variety involves even more complexity because you have to consider subjectivity.
“Except for those most common types of variety, contextual information around data and the methods used for creating and gathering data as well as the high dimensionality of data should be also considered as data variety. Those varieties can be counted as objective or physical elements of data variety. Except for the objective nature, data variety also includes subjective nature that is usually missing or ignored by people. What I mean by subjective variety is the interpretation of data or the insight from different perspectives and different entities like people, group and business and their corresponding usages or applications. Because those factors actually drive the way to analyze, mine, integrate and use data or explain the results. And the subjective variety matters as much as objective variety. I also believe subjective variety will drive more objective data varieties.”
It should be clear by now that natural language processing involves a lot more than a computer recognizing a list of words. As Mark Kumar asserts, “The issue of data variety remains … difficult to solve programmatically. … As a result, many big data initiatives remain constrained by the skills of the people available to work on them. And this challenge is keeping the industry from realizing the full potential of big data in diverse fields.” [“Why Variety Is the Unsolved Problem in Big Data,” SmartData Collective, 30 October 2013] Kumar agrees with Lorentz that, “when it comes to data variety, a large part of the challenge lies in putting the data into the right context.”
At Enterra Solutions® we believe that only a system that can sense, think, learn, and act is going to be up to the challenge of performing natural language processing. Our Cognitive Reasoning Platform™ (CRP) uses a combination of artificial intelligence and the world’s largest common sense ontology to help identify relationships and put unstructured data in the proper context. A learning system is necessary because the veracity of data is not always what one would desire. Overcoming this shortfall requires analyzing massive amounts of data. Greg Satell explains, “Imagine you had billions of data points all being collected and analyzed in real time and real conditions. That’s the secret to the transformative power of big data. By vastly increasing the data we use, we can incorporate lower quality sources and still be amazingly accurate. What’s more, because we can continue to reevaluate, we can correct errors in initial assessments and make adjustments as facts on the ground change.” [“Before You Can Manage Big Data, You Must First Understand It,” Forbes, 22 June 2013] Lenzerini explains how a system of this type basically works:
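Satell’s claim is easy to demonstrate in miniature. The following toy simulation (ours, not Enterra’s platform; the noise levels are arbitrary assumptions) streams a million low-quality readings through a continuously re-evaluated running mean and shows the estimate correcting itself as data accumulates:

```python
import random

random.seed(42)
TRUE_VALUE = 100.0

# Simulate a stream of low-quality readings: each one is noisy
# (wide spread) and occasionally wildly wrong.
def noisy_reading():
    if random.random() < 0.05:               # 5% corrupted readings
        return TRUE_VALUE + random.uniform(-50, 50)
    return random.gauss(TRUE_VALUE, 15.0)    # low-precision sensor

# Re-evaluate continuously as data streams in, using an incremental
# running mean that corrects earlier assessments with each new point.
estimate, n = 0.0, 0
for i in range(1, 1_000_001):
    n += 1
    estimate += (noisy_reading() - estimate) / n
    if i in (10, 1_000, 1_000_000):
        print(f"after {i:>9,} readings: estimate = {estimate:.3f}")
```

Individually, any one reading may be off by 15 points or more; in aggregate, the estimate lands within a fraction of a point of the true value, which is precisely the volume-for-quality trade Satell describes.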
“The distinguishing feature of the whole approach is that users of the system will be freed from all the details of how to use the resources, as they will express their needs in the terms of the [Domain Knowledge Base] DKB. The system will reason about the DKB and the mappings, and will reformulate the needs in terms of appropriate calls to services provided by resources. Thus, for instance, a user query will be formulated over the domain ontology, and the system will reason upon the ontology and the mappings to call suitable queries over data sources that will compute the answers to the original user query. As you can see, the heart of the approach is the DKB, and the core of the DKB is the ontology.”
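Lenzerini is describing what the literature calls ontology-based data access. A drastically simplified sketch of the mediation step he outlines (all source names, mappings, and data below are invented; real systems reason over description-logic ontologies and rewrite queries into SQL) might look like this:

```python
# Minimal sketch of ontology-mediated query answering: the user asks
# in domain terms; mappings rewrite the question into source queries.
# Every name here is invented for illustration.

# Mappings tie each ontology concept to concrete source queries.
MAPPINGS = {
    "Customer": [
        ("crm_db",    "SELECT name FROM clients"),
        ("orders_db", "SELECT buyer AS name FROM purchases"),
    ],
}

# Stand-in data sources (in reality: heterogeneous databases/services).
SOURCES = {
    "crm_db":    {"SELECT name FROM clients": ["Ada", "Grace"]},
    "orders_db": {"SELECT buyer AS name FROM purchases": ["Grace", "Alan"]},
}

def answer(concept: str) -> set[str]:
    """Reformulate a domain-level query into source calls, merge results."""
    results: set[str] = set()
    for source, query in MAPPINGS[concept]:
        results |= set(SOURCES[source][query])
    return results

# The user queries the ontology term, never the sources directly.
print(answer("Customer"))   # {'Ada', 'Grace', 'Alan'} (set order may vary)
```

The user asks for “Customer” and never learns that the answer was stitched together from two differently structured sources; that indirection is exactly the role Lenzerini assigns to the DKB’s ontology and mappings.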
Most analysts appear to agree that the next big thing in IT will involve semantic search. It will be a big thing because it will let people who are not subject matter experts obtain answers to their questions by posing queries in natural language. The magic will lie in the analysis behind the search, the work that turns a plain-language question into answers that are both relevant and insightful.