Big Data: Hope and Hype, Part 2

Stephen DeAngelis

February 22, 2012

Part 1 of this two-part series discussed a McKinsey study that concluded that Big Data represents the next frontier. Part of that discussion covered concerns about Big Data analysis raised by Daniel W. Rasmus, who isn’t quite as sanguine about the future of Big Data as the analysts at McKinsey & Company. [“Why Big Data Won’t Make You Smart, Rich, Or Pretty,” Fast Company, 27 January 2012] The discussion ended with two of Rasmus’ nine “existential threats to the success of Big Data and its applications.” In this post, I’ll discuss the remaining threats on his list. Rasmus’ next threat involves complexity. He writes:

“Combining models full of nuance and obscurity increases complexity. Organizations that plan complex uses of Big Data and the algorithms that analyze the data need to think about continuity and succession planning in order to maintain the accuracy and relevance of their models over time, and they need to be very cautious about the time it will take to integrate, and the value of results achieved, from data and models that border on the cryptic.”

Combining models is not the only complexity involved in Big Data. Most observers agree that there are three “Vs” associated with Big Data: volume (terabytes to petabytes and beyond); velocity (including real-time, sub-second delivery); and variety (encompassing structured, unstructured, and semi-structured formats). To those three, some observers add a fourth “V”: volatility (the ever-changing sources of data, e.g., new apps, web services, social networks, etc.). Rasmus’ next concern involves feedback loops. He writes:

“Big Data isn’t just about the size of well-understood data sets, it is about linking disparate data sets and then creating connective tissue, either through design or inference, between these data sets.”

I couldn’t agree more. At the heart of the Enterra Solutions® approach is an artificial intelligence (AI) knowledge base that includes an ontology and extended business rules capable of advanced inference. An ontology interrelates concepts and facts through many-to-many relationships, which makes it better suited to artificial intelligence applications than a standard relational database. It creates the “connective tissue” discussed by Rasmus. His next concern is about the algorithms that drive Big Data applications. He writes:
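To make the many-to-many idea concrete, here is a minimal sketch (not Enterra’s actual system) of how an ontology can store facts as subject–predicate–object triples, so that any concept can relate to any number of others without a fixed table schema. The entities and relation names are illustrative assumptions:

```python
# Toy triple store illustrating many-to-many "connective tissue."
# Facts are (subject, predicate, object) triples rather than rows
# in a fixed relational schema; all names here are made up.
from collections import defaultdict

class TripleStore:
    def __init__(self):
        self.by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        # Any subject can carry any number of predicates and objects.
        self.by_subject[subject].add((predicate, obj))

    def related(self, subject, predicate):
        """All objects linked to `subject` via `predicate`."""
        return {o for p, o in self.by_subject[subject] if p == predicate}

store = TripleStore()
store.add("Walmart", "is_a", "retailer")
store.add("Walmart", "supplies", "groceries")
store.add("Amazon", "is_a", "retailer")
store.add("Amazon", "supplies", "groceries")
```

Because relationships live in the data rather than in the schema, linking a new data set is a matter of adding triples, not redesigning tables — which is one way to build the “connective tissue” between disparate data sets that Rasmus describes.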

“It is not only algorithms that can go wrong when a theory proves incorrect or the assumptions underlying the algorithm change. There are places where no theory exists at any level of consensus to be meaningful. The impact of education (and the effectiveness of various approaches), how innovation works, or what triggers a fad are examples of behaviors for which little valid theory exists–it’s not that plenty of opinion about various approaches or models is lacking, but that a theory, in the scientific sense, is nonexistent. For Big Data that means a number of things, first and foremost, that if you don’t have a working theory, you probably don’t know what data you need to test any hypotheses you may posit. It also means that data scientists can’t create a model because no reliable underlying logic exists that can be encoded into a model.”

I agree with Rasmus that a business shouldn’t consider a Big Data solution for any process it doesn’t fundamentally understand. No one should know a business better than those who own and operate it. A solutions provider needs to work closely with a company to ensure that the model and algorithms it provides are right and that the data being gathered and analyzed are correct. Rasmus’ next concern involves confirmation bias. He writes:

“Every model is based on historical assumptions and perceptual biases. Regardless of the sophistication of the science, we often create models that help us see what we want to see, using data selected as a good indicator of such a perception. … Even when a model exists that is designed to aid in decision making about the future, that model may involve contentious disagreements about its validity and alternative approaches that yield very different results. These are important debates in the world of Big Data. One group of modelers advocates for one approach, and another group, an alternative approach, both using sophisticated data and black boxes (as far as the uninitiated business person is concerned) to support their cases. The fact is that in cases like this, no one knows the answer definitively as the application may be contextual or it may be incomplete (e.g., a new approach may solve the issue that none of the current approaches solves completely). What can be said, and what must be remembered is, the adage that ‘a futurist is never wrong today.'”

Clearly Big Data has some value when it comes to forecasting; but, Rasmus’ concerns are nonetheless valid. Eliminating (or, at least, reducing) confirmation bias in such systems is an important consideration. Rasmus’ next concern involves the fact that the world changes (i.e., that it is not a good idea to steer a ship by looking astern). He writes:

“We must remember that all data is historical. There is no data from or about the future. Future context changes cannot be built into a model because they cannot be anticipated. Consider this: 2012 is the 50th anniversary of the 1962 Seattle World’s Fair. In 1962, the retail world was dominated by Sears, Montgomery Ward, Woolworth, A&P, and Kresge. Some of those companies no longer exist, and others have merged to the point that they are unrecognizable from their 1962 incarnations. … Would models of retail supply chains built in 1962 be able to anticipate the overwhelming disruption that [Wal-Mart’s] humble storefront would cause for retail? Did Sam Walton understand the impact of Amazon.com when it went live in 1995? The answer to all of the above is ‘no.’ These innovations are rare and hugely disruptive.”

Rasmus is arguing that organizations must be flexible and that models they use must have feedback loops if they are to maintain “relevance through incremental improvement.” He then reminds us that occasionally “the world changes so much that current assumptions become irrelevant and the clock must be started again. Not only must we remember that all data is historical, but we must also remember that at some point historical data becomes irrelevant when the context changes.” Rasmus’ next concern involves motives. He writes:

“Given the complexity of the data and associated models, along with various intended or unintended biases, organizations have to go out of their way to discern the motives of those developing analytics models, lest they allow programs to manipulate data in a way that may precipitate negative social, legal, or fiduciary outcomes.”

We all know that there are numerous privacy concerns associated with the collection and analysis of Big Data. I suspect that privacy concerns are more likely to spur outrage in the general populace than any other concern. Along with data breaches, they are also likely to get a company in trouble more often than other concerns. Rasmus’ final concern involves issues about actions that are taken as a result of Big Data analysis. He writes:

“Consider crime analysis. George Mohler of Santa Clara University in California has applied equations that predict earthquake aftershocks to crime. By using the locations and times of recent crimes, the system predicts ‘aftercrimes.’ This kind of anticipatory data may result in bastions of police flooding a neighborhood following one burglary. With no police presence, the anticipated crimes may well take place. If the burglars, however, see an increase in surveillance and police activity, they may abandon planned targets and seek new ones, thus invalidating the models’ predictions, potentially in terms of time and location. The proponents of Big Data need to ensure that the users of their models understand the intricacies of trend analysis, what a trend really is, and the implications of acting on a model’s recommendations.”
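The aftershock analogy comes from self-exciting point process models, in which each event temporarily raises the likelihood of further nearby events. The following is only a toy illustration of that general idea, not Mohler’s actual model, and the parameter values are illustrative assumptions:

```python
import math

def aftershock_intensity(t, event_times, mu=0.5, alpha=0.8, beta=1.2):
    """Conditional intensity of a simple self-exciting process:
    a constant background rate `mu` plus an exponentially decaying
    boost contributed by each past event. Parameters are illustrative."""
    boost = sum(alpha * math.exp(-beta * (t - ti))
                for ti in event_times if ti < t)
    return mu + boost

# Shortly after a burglary at time 0.9, the predicted rate at time 1.0
# is elevated above the background rate; the boost fades as time passes.
baseline = aftershock_intensity(1.0, [])
elevated = aftershock_intensity(1.0, [0.9])
```

This also makes Rasmus’ point easy to see: the model only extrapolates from past events, so if a visible police response changes the burglars’ behavior, the elevated-intensity prediction is invalidated by the very act of responding to it.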

All of these concerns might lead you to believe that Rasmus is anti-Big Data. He’s not. He admits that “some of the emerging Big Data stories don’t test the existential limits of technology, nor do they threaten global catastrophe.” In other words, there are applications for Big Data that are useful. He writes:

“Big Data will no doubt be used to target advertising, reduce fraud, fight crime, find tax evaders, collect child support payments, create better health outcomes, and myriad other activities from the mundane to the ridiculous. And along the way, the software companies and those who invested in Big Data will share their stories.”

Rasmus is interested in how Big Data can improve the quality of life, not just a company’s bottom line. He provides a few examples:

“Companies like monumental constructor Arup use Big Data as a way to better model the use of the buildings they build. The Arup software arm, Oasys, recently acquired MassMotion to help them understand the flow of people through buildings. … The result is a model, sometimes with thousands of avatars, pushing and shoving, congregating and separating–all based on MassMotion’s Erin Morrow and how he perceives the world. Another movement-oriented application of Big Data, Jyotish (Sanskrit for astrology), comes from Boeing’s research center at the University of Illinois in Urbana-Champaign. This application predicts the movement of work crews within Boeing’s factories. It will ultimately help them figure out how to save costs and increase satisfaction by ensuring that services, like Wi-Fi, are available where and when they are needed. Palantir, the Palo Alto-based startup focused on solving the intelligence problem of 9/11, discovers correlations in the data that inform military and intelligence agencies about who, what, and when a potential threat turns into an imminent threat. … For some fields, like biology, placing large data sets into open source areas may bring a kind of convergence as collaboration ensues. But as Michael Nielsen points out in Reinventing Discovery, scientists have very little motivation to collaborate given the nature of publication, reputation, and tenure.”

Rasmus concludes, “I seriously doubt that we have the intellectual infrastructure to support the collaborative capabilities of the Internet. We may well be able to connect all sorts of data and run all kinds of analyses, but in the end, we may not be equipped to apply the technology in a meaningful and safe way at scales that outstrip our ability to represent, understand, and validate the models and their data.” At this point in time, Rasmus is probably correct. Who knows what the world of computing will look like a half-century or century from now? If organizations simply used data to improve business processes, increase marketing opportunities, or better position inventory, Rasmus might have a cheerier view of Big Data. He seems to believe, however, that much more sinister things are afoot. He ends his article this way:

“The future of Big Data lies not in the stories of anecdotal triumph that report sophisticated, but limited accomplishments–no, the future of Big Data rather lies in the darkness of context change, complexity, and overconfidence. I will end, as [Chicago professor Richard H. Thaler] did in his New York Times article (“The Overconfidence Problem in Forecasting”), by quoting Mark Twain: ‘It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.’”

Just because Big Data can be (and likely is) abused doesn’t mean that there are no benefits to be gained through its collection and analysis. Like any other area of business, ethics is important when dealing with Big Data. The story of Big Data is just beginning to be written. There are likely to be plot twists and turns; but, in the end, my biases tell me that the world will benefit from all this data in ways that are not yet apparent. It’s good, though, to have gadflies like Rasmus reminding us of potential misuses.