Big Data: The Beginning of an Era

Stephen DeAngelis

May 9, 2013

“For all the change we’ve experienced,” writes Adam Frank, “the most profound transformation of the digital era is really just getting started. Welcome to the era of Big Data.” [“Big Data Is The Steam Engine Of Our Time,” New Hampshire Public Radio, 13 April 2013] He continues:

“Ultimately, the promise of Big Data is the ability to understand (and control) a seemingly chaotic world on levels never before imagined. The dangers of Big Data stem from that very same promise. Its impact on society will be akin to the transformative effect of past technological revolutions.”

Although I admit that the devil is in the details when it comes to Big Data, at the very highest level of abstraction, Big Data holds remarkable potential for helping to shape the world. This potential is eloquently discussed by photographer Rick Smolan in the following video.

If Smolan is correct, and we are only in the “caveman era of Big Data,” you can understand why so much hype is being made about the subject. I like the description of Big Data that Marissa Mayer provided to Smolan. She said that Big Data was “the planet developing a nervous system.” Just as the steam engine was the catalyst for the industrial revolution, Frank believes that Big Data (more specifically, the analysis of Big Data) is likely to spur a new data-driven revolution. He concludes:

“I believe there is something real and powerful happening in the Big Data revolution. It’s more than just a fad. It’s the next link in the long chain connecting culture and technology to human history. Now Big Data — seen and unseen — is hitting us in all corners of our lives, from the price of things to traffic patterns to who our social networks think we befriend. Through new fields like data science and network theory, Big Data will not only change the world we move through as individuals, it will change the world we imagine through science. Like it or not, Big Data will only get bigger (and bigger).”

Gartner analysts believe that 2013 could be the year that really launches the Big Data revolution, at least in the business world. Earlier this year Gartner “conducted a global survey of firms and found that 42 percent of respondents had invested in big data technology, or were planning to do so within a year.” [“Gartner: 2013 is the year of Big Data,” by Antony Savvas, ComputerWorld, 12 March 2013] Paul Kent agrees that 2013 will likely be a watershed period in the deployment of Big Data technologies. In January, he made five predictions about where he thought the Big Data sector was heading. [“Five big data predictions for 2013,” SAS Voices, 23 January 2013] They were:

  • Streaming data from gadgets, cars and other devices will become an even bigger and more important data source.
  • Social data will be used to make important business and policy decisions, not just for marketing and sharing information with friends.
  • The industry will better distinguish between big data BI and big data analytics. Neither is better than the other; but, companies will need to understand which is best suited for their business.
  • A Hadoop cluster will become considered a core element of a company’s ‘analytics platform.’
  • NoSQL people will acknowledge the utility of a general purpose table-oriented query language, and SQL people will agree that relaxing some tenets of the relation model can bring huge advantages to other dimensions of data management problems.

When you talk about creating a “nervous system” for the planet, Kent’s first prediction is the one that is most profound. It involves the so-called Internet of Things that will primarily involve machine-to-machine communication and networks. As I noted earlier, however, that “nervous system” will most likely take decades to develop into something truly revolutionary. In the meantime, the business world will continue to test the envelope of possibilities created by Big Data. Because we are at the genesis of the Big Data era, Doug Henschen notes that “Big Data project leaders still hunger for some key technology ingredients.” [“5 Big Wishes For Big Data Deployments,” InformationWeek, 22 April 2013] He writes:

“If you’ve even experimented with building big-data applications or analyses, you’re probably acutely aware that the domain has its share of missing ingredients. We’ve boiled it down to five top wants on the big-data wish list, starting with SQL (or at least SQL-like) analysis options and shortcuts to deployment and advanced analytics and finishing with real-time and network analysis options. The good news is that people and, in some cases, entire communities, are working on these problems. … The whole point of gathering up and making use of big data is to come up with predictions and other advanced analytics that can trigger better-informed business decisions. But with the shortage of data-savvy talent in the world, companies are looking for an easier way to support sophisticated analyses. Machine learning is one technique that many vendors and companies are investigating because it relies on data and computer power, rather than human expertise, to spot customer behaviors and other patterns hidden in data.”

The mention of machine learning is a good segue into Henschen’s first wish: SQL Analysis at Big-Data Scale. He writes:

“You could compile a massive data set just by gathering all the stories and reports that have been written about the shortage of big-data talent. The most acute need is for data scientist types who know data and who also know how to write custom code, MapReduce jobs, and algorithms to gain insights from big data. But what if SQL-savvy professionals schooled in relational databases and business intelligence (BI) and analytics tools could do more of the heavy lifting? There are many more SQL professionals out there than there are data scientists, and most SQL pros would be eager to expand their career potential. There’s a big push to deliver SQL-analysis capabilities on top of Hadoop, and the talent shortage is just one reason. The second reason for the trend is that Apache Hive, Hadoop’s incumbent data warehousing infrastructure, offers a limited subset of SQL-like query capabilities and suffers from slow performance tied to behind-the-scenes MapReduce processing.”
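To make the contrast Henschen describes a little more concrete, here is a minimal sketch that computes the same answer two ways: once with a single SQL statement, and once with hand-written map, shuffle, and reduce steps. It is an illustration only. It uses Python's built-in sqlite3 module rather than Hive or Hadoop, and the sample orders, the column names, and the function names totals_sql and totals_mapreduce are all made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (customer, region, amount). Not from any real system.
ORDERS = [
    ("alice", "east", 120.0),
    ("bob", "west", 80.0),
    ("carol", "east", 45.5),
    ("dave", "west", 20.0),
    ("erin", "north", 10.0),
]


def totals_sql(orders):
    """Total order amount per region, expressed as one declarative SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
    conn.close()
    return dict(rows)


def totals_mapreduce(orders):
    """The same totals, written as explicit map, shuffle, and reduce steps."""
    # Map: emit a (key, value) pair for every input record.
    mapped = [(region, amount) for _customer, region, amount in orders]

    # Shuffle: group all values that share a key.
    grouped = defaultdict(list)
    for region, amount in mapped:
        grouped[region].append(amount)

    # Reduce: collapse each group of values to a single result.
    return {region: sum(amounts) for region, amounts in grouped.items()}


if __name__ == "__main__":
    print(totals_sql(ORDERS))
    print(totals_mapreduce(ORDERS))
```

Both functions return the same totals per region. The difference is in who can write them: the first needs only SQL, while the second requires the author to think in terms of the processing steps themselves, which is the skill gap the quote refers to.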

To learn more about the requirement for (and shortage of) data scientists, read my posts entitled Big Data Requires Big Talent, Part 1 and Part 2 and, more recently, The Search for Data Scientists. Henschen’s second wish is for “simplified deployment and management.” He writes:

“There’s no shortage of efforts to simplify the deployment and management of big-data platforms including Hadoop and NoSQL databases. It seems each and every software update brings new management features and new built-in capabilities. …
We haven’t heard a lot of complaining about the hardware-related challenges of building out Hadoop clusters. Nonetheless, EMC, IBM, Oracle and Teradata insist their released and pending Hadoop appliances make deployment faster and easier than the build-it-yourself approach. The cost of commodity hardware might be alluring, but Oracle, for one, says its appliance costs less than build-it-yourself deployments when taking into account the price of individual components, time saved on provisioning and tuning the system, and support and upgrade efforts. Oracle’s appliance includes pre-configured, ready-to-run versions of Cloudera software and Oracle’s NoSQL database. The real messiness and complication of managing Hadoop usually involves the software, not hardware configuration.”

Some businesses have been reluctant to jump on the Big Data bandwagon because they have been burned previously when buying expensive information technology hardware and software. With many solutions now moving to the cloud, much of that concern can be eliminated. Nevertheless, their caution is understandable. Henschen’s third wish is for “easier paths to advanced analytics.” He explains:

“Developing algorithms and predictive models is work that has to be carried out by hard-to-find, expensive data scientists. Or is it? Scarcity of talent is one reason big-data, analytics and business intelligence vendors are developing machine-learning approaches. Proven in applications including optical character recognition, spam filtering and computer security threat detection, machine learning uses learning algorithms that are trained by the data itself. If you show the algorithm thousands or tens of thousands of examples of scanned text characters, unsolicited email messages, or virus bots and malware, it can reliably find more examples. The same approach can be applied to spotting customers who are ready to churn or jet engines that are about to fail. With machine learning, trained models also can continue to learn from new data.”
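The idea of an algorithm "trained by the data itself" can be sketched in a few lines. What follows is a toy naive Bayes text classifier written with only the Python standard library. It is one simple learning technique chosen for illustration, not the method used by any vendor mentioned here, and the class name TinyNaiveBayes and the training messages are invented. Real systems learn from thousands of examples rather than six.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy naive Bayes text classifier (illustrative sketch, not production code)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of examples seen
        self.vocab = set()                       # every word seen so far

    def learn(self, text, label):
        """Update the model with one labeled example; can be called at any time."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        """Return the label whose examples best explain the words in the text."""
        words = text.lower().split()
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_examples in self.label_counts.items():
            # Log prior: how common this label is among the examples seen.
            score = math.log(n_examples / total_examples)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing so unseen words do not zero out the score.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples.
model = TinyNaiveBayes()
for message in ["win free money now", "free prize claim now", "cheap pills free offer"]:
    model.learn(message, "spam")
for message in ["meeting agenda for monday", "lunch with the project team",
                "please review the quarterly report"]:
    model.learn(message, "ham")

if __name__ == "__main__":
    print(model.predict("claim your free prize"))
    print(model.predict("review the project report"))
```

Because learn can be called again whenever a new labeled message arrives, the same model keeps adjusting as fresh data comes in, which is the "continue to learn from new data" property the quote mentions.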

At Enterra Solutions, we are big believers in machine learning. And our Cognitive Reasoning Platform™ makes it easier for business executives to conduct Big Data analytics by automating human/computer interactions. Henschen’s fourth wish involves “real-time analysis options.” He writes:

“Another item on the big-data analytics wish list is real-time performance. Two startup vendors going after this opportunity are marketing analytics vendor Causata and real-time Hadoop-analysis vendor HStreaming. For Causata, ‘real time’ means making decisions in under 50 milliseconds. You need that kind of speed to change content, banner ads and marketing offers while your customers are still active on websites and mobile devices.”

Although real-time performance is necessary in some situations, information latency is not a problem for all situations in which Big Data can be helpful. Henschen’s final wish is for “network insight.” He writes:

“Social networks are contributing to the scale and variability of big data. The social networks themselves use graph databases and analysis tools to uncover the web of user relationships by studying ‘nodes’ — representing people, companies, locations and so on — and edges, the often-complex relationships among those nodes.
… It’s not the stampede of solutions you see around Hadoop, but there’s clearly growing interest in graph analysis.”
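For readers unfamiliar with the vocabulary of nodes and edges, here is a small sketch of the idea in plain Python. It stores a handful of made-up relationships and asks a typical network question: who is two steps away from a given person? The names, the relationship labels, and the helper functions build_graph and second_degree are hypothetical. Graph databases answer this kind of question at far larger scale and with their own query languages.

```python
from collections import defaultdict

# Hypothetical edges: (node, node, relationship label).
EDGES = [
    ("ana", "ben", "friend"),
    ("ben", "cy", "friend"),
    ("ana", "dee", "coworker"),
    ("dee", "cy", "friend"),
    ("cy", "acme", "employee_of"),
]


def build_graph(edges):
    """Build an undirected graph as node -> {neighbor: relationship label}."""
    graph = defaultdict(dict)
    for a, b, label in edges:
        graph[a][b] = label
        graph[b][a] = label
    return graph


def second_degree(graph, node):
    """Nodes exactly two steps away: neighbors of neighbors, minus direct ones."""
    direct = set(graph[node])
    reachable = set()
    for neighbor in direct:
        reachable |= set(graph[neighbor])
    return reachable - direct - {node}


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(second_degree(graph, "ana"))
    print(graph["cy"])
```

Note that nodes are not limited to people. In this sketch "acme" stands for a company, which matches the quote's point that nodes can represent people, companies, locations, and so on.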

Even before solutions are available, companies are preparing for opportunities that are likely to emerge. Antony Savvas reports, “In anticipation of big data opportunities, said Gartner, organisations across industries are provisionally collecting and storing a burgeoning amount of operational, public, commercial and social data. In most industries combining these sources with existing underutilised ‘dark data’, such as emails, multimedia and other enterprise content, represents the most immediate opportunity to transform businesses.” We cannot yet know what lies along the road down which Big Data is about to take us. Most analysts believe that the world will function better because we will have greater understanding. For sure, there will be challenges along the way, but I believe the journey we are beginning is going to be an exciting one.