The "Big Data" Dialogues, Part 3

Stephen DeAngelis

September 21, 2011

In Parts One and Two of these dialogues, I discussed the subject of big data from a supply chain perspective — specifically, the perspective of supply chain analyst Lora Cecere. Cecere keeps her finger on the pulse of cutting-edge concepts; but she’s not the only one. Big data has also caught the attention of IBM CEO Sam Palmisano. He recently traveled to Silicon Valley to meet “with venture-capital firms to explore how to take advantage of shifts in ‘big data’.” According to Deborah Gage, Palmisano is “the latest tech executive to tap the connections of early-stage investors.” [“IBM Visits ‘Big Data’ Arena,” Wall Street Journal, 8 September 2011] Gage continues:

“Mr. Palmisano was treading a well-worn path of tech executives seeking to mine the knowledge of Silicon Valley venture capitalists, who often spot new technologies before giants like IBM. Among other non-Silicon Valley tech firms, Microsoft Corp. has opened a Mountain View office to work with start-ups and venture capitalists, AOL Inc. has posted hiring billboards along Highway 101, and EMC Corp., Citrix Systems Inc. and Research In Motion Ltd. have all bought start-ups in the region.”

As I noted in the first part of this series on big data, Cecere believes that, within five years, “the holistic use of this data will be mainstream.” Apparently, others also believe that the big data future is rushing at us at breakneck speed. Gage continues:

“IBM has said it expects to spend $20 billion on acquisitions between 2011 and 2015. The Armonk, N.Y., company has focused for several years on ‘big data,’ the advancement in data-storage capacity, processing power and analytical ability that has begun to transform industries. IBM Vice President Rod Smith said venture-backed start-ups can help IBM’s big-data push because ‘we can’t do it all. We’re seeing a lot of start-ups. It’s a nice, big ocean, and we know we can float a lot of boats.'”

All of this interest in big data has “sparked an arms race among the start-ups for math specialists who can slice and dice data about users’ online behavior to generate more revenue.” [“Online Trackers Rake In Funding,” by Scott Thurm, Wall Street Journal, 25 February 2011] Although the financial sector, with its supersized compensation packages, still attracts its share of math whizzes, Thurm reports that the big data sector is now drawing some of the best mathematicians into its circle. He reports:

“Some [mathematicians] have migrated to advertising from Wall Street. Ari Buchalter, chief operating officer at digital-ad firm MediaMath Inc., holds a Ph.D. in astrophysics and once ran a hedge fund that based trades on mathematical formulas. The company’s CEO, Joe Zawadzki, is a former investment banker. Media6Degrees, which also analyzes users’ social connections, employs four Ph.D. holders. Its chief scientist is a renowned data analyst and machine-learning specialist.”

One of the reasons that big data has become a hot topic is that a piece of “free software named after a stuffed elephant” is making the analysis of such data much easier. The software about which I write is called Hadoop, which was created as “open-source software … by a group of Yahoo! developers.” [“Getting a Handle on Big Data with Hadoop,” by Rachael King, Bloomberg BusinessWeek, 7 September 2011] King writes:

“With its online sales less than a fifth of Amazon’s last year, Wal-Mart executives have turned to software called Hadoop that helps businesses quickly and cheaply sift through terabytes or even petabytes of Twitter posts, Facebook updates, and other so-called unstructured data. Hadoop, which is customizable and available free online, was created to analyze raw information better than traditional databases like those from Oracle. ‘When the amount of data in the world increases at an exponential rate, analyzing that data and producing intelligence from it becomes very important,’ says Anand Rajaraman, senior vice-president of global e-commerce at Wal-Mart and head of @WalmartLabs, the retailer’s division charged with improving its use of the Web.”

King reports that other large corporations using Hadoop include Walt Disney, General Electric, Nokia, and Bank of America. She notes that the software’s popularity is, in large part, based on its flexibility. “The software can be applied to a variety of tasks,” she writes, “including marketing, advertising, and sentiment and risk analysis.” Perhaps the best endorsement the software has received came from IBM, which “used the software as the engine for its Watson computer, which competed with the champions of TV game show Jeopardy.” For more on that competition, read my post entitled Artificial Intelligence and the Future. King asserts that “Hadoop is riding the ‘big data’ wave, where the massive quantity of unstructured information ‘presents a growth opportunity that will be significantly larger’ than the $25 billion relational database industry dominated by Oracle, IBM, and Microsoft, according to a July report by Cowen & Co.” Obviously, large numbers (like $25 billion) get my attention since Enterra Solutions® is also in the big data analysis business. Enterra’s technologists are also familiar with Hadoop and the flexibility it provides. King continues:

“This year, 1.8 zettabytes (1.8 trillion gigabytes) of data will be created and replicated, according to a June report by market research firm IDC Digital Universe and sponsored by EMC, the world’s biggest maker of storage computers. One zettabyte is the equivalent of the information on 250 billion DVDs, according to Cisco Systems’ Visual Networking Index.”
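
The figures quoted above are easy to sanity-check with simple arithmetic. The sketch below uses decimal (SI) units and assumes a roughly 4-gigabyte single-layer DVD, which is what makes Cisco’s 250-billion-DVD comparison come out:

```python
# Back-of-the-envelope check of the data-volume figures, in decimal (SI) units.
ZETTABYTE = 10**21   # bytes
GIGABYTE = 10**9     # bytes

# IDC's "1.8 trillion gigabytes": 1.8 zettabytes expressed in gigabytes.
gigabytes = 1.8 * ZETTABYTE / GIGABYTE
print(f"{gigabytes:.1e} GB")   # 1.8e+12 GB, i.e., 1.8 trillion gigabytes

# Cisco's comparison: one zettabyte spread evenly across 250 billion DVDs.
bytes_per_dvd = ZETTABYTE / 250e9
print(f"{bytes_per_dvd / GIGABYTE:.0f} GB per DVD")  # 4 GB, about one single-layer DVD
```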

Now you know why people call it “big” data. King continues:

“One of the challenges of Hadoop is getting it all to work together in a corporation. Hadoop is made up of a half-dozen separate software pieces that require integration to get it to work, says Merv Adrian, a research vice-president at Gartner. That requires expertise, which is in short supply, he says. … [Nevertheless,] the increasing popularity of Hadoop software also mirrors the growth in corporate spending on handling data. Since 2005, the annual investment by corporations to create, manage, store, and generate revenue from digital information has increased 50 percent to $4 trillion, according to the IDC report.”

Enterra Solutions specializes in the handling of unstructured data. This kind of analytical support is essential because “about 80 percent of corporations’ data is the unstructured type, which includes office productivity documents, e-mail, Web content, in addition to social media.” Making sense of unstructured data is difficult because it involves nuances that traditional analytical systems simply can’t master. King continues:

“Oracle sells companies its Exadata system to manage huge quantities of structured information such as financial data. ‘Hadoop plays in a much larger market than Exadata and is a materially cheaper way to process vast data sets,’ says Peter Goldmacher, an analyst at Cowen & Co. in San Francisco.”

King reports that the potential of Hadoop is so large that analysts “expect Oracle to make a Hadoop-related announcement in October.” King also mentions another tool we use at Enterra — MapReduce. She writes:

“Web companies were the first to face the big-data challenge now confronting large corporations. In 2004, Google published a paper about software called MapReduce that used distributed computing to handle large data sets.”
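
The programming model the Google paper describes can be illustrated with its classic example, a word count: a map step emits key-value pairs, a shuffle groups values by key, and a reduce step aggregates each group. This toy sketch runs all three phases in a single process; the point of the real framework is that each phase is distributed across many machines:

```python
# A toy, single-process illustration of the MapReduce model: word count.
from collections import defaultdict

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big insight", "big data"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 3, 'data': 2, 'insight': 1}
```

Because the map and reduce functions are independent of how the data is partitioned, the same user code scales from one machine to thousands — the property that made the model attractive for web-scale data sets.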

King brings up the MapReduce paper because, she indicates, it was part of the inspiration for Yahoo employee Doug Cutting to create Hadoop in 2006. She reports that Cutting named the software “after his son’s stuffed elephant.” According to King, “Cutting now works at a company called Cloudera that offers Hadoop-related software and services for corporations. Its customers include Samsung Electronics, AOL Advertising, and Nokia.” She continues:

“‘It was obvious to me that the problems that Google and Yahoo and Facebook had were the problems that the other companies were going to have later,’ says Cloudera Chief Executive Officer Mike Olson. While Yahoo developers have contributed most of the code to Hadoop, it’s an open project, part of the Apache Software Foundation. Developers around the world can download and contribute to the software. Other Hadoop-related projects at Apache have names such as Hive, Pig, and Zookeeper.”

Cutting is apparently not the only Yahoo programmer to leave the company and pursue Hadoop activities elsewhere. King reports, “Some of the original Yahoo contributors to Hadoop have formed a spinoff called Hortonworks to focus on development of the software. The company expects that within five years more than half of the world’s data will be stored in Hadoop environments.” My guess is that Hortonworks was named after Dr. Seuss’ famous elephant Horton (from Horton Hears a Who!).


King asserts that Wal-Mart is making a big bet on Hadoop. She explains:

“Wal-Mart, recognizing that the next generation of commerce would be social, purchased startup Kosmix for $300 million in April to create @WalmartLabs. The acquisition gave it immediate expertise in big data: Kosmix co-founders Rajaraman and Venky Harinarayan co-founded Junglee, the company that pioneered Internet comparison shopping in 1996 and was later purchased by Amazon. At Kosmix, they also built something called the Social Genome, which uses semantic-analysis technology and applies it to a real-time flood of social media to understand what people are saying. For now, @WalmartLabs uses Hadoop to create language models so that the site can return product results if the shopper enters a related word. For example, if somebody searches for a ‘backyard chair’ on Walmart.com, the site will return results for patio furniture. In the future, Wal-Mart may be able to return styles of patio furniture most likely to appeal to a particular shopper based on his tweets and Facebook updates. The company also uses Hadoop in its keyword campaigns to drive traffic from search engines to Walmart.com. The software collects information about which keywords work best to turn Internet surfers into shoppers, and then comes up with the optimal bids for different words.”
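
The “backyard chair” example above amounts to query expansion: the engine looks up related terms for the shopper’s query and matches the catalog against all of them. The sketch below is a minimal, hypothetical illustration of that idea — the synonym table and product names are invented for this example, and a production system like @WalmartLabs’ would learn related terms from search logs and social data rather than hard-code them:

```python
# A minimal sketch of query expansion with a hand-built (hypothetical) synonym table.
RELATED_TERMS = {
    "backyard chair": ["patio furniture", "outdoor chair"],
}

CATALOG = [
    "3-piece patio furniture set",
    "folding outdoor chair",
    "office desk chair",
]

def search(query):
    """Return catalog items matching the query or any of its related terms."""
    terms = [query] + RELATED_TERMS.get(query, [])
    return [item for item in CATALOG
            if any(term in item for term in terms)]

print(search("backyard chair"))
# ['3-piece patio furniture set', 'folding outdoor chair']
```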

King reports that Nokia is “another company that recently recognized the treasure trove of data it’s sitting on.” She explains:

“‘Sixty percent of the world’s population has mobile devices, and Nokia has 25 percent of those mobile customers,’ says Amy O’Connor, senior director of analytics at Nokia. ‘Over the course of the past year we realized we had all this data we could use competitively.’ For example, Nokia collects information for its Navteq mapping service that it sells to large businesses. The company can tap into data from probes and mobile devices around the world to collect data on traffic. To figure out information about a particular street, the company used to have people weed through hundreds of terabytes of data. ‘It was a manual process before Hadoop,’ O’Connor says. Now that the company is taking advantage of this unstructured information, the amount of data that it manages is skyrocketing. Over the next year or so, O’Connor anticipates that Nokia’s network will be handling as much as 20 petabytes (20 million gigabytes) of information, up from several hundred terabytes managed over the past year. ‘The tsunami of data is not going to stop,’ O’Connor says.”

The fact that “the tsunami of data is not going to stop” is exactly what companies involved in the big data sector are counting on. Since it is open-source software, Hadoop is likely to continue to play a major role in how big data is handled. Companies like Enterra Solutions will build on what Hadoop has to offer to ensure that clients get the most benefit they can from the growing oceans of unstructured data.