
Understanding Data

April 30, 2010


For most people living in the developed world (and many people living in the developing world), the information age has brought with it a mountain of data that must be stored, analyzed, and sometimes even understood! Ever since people first started collecting data, they have tried to make sense of it. For the mathematically challenged, making sense of data can be as difficult as trying to decipher the meaning of Nostradamus’ prophecies. Topping the chart of perplexing mathematics for many people is calculus. In a refreshing article, Steven Strogatz provides a quick and understandable explanation of how calculus can help us understand the world [“Change We Can Believe In,” New York Times, 11 April 2010]. He writes:

“Long before I knew what calculus was, I sensed there was something special about it. My dad had spoken about it in reverential tones. He hadn’t been able to go to college, being a child of the Depression, but somewhere along the line, maybe during his time in the South Pacific repairing B-24 bomber engines, he’d gotten a feel for what calculus could do. Imagine a mechanically controlled bank of anti-aircraft guns automatically firing at an incoming fighter plane. Calculus, he supposed, could be used to tell the guns where to aim. Every year about a million American students take calculus. But far fewer really understand what the subject is about or could tell you why they were learning it. It’s not their fault. There are so many techniques to master and so many new ideas to absorb that the overall framework is easy to miss. Calculus is the mathematics of change. It describes everything from the spread of epidemics to the zigs and zags of a well-thrown curveball. The subject is gargantuan — and so are its textbooks. Many exceed 1,000 pages and work nicely as doorstops. But within that bulk you’ll find two ideas shining through. All the rest, as Rabbi Hillel said of the Golden Rule, is just commentary. Those two ideas are the ‘derivative’ and the ‘integral.’ Each dominates its own half of the subject, named in their honor as differential and integral calculus. Roughly speaking, the derivative tells you how fast something is changing; the integral tells you how much it’s accumulating. They were born in separate times and places: integrals, in Greece around 250 B.C.; derivatives, in England and Germany in the mid-1600s. Yet in a twist straight out of a Dickens novel, they’ve turned out to be blood relatives — though it took almost two millennia to see the family resemblance.”

Just in case that quick distinction between differential and integral calculus remains unclear, Strogatz continues his explanation of derivatives (these are not the derivatives that got Wall Street in trouble!):

“Derivatives are all around us, even if we don’t recognize them as such. For example, the slope of a ramp is a derivative. Like all derivatives, it measures a rate of change — in this case, how far you’re going up or down for every step you take. A steep ramp has a large derivative. A wheelchair-accessible ramp, with its gentle gradient, has a small derivative. Every field has its own version of a derivative. Whether it goes by ‘marginal return’ or ‘growth rate’ or ‘velocity’ or ‘slope,’ a derivative by any other name still smells as sweet. Unfortunately, many students seem to come away from calculus with a much narrower interpretation, regarding the derivative as synonymous with the slope of a curve. Their confusion is understandable. It’s caused by our reliance on graphs to express quantitative relationships. By plotting y versus x to visualize how one variable affects another, all scientists translate their problems into the common language of mathematics. The rate of change that really concerns them — a viral growth rate, a jet’s velocity, or whatever — then gets converted into something much more abstract but easier to picture: a slope on a graph. Like slopes, derivatives can be positive, negative or zero, indicating whether something is rising, falling or leveling off.”
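To make the ramp example concrete, here is a minimal Python sketch that estimates a derivative numerically. The function names and the ramp dimensions are my own illustrative assumptions, not anything taken from Strogatz's column, and the central-difference formula is just one simple way to approximate a rate of change.

```python
def derivative(f, x, h=1e-6):
    """Approximate the rate of change of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hypothetical ramps: height in meters as a function of horizontal distance.
def steep_ramp(x):
    return 0.5 * x          # rises 0.5 m for every meter traveled

def accessible_ramp(x):
    return x / 12           # assumed gentle 1:12 gradient

# A hypothetical hill that rises, levels off at x = 3, then falls.
def hill(x):
    return -(x - 3) ** 2 + 9

steep_slope = derivative(steep_ramp, 1.0)        # large derivative
gentle_slope = derivative(accessible_ramp, 1.0)  # small derivative

# The sign of the derivative says whether the hill is rising, level, or falling.
rising, level, falling = (derivative(hill, x) for x in (1.0, 3.0, 5.0))
```

The steep ramp yields the larger derivative, and the three values for the hill come out positive, approximately zero, and negative, matching the rising, leveling off, and falling cases described in the quote.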

In a follow-up column, Strogatz explains the integral [“It Slices, It Dices,” New York Times, 18 April 2010]. He writes:

“In astronomy, the gravitational pull of the sun on the earth is described by an integral. It represents the collective effect of all the minuscule forces generated by each solar atom at their varying distances from the earth. In oncology, the growing mass of a solid tumor can be modeled by an integral. So can the cumulative amount of drug administered during the course of a chemotherapy regimen. Historically, integrals arose first in geometry, in connection with the problem of finding the areas of curved shapes. … The area of a circle can be viewed as the sum of many thin pie slices. In the limit of infinitely many slices, each of which is infinitesimally thin, those slices could then be cunningly rearranged into a rectangle whose area was much easier to find. That was a typical use of integrals. They’re all about taking something complicated and slicing and dicing it to make it easier to add up. In a 3-D generalization of this method, Archimedes (and before him, Eudoxus, around 400 B.C.) calculated the volumes of spheres, cones, barrels, prisms and various other solid shapes by re-imagining them as stacks of many wafers or discs, like a salami sliced thin. By computing the volumes of the varying slices, and then ingeniously integrating them — adding them back together — they were able to deduce the volume of the original whole.”
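The slicing idea is easy to try on a computer. The sketch below is my own illustration under simple assumptions, not the method of Archimedes or Strogatz: it approximates each pie slice of a circle by a thin triangle, and it approximates a sphere by a stack of thin discs. The function names are hypothetical.

```python
import math

def circle_area_by_slices(radius, n_slices):
    """Add up n thin triangular pie slices, each with apex angle 2*pi/n."""
    angle = 2 * math.pi / n_slices
    slice_area = 0.5 * radius ** 2 * math.sin(angle)
    return n_slices * slice_area

def sphere_volume_by_discs(radius, n_discs):
    """Stack n thin discs; each disc's radius is measured at its mid-height."""
    thickness = 2 * radius / n_discs
    total = 0.0
    for i in range(n_discs):
        z = -radius + (i + 0.5) * thickness      # height of the disc's center
        disc_radius_sq = radius ** 2 - z ** 2    # Pythagoras inside the sphere
        total += math.pi * disc_radius_sq * thickness
    return total

coarse_area = circle_area_by_slices(1.0, 10)
fine_area = circle_area_by_slices(1.0, 100_000)
fine_volume = sphere_volume_by_discs(1.0, 1_000)
```

With only ten slices the circle estimate is noticeably low. As the slices get thinner, the sums approach the familiar values of pi times the radius squared for the circle and four-thirds pi times the radius cubed for the sphere.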

I began this post with Strogatz’ articles because he helps even the mathematically challenged understand that mathematics can be useful for both generating and understanding data. If you think that calculus is difficult, then you just might find information generated at the quantum level to be just plain weird. At least that’s the conclusion of a review of the book entitled Decoding Reality: The Universe as Quantum Information by Vlatko Vedral [“Weirder science,” by Alan Cane, Financial Times, 10 April 2010]. According to Cane, “information theory can explain both quantum mechanics and the stock market,” even if it is weird. He explains:

“Since the start of the 20th century, theoretical physics has provided a rich seam for authors keen to explain in popular terms the nature of ‘reality’. We already know, courtesy of Albert Einstein, Niels Bohr and others, that reality is, to quote JBS Haldane, ‘queerer than we imagine and queerer than we can imagine’. The reality we can imagine involves elementary particles – protons, neutrons and electrons and even more elementary particles such as quarks and neutrinos. These are reassuringly physical, even if they exhibit the disturbing quantum properties of being in two places at once and able to communicate over astronomical distances. Now Vlatko Vedral, professor of quantum information science at Oxford University, seeks to persuade us that at its most fundamental, reality is encoded in information. This alone, he argues, is enough to explain quantum mechanics as well as biological inheritance, sociology and the stock market. (Interestingly, Claude Shannon, the ‘father’ of information theory, made a fortune out of shares but died with his investment secrets intact.) ‘The reader may not agree with my ultimate view of encoding reality,’ Vedral writes, ‘but hopefully he or she will find the discussion of the separate pillars (biology, economics, gambling and so on) valuable in themselves.’ Certainly he provides conclusive evidence that gambling on the lottery is a waste of money. An immediate question, however, is how to define ‘information’. Most of us have a rough idea. Vedral has a very precise, scientific definition: it is the logarithm of the inverse of the probability of an event. Or, more simply, the more unexpected an event, the more information it contains.”
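Vedral's definition, as quoted, translates directly into a one-line formula. The Python sketch below is a minimal illustration. The quote does not specify the base of the logarithm, so measuring in bits (base 2, the usual convention in information theory) is my assumption, and the probabilities are made-up examples.

```python
import math

def information_bits(probability):
    """Information content of an event: log2(1 / p), measured in bits."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return math.log2(1 / probability)

certain_event = information_bits(1.0)     # no surprise, no information
coin_flip = information_bits(0.5)         # one bit
rare_event = information_bits(1 / 1024)   # hypothetical 1-in-1,024 event
```

A certain event carries zero bits, a fair coin flip carries one bit, and the hypothetical 1-in-1,024 event carries ten bits: the more unexpected the event, the larger the number.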

I like that notion and on the surface it rings true. How many times have you heard someone say that they learn more from their failures than from their successes? For the same reason, we always learn a lot from crises, regardless of how many crises we’ve been through before. Cane, however, believes that Vedral’s claim is counterintuitive. He concludes:

“On this somewhat counterintuitive foundation, [Vedral] builds castles of mathematical logic: ‘In biology, for example, an event could be a genetic modification stimulated by the environment. In economics, an event could be a fall in a share price. In quantum physics, it could be the emission of light by a laser when it is switched on. No matter what the event is, you can apply information theory to it. That is why I will be able to argue that information underlies every process we see in nature,’ he says.”

Cane seems skeptical. His skepticism may result from Vedral’s musings in his book about the reality or non-reality of God — not an area normally addressed by theoretical physicists. Nevertheless, Vedral’s theory is interesting to ponder. We are able to gather and understand information in a remarkable number of ways and experts are able to “scour oddball data to help see trends before official information is available” [“New Ways to Read Economy,” by Cari Tuna, Wall Street Journal, 8 April 2010]. Tuna explains:

“When [San Francisco’s] top economist needs a rough prediction of sales tax revenues, he watches the number of subway passengers emerging from the Powell Street Station on Saturdays. Ted Egan, chief economist in the San Francisco Controller’s Office, said he could wait six months for California to release the detailed sales-tax data he needs for city revenue projections. But it’s quicker to look at passenger tallies from the station closest to the Union Square shopping district, which generates roughly 10% of the city’s sales-tax revenue. The Bay Area Rapid Transit District releases the data within three days, he said: ‘Why should I have to wait?’ Mr. Egan is among a growing number of economists and urban planners who scour for economic clues in unconventional urban data—oddball measures of how people are moving, spending and working.”

Egan is wrestling with the same problem that has faced military and development personnel for years. What are good measures of effectiveness (MOE)? For years, the military was at a loss to find good MOE in situations where “body counts” proved to be inappropriate for the task at hand. And development personnel often find themselves working in situations where official data is simply unavailable. Both groups have had to become creative like Mr. Egan. Tuna continues:

“Broadway ticket sales are a favorite indicator for the chief economist of the New York City Economic Development Corp., Francesco Brindisi. He says they are a good gauge of city tourism. In Jacksonville, Fla., community planner Ben Warner keeps tabs on calls to the city’s 2-1-1 hotline for social services. Since late 2008, he has seen spikes in calls for help with food, housing, utilities payments and suicide prevention. It is ‘direct, real-time monitoring of the economic and social situation,’ he said. At an economic briefing at San Francisco City Hall last month for officials and industry experts, Mr. Egan flashed slides of traditional indicators, along with the number of customers at parking garages near Union Square and average rents for one-bedroom apartments advertised on Craigslist. Mr. Egan’s parking and rent indicators bottomed out last year and are beginning to trend upward, suggesting the local economy isn’t getting much worse. ‘It’s not an exact science,’ he said. But when it comes to data, he said, ‘more is almost always better than less.’ And there is always more. Mr. Egan said he would like to build software to monitor Craigslist prices for furniture, concert tickets, haircuts and other goods and services to measure changes in local prices. The online classified-ads site, he said, would give a quicker and more detailed read than the bimonthly data from the Labor Department. Advancing technology is changing the makeup of the economy, Mr. Egan said, so ‘you never know where the green shoots are going to come from.’ The focus by economic prognosticators on urban data follows a history of people looking to nontraditional signs of impending boom or bust.”

When the military is engaged in stabilization operations, it too looks for nontraditional signs of success or failure. Such things as the number of soccer fields being used on a Saturday morning or the amount of late evening activity in a shopping district can tell one a lot about how secure people are feeling. Tuna reports that experts are discovering a lot of nontraditional indicators from which they can draw information:

“For instance, some economists consider cardboard-box production a leading indicator of economic activity. But the newest offbeat indicators, made possible by improving systems for collecting and disseminating data, are painting even timelier and more geographically specific pictures of economic forces, economists say. ‘Information-technology is allowing the city’s economy to speak to us in lots of different ways,’ Mr. Egan said. ‘We just need to find new ways of listening.’ One rich repository of predictive data is Web searches, said Hal Varian, Google Inc.’s chief economist. Jumps in such queries as ‘unemployment office’ and ‘jobs’ can help predict increases in initial jobless claims, he said. Other search terms, he added, can anticipate traditional data on travel behavior and sales of cars and homes. Some economists warn that urban data often are newer and more volatile than traditional indicators, making them harder to incorporate into analysis and forecasts. ‘I’ll look at it, but I discount it very, very significantly,’ says Mark Zandi, chief economist at Moody’s Analytics. But sometimes, new indicators are more reliable than conventional ones, said Edward Leamer, an economist at the University of California, Los Angeles. He swears by diesel fuel sales, for example. UCLA’s Anderson School of Management recently teamed up with Ceridian Corp., a payments and payroll company, to collect data on diesel purchases by truckers nationwide. The data anticipate increases in U.S. industrial production and gross domestic product, said Mr. Leamer, director of the school’s economic-forecasting group. Mr. Leamer discovered that truckers’ diesel purchases on Interstate Highway 5 from California to Oregon, a major timber-trucking route, are a leading indicator of construction employment in California. 
Diesel sales on Interstate Highway 80 from Sacramento to Salt Lake City, a trucking route for the San Francisco Bay area’s manufactured goods, can help predict California’s manufacturing employment, he said.”
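One simple way to check whether a series such as diesel purchases "leads" another is to shift it forward in time and see at which shift the two line up best. The Python sketch below is only an illustration of that idea. It is not the method used by Mr. Leamer or anyone else quoted here, the function names are hypothetical, and the numbers are invented so that the second series is exactly the first delayed by two periods.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_lead(indicator, target, max_lead):
    """Return (lead, correlation) for the lead, in periods, at which the
    indicator correlates most strongly with later values of the target."""
    scores = {}
    for lead in range(max_lead + 1):
        xs = indicator[:len(indicator) - lead]   # earlier indicator values
        ys = target[lead:]                       # target values `lead` periods later
        scores[lead] = pearson(xs, ys)
    best = max(scores, key=scores.get)
    return best, scores[best]

# Invented data: "output" repeats "diesel" two periods later.
diesel = [5, 7, 6, 9, 8, 4, 3, 6, 7, 10, 9, 5]
output = [6, 4] + diesel[:-2]

lead, correlation = best_lead(diesel, output, max_lead=4)
```

On this invented data the best match occurs at a lead of two periods. Real series are far noisier, and a strong lagged correlation is a hint worth investigating rather than proof that one series predicts the other.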

Having data is different from understanding it or trusting it. Sometimes it’s not even a matter of trust: people ignore data because it tells them something they don’t want to hear. At other times they dismiss it because they doubt its accuracy. Tuna concludes:

“In the first half of 2008, when major government-issued indicators failed to hint at the U.S. economy’s impending downward spiral,… UCLA forecasters chose not to announce a recession because GDP was still growing and the Bureau of Labor Statistics was reporting relatively mild job losses. Bad call. The government later revised the GDP and jobs data downward, and the National Bureau of Economic Research concluded that the recession started in December 2007. The jobs data are unreliable because they are based on sample surveys and don’t adequately capture company openings and closings, Mr. Leamer said in hindsight. When the UCLA economists reviewed the fuel-purchases data late last year, they saw diesel buying had peaked in mid-2007, indicating that fewer goods were being made and moved across the country in the months after. ‘Had we been aware of that data in 2008,’ Mr. Leamer said, ‘we would have made a different call.’ Mr. Leamer is eager to divine more-precise insights from diesel fuel sales. ‘There’s some truck stop somewhere in the country that is the perfect leading indicator,’ he said.”

I’m not sure there is a perfect leading indicator, but I do know that shared data can help organizations make better decisions. Supply chain guru Steve Barker agrees [“Reading the Tea Leaves: Diesel Fuel Sales as a Leading Economic Indicator,” Logistics Viewpoint, 12 April 2010]. After reading Tuna’s article, Barker wrote:

“This article sparked an idea: What if leading carriers were willing to join together and provide their shipment data to trusted third parties on a weekly basis? These third parties could aggregate and analyze the data, while keeping each carrier’s data confidential. This could provide carriers with a new revenue stream, although probably a small one. The bigger value proposition for carriers would be receiving advance warnings about the state of regional economies and key industries. This information would in turn help them perform better financial, capacity, and labor planning.”

Sometimes, as in the UCLA case, a lack of data isn’t the problem. The problem is getting the right data in front of the right people at the right time. Sharing information via user-friendly data displays is becoming as important as gathering the data itself. William Pollard noted, “Information is a source of learning. But unless it is organized, processed, and available to the right people in a format for decision making, it is a burden, not a benefit.” I couldn’t agree more.
