Business’ Big Data Problem: Data Quality

Stephen DeAngelis

December 2, 2022

There is almost universal recognition that the age in which we live is dominated by data. Data is the sine qua non of the Digital Age. Here’s the problem: According to a survey conducted by HFS Research, “75 percent of business executives do not have a high-level of trust in their data. … This lack of trust comes despite 89 percent of executives surveyed saying a high level of data quality was critical for success.”[1] It’s hard to imagine how an organization can succeed in the Digital Age if its executives lack trust in what their data is telling them. Tech writer David Curry explains, “As organizations continue to collect more data from an ever-expanding amount of sources, the need for proper data management is critical to ensure future success.”[2] As a result, it appears job one for businesses is to improve data quality.

Improving Data Quality

Business consultant Tejasvi Addagada observes, “As organizations digitize customer journeys, the implications of low-quality data are multiplied manyfold.”[3] It’s the old story of “garbage in, garbage out.” At the same time, data sitting unanalyzed in a large database provides little to no value. Value is derived from analysis. Today, because databases are so enormous, data analysis is more often than not carried out by some form of cognitive technology (aka artificial intelligence). To avoid the “garbage in, garbage out” trap, Addagada advises companies to strengthen their data controls and pay attention to “data quality dimensions.” He insists data quality dimensions are essential for the proper performance of AI models. He suggests there are three important data quality dimensions. They are:

• Accuracy: “How well does data reflect reality, like a phone number from a customer?”

• Completeness: “Is there complete data available to process for a specific purpose, like ‘housing expense’ to provide a loan? (Column completeness — Is the complete ‘phone number’ available? Group completeness — Are all attributes of ‘address’ available?) Is there complete fill rate in storage to process all customers?”

• Validity: “Is data in a specific format? Does it follow business rules? Is it in an unusable format to be processed?”
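To make these dimensions concrete, here is a minimal Python sketch of how completeness and validity checks might look. The record layout, the field names, and the phone-number format are all assumptions made for illustration; they do not come from Addagada's article. Accuracy is left out on purpose, because judging whether a value reflects reality requires an external reference (for example, confirming the number with the customer) rather than a rule applied to the data itself.

```python
import re

# Hypothetical customer records; the field names are illustrative only.
REQUIRED_ADDRESS_FIELDS = ("street", "city", "postal_code")
# Assumed business rule: phone numbers look like 512-555-0100.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def column_complete(record, field):
    """Column completeness: the field is present and non-empty."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def group_complete(record, fields=REQUIRED_ADDRESS_FIELDS):
    """Group completeness: every attribute of 'address' is filled in."""
    address = record.get("address") or {}
    return all(column_complete(address, f) for f in fields)


def phone_valid(record):
    """Validity: the phone number matches the assumed format."""
    phone = record.get("phone")
    return isinstance(phone, str) and PHONE_PATTERN.fullmatch(phone) is not None


def fill_rate(records, field):
    """Share of records in which the field is populated."""
    if not records:
        return 0.0
    return sum(column_complete(r, field) for r in records) / len(records)


customers = [
    {"phone": "512-555-0100",
     "address": {"street": "1 Main St", "city": "Austin", "postal_code": "78701"}},
    {"phone": "5125550101",
     "address": {"street": "2 Oak Ave", "city": "", "postal_code": "78702"}},
    {"phone": None, "address": None},
]

report = {
    "phone_fill_rate": fill_rate(customers, "phone"),
    "valid_phones": sum(phone_valid(c) for c in customers),
    "complete_addresses": sum(group_complete(c) for c in customers),
}
print(report)
```

In this sample, the second record has a phone number (so it counts toward the fill rate) but fails the format rule, which shows why completeness and validity are worth measuring separately.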

Tech analyst Benedict Evans puts a different spin on the data quality problem. He suggests organizations stop looking at data and start looking at information. He explains, “Technology is full of narratives, but one of the loudest and most persistent concerns artificial intelligence and something called ‘data’. AI is the future, we are told, and it’s all about data — and data is the future, and we should own it and maybe be paid for it. And countries need data strategies and data sovereignty, too. Data is the new oil. This is mostly nonsense. There is no such thing as ‘data’, it isn’t worth anything, and it doesn’t belong to you anyway. Most obviously, data is not one thing, but innumerable different collections of information, each of them specific to a particular application, that can’t be used for anything else.”[4]

Evans’ point is that organizations must determine what information they need in order to obtain the business insights they desire. For example, he notes, “Siemens has wind turbine telemetry and Transport for London has ticket swipes, and those aren’t interchangeable. You can’t use the turbine telemetry to plan a new bus route, and if you gave both sets of data to Google or Tencent, that wouldn’t help them build a better image recognition system.” When Enterra Solutions® analysts begin work with a client, they spend much of their initial effort helping clients identify the data (or information) they need in order to obtain the results they are seeking. In the long run, this upfront effort saves both time and money. David Angelow, a business consultant and Adjunct Professor in the McCoy School of Business at Texas State University, suggests organizations establish a “digital supply chain (DSC).” He explains, “[In physical supply chains, organizations] use tools, algorithms, and methods to manage the inherent variability in demand and supply to be successful. Yet few organizations are well equipped to manage their DSCs.”[5]

Angelow makes a good point. Data gathering, storage, analysis, and application use many of the same principles that govern physical supply chains. He suggests that if data (or information) is viewed as the product (i.e., “the output we are delivering”), then the SCOR model created by the Association for Supply Chain Management (ASCM) might provide a good framework for improving data quality. He maps the model's concepts to data as follows:

• Plan. “Everything starts with a plan. With the DSC, the need is to plan for data we need to capture. Which systems will we use to source the data? How often will we need the data to be refreshed? How long do we need to keep/store the data? What analysis will the data support?”

• Source. “Make [a] strategic sourcing decision to decide the ‘System of Record’ (SOR) for different data elements needed. Similar data may exist in more than one system, and we need a SOR as THE single system that we use for each specific element. With data we do not want ‘dual source,’ because it reduces data integrity (traceability) and accuracy.”

• Make. “Data does not go through an assembly or build process, but each data element can be thought of as a completed item. That means it must be at the appropriate quality level, and with the appropriate availability for any approved ‘customer’ who needs to use the data. That said, data may need to be transformed to be consistent with other data elements — for example standardize all dates into mm-dd-yyyy format, or all currency values to dollars.”

• Deliver/Return. “Physical goods often need complex distribution networks to align product supply with customer demand. Often the ‘last mile’ to the customer affects the complexity of the network [and] the concentration or fragmentation of customers impacts the design. With the data supply chain, the implications of delivery surround who is authorized to obtain and use the data.”
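The Source and Make steps lend themselves to a short illustration. The Python sketch below assumes a hypothetical system-of-record map and a small set of input date formats; the system names ("crm", "erp"), the element names, and the formats are invented for the example and are not drawn from Angelow's article. It picks each data element from its single designated source and then standardizes dates into the mm-dd-yyyy format he mentions.

```python
from datetime import datetime

# Hypothetical system-of-record (SOR) map: one authoritative source per element.
SYSTEM_OF_RECORD = {"customer_phone": "crm", "order_date": "erp"}

# Date formats we assume the source systems emit (illustrative only).
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def pick_from_sor(element, values_by_system):
    """Source: take the value from the designated SOR and ignore other copies."""
    system = SYSTEM_OF_RECORD[element]
    return values_by_system[system]


def to_mm_dd_yyyy(raw):
    """Make: parse a date in any known format and re-emit it as mm-dd-yyyy."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%m-%d-%Y")
        except ValueError:
            continue  # try the next assumed format
    raise ValueError(f"Unrecognized date format: {raw!r}")


# The same order date exists in two systems; only the ERP copy is trusted.
copies = {"erp": "2022-08-05", "crm": "05/08/2022"}
order_date = pick_from_sor("order_date", copies)
print(to_mm_dd_yyyy(order_date))
```

Raising an error on an unrecognized format, rather than guessing, is a deliberate choice in this sketch: a silently misread date would undermine the very data quality the transformation is meant to protect.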

Deloitte analysts observe, “AI can enable incredible business value, and the beating heart of AI is data. But managing data — even when it might appear relatively straightforward — can be hard work.”[6]

Concluding Thoughts

I agree with Evans that end-users aren’t looking for data. They are looking for information and insights. If individuals working with data lose sight of why data is being gathered, stored, analyzed, and used, they offer slim value to the organizations for which they work. Tech writer Madhurjya Chowdhury explains, “Despite the complexity of Big Data, the biggest error that data scientists make is not understanding the context and corporate objectives of their work. … Context is crucial when working with huge data.”[7] If companies obtain the right data and perform the right analysis on that data, more business executives will start trusting their data.

Footnotes
[1] David Curry, “75% Executives Don’t Trust Their Data,” RTInsights, 22 February 2022.
[2] Ibid.
[3] Tejasvi Addagada, “Data Quality Dimensions Are Crucial for AI,” Dataversity, 18 April 2022.
[4] Benedict Evans, “There is no such thing as ‘data’,” Financial Times, 27 May 2022.
[5] David Angelow, “What If You Treated Your Data Supply Chain Like the Physical Supply Chain?” SupplyChainBrain, 5 August 2022.
[6] Deloitte, “5 Ugly Truths About Data—And How to Win at AI Regardless,” The Wall Street Journal, 6 September 2022.
[7] Madhurjya Chowdhury, “Common False Beliefs About Big Data in Business,” Analytics Insight, 3 November 2022.
