An Intelligent Data Pipeline Focuses on Quality

Stephen DeAngelis

January 24, 2022

We’ve all heard that data is the new oil — or, at the very least, a very valuable company asset. In fact, Yossi Sheffi (@YossiSheffi), Director of the MIT Center for Transportation & Logistics, asserts big data is an organization’s most valuable asset. “The well-worn adage that a company’s most valuable asset is its people needs an update,” he writes. “Today, it’s not people but data that tops the asset value list for companies.”[1] Where do companies get their data? Journalist Amber Lee Dennis (@AmberLDennis) explains, “Data can be anywhere. Companies store data in the cloud, in data warehouses, in data lakes, on old mainframes, in applications, on drives — even on paper spreadsheets. Every day we create 2.5 quintillion bytes of data, and there are no signs of this slowing down anytime soon.”[2] Based on the time-tested adage “knowledge is power,” companies have been hoarding data for years. Today, however, there is another truism that is becoming important in the Digital Age: Quality is more important than quantity.

The Importance of Data Quality

A survey conducted by Wakefield Research and Fivetran, a leading provider of automated data integration, found, “71 percent of respondents say end users are making business decisions with old or error-prone data — with 66 percent saying their C-suite doesn’t know this is even happening. As a result, 85 percent of enterprises have made bad decisions that have cost them money. Companies are paying large sums to reach these bad outcomes.”[3] As President and CEO of a decision science firm, I understand the importance of data quality in the decision-making process. Unfortunately, firms are struggling to find and leverage quality data. Research conducted by ESG, in partnership with InterSystems, found, “Nearly half (48%) of organizations are still struggling to use and access quality data as underlying technology is failing to deliver on a number of critical functions. … While organizations are looking to rapidly progress how they deliver data across the value chain, many are still faced with security (47%), complexity (38%), and performance (36%) challenges.”[4]

Data, of course, only has value when it is analyzed for insights. Today, most advanced analytics solutions involve machine learning (ML) in some way or another. Eric Siegel (@predictanalytic), a former computer science professor at Columbia University, observes, “[Companies are using] machine learning — which is genuinely powerful and everyone oughta be excited about it.”[5] On the other hand, analysts at Great Expectations explain, “ML isn’t just a magic wand you can wave at a pile of data to quickly get insightful, reliable results.”[6] Journalist James E. Powell adds, “Poor data quality is Enemy #1 to the widespread, profitable use of machine learning, and for this reason, the growth of machine learning increases the importance of data cleansing and preparation. The quality demands of machine learning are steep, and bad data can backfire twice — first when training predictive models and second in the new data used by that model to inform future decisions.”[7] So what constitutes good data quality? Bud Walker, Vice President of Sales & Strategy for Melissa, breaks down Data Quality into six dimensions.[8] They are:

• Completeness: Are all the pertinent fields filled?
• Validity: Do all the values conform?
• Accuracy: Does the data reflect real-world conditions?
• Consistency: Does the data align with understood patterns?
• Uniqueness: Are there duplicate instances?
• Timeliness: Is it up-to-date?
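
To make these dimensions concrete, here is a minimal Python sketch of how five of them might be checked automatically. It is an illustration only, not Melissa's method or any vendor's API. The records, field names, email pattern, country list, and one-year freshness threshold are all hypothetical assumptions. Accuracy is left out on purpose, because checking it requires comparing records against a trusted real-world reference, which this toy example does not have.

```python
import re
from datetime import date, timedelta

# Hypothetical customer records; every field name and value is illustrative.
RECORDS = [
    {"id": 1, "email": "ana@example.com", "country": "US", "updated": date(2022, 1, 20)},
    {"id": 2, "email": "not-an-email", "country": "US", "updated": date(2022, 1, 18)},
    {"id": 3, "email": None, "country": "usa", "updated": date(2020, 6, 1)},
    {"id": 1, "email": "ana@example.com", "country": "US", "updated": date(2022, 1, 20)},
]

REQUIRED = ("id", "email", "country", "updated")       # assumed mandatory fields
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")   # deliberately simple pattern
KNOWN_COUNTRIES = {"US", "CA", "MX"}                   # assumed agreed-upon codes


def quality_report(records, today, max_age_days=365):
    """Return, per dimension, the positions of records that fail the check."""
    issues = {
        "completeness": [],
        "validity": [],
        "consistency": [],
        "uniqueness": [],
        "timeliness": [],
    }
    seen_ids = set()
    for pos, rec in enumerate(records):
        # Completeness: are all the pertinent fields filled?
        if any(rec.get(field) is None for field in REQUIRED):
            issues["completeness"].append(pos)
        # Validity: does the value conform to the expected format?
        email = rec.get("email")
        if email is not None and not EMAIL_RE.match(email):
            issues["validity"].append(pos)
        # Consistency: does the value follow the agreed-upon coding pattern?
        if rec.get("country") not in KNOWN_COUNTRIES:
            issues["consistency"].append(pos)
        # Uniqueness: has this identifier already appeared?
        if rec.get("id") in seen_ids:
            issues["uniqueness"].append(pos)
        seen_ids.add(rec.get("id"))
        # Timeliness: was the record updated recently enough?
        updated = rec.get("updated")
        if updated is not None and today - updated > timedelta(days=max_age_days):
            issues["timeliness"].append(pos)
    return issues


report = quality_report(RECORDS, today=date(2022, 1, 24))
for dimension, positions in report.items():
    print(f"{dimension}: records at positions {positions}")
```

In practice, a report like this would typically feed a dashboard or block a pipeline run rather than print to a console, and the thresholds would come from the business rather than from the code.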

According to journalist George Lawton (@glawton), “Big data quality issues can lead not only to inaccurate algorithms, but also serious accidents and injuries as a result of real-world system outcomes. At the very least, business users will be less inclined to trust data sets and the applications built on them. In addition, companies may be subject to government regulatory scrutiny if data quality and accuracy play a role in frontline business decisions.”[9]

Intelligent Data Pipelines

According to the Wakefield Research/Fivetran survey mentioned above, “On average, 44 percent of [a data engineer’s] time is wasted building and rebuilding data pipelines, which connect data lakes and warehouses with databases and applications.” Bill Schmarzo, a customer advocate in Data Management Innovation at Dell Technologies, believes the way to rectify this situation is by creating intelligent data pipelines. He explains, “Data is the heart of training AI and ML models. And high-quality, trusted data orchestrated through highly efficient and scalable pipelines means that AI can enable these compelling business and operational outcomes. Just like a healthy heart needs oxygen and reliable blood flow, so too is a steady stream of cleansed, accurate, enriched, and trusted data important to the AI/ML engines.”[10] He goes on to explain:

“The purpose of a data pipeline is to automate and scale common and repetitive data acquisition, transformation, movement, and integration tasks. A properly constructed data pipeline strategy can accelerate and automate the processing associated with gathering, cleansing, transforming, enriching, and moving data to downstream systems and applications. As the volume, variety, and velocity of data continue to grow, the need for data pipelines that can linearly scale within cloud and hybrid cloud environments is becoming increasingly critical to the operations of a business. A data pipeline refers to a set of data processing activities that integrates both operational and business logic to perform advanced sourcing, transformation, and loading of data. A data pipeline can run on either a scheduled basis, in real time (streaming), or be triggered by a predetermined rule or set of conditions.”
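
The stages Schmarzo lists (gathering, cleansing, transforming, enriching, and moving data) can be sketched as a chain of small functions. The Python below is a hedged illustration under stated assumptions, not his design or any product's implementation. The records, the price table, and the in-memory list standing in for a downstream system are all hypothetical. It shows a single batch run; a scheduler, a stream consumer, or a rule-based trigger would simply call `run_pipeline` at the appropriate moment.

```python
# Hypothetical price lookup used by the enrichment stage.
PRICES = {"A-100": 2.5, "C-300": 1.0}


def gather():
    """Gather: return raw records as they might arrive from a source system."""
    return [
        {"sku": " a-100 ", "units": "3"},
        {"sku": "B-200", "units": None},   # incomplete record
        {"sku": "c-300", "units": "12"},
    ]


def cleanse(records):
    """Cleanse: drop records with missing units and trim stray whitespace."""
    for rec in records:
        if rec["units"] is not None:
            yield {**rec, "sku": rec["sku"].strip()}


def transform(records):
    """Transform: normalize SKU casing and convert units to integers."""
    for rec in records:
        yield {"sku": rec["sku"].upper(), "units": int(rec["units"])}


def enrich(records):
    """Enrich: add a revenue figure derived from the price lookup."""
    for rec in records:
        yield {**rec, "revenue": rec["units"] * PRICES.get(rec["sku"], 0.0)}


def run_pipeline(source, stages, sink):
    """Move: pass the source through each stage, then load the result into the sink."""
    data = source
    for stage in stages:
        data = stage(data)
    sink.extend(data)
    return sink


warehouse = []  # stands in for a downstream system or application
run_pipeline(gather(), [cleanse, transform, enrich], warehouse)
print(warehouse)
```

Because each stage is a generator that takes and returns records, stages can be added, removed, or reordered without touching the others, which is one simple way to get the repeatability the quote describes.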

When Enterra® provides solutions for clients, one of the first things we sort out is the data pipeline. We work closely with users to understand what they want to accomplish and identify the data needed to achieve that objective. The data pipeline is the sine qua non of world-class cognitive solutions, like the Enterra Global Insights and Decision Superiority System™. Schmarzo concludes, “Chief data officers and chief data analytic officers are being challenged to unleash the business value of their data — to apply data to the business to drive quantifiable financial impact. The ability to get high-quality, trusted data to the right data consumer at the right time in order to facilitate more timely and accurate decisions will be a key differentiator for today’s data-rich companies.”

Concluding Thoughts

According to a report by Kearney, the amount of business value that can be unleashed by big data analytics depends on the maturity level of the company.[11] The report breaks organizations down into four groups:

Laggards. “The organization makes basic use of analytics, usually limited to descriptive analyses of data, to retrospectively report on performance. They usually lack the analytics strategy and the culture needed to move forward.”

Followers. “The organization uses analytics to diagnose business problems and manage costs. This is largely made up of inferential modeling, and analytics is not used to inform strategic business decisions. There is no evidence of an analytics culture championed by top management.”

Explorers. “The organization uses analytics to optimize performance by diagnosing drivers and predicting outcomes. Although they have some analytics strategy in place, they don’t have a well-developed culture of data-driven decision-making across the organization.”

Leaders. “The organization has a clearly defined analytics strategy that aligns with the overall business strategy. The C-suite has a clear commitment to analytics and fosters a culture of data-driven decision-making. They use real-time analytics to drive innovation and create a competitive advantage across all areas of the business.”

By partnering with the right decision science firm, organizations can quickly climb from one group to the next with spectacular results. According to Kearney, if they were as effective as Leaders, Laggards could improve profitability by 81%, Followers by 55%, and Explorers by 20%. To help our clients achieve Leader status, Enterra Solutions® is advancing the field of Autonomous Decision Science™ so that the best possible decisions can be made at today’s speed of business.

Footnotes
[1] Yossi Sheffi, “What is a Company’s Most Valuable Asset? Not People,” LinkedIn, 19 December 2018.
[2] Amber Lee Dennis, “The Many Dimensions of Data Quality,” Dataversity, 27 November 2019.
[3] Editorial Team, “Data and Analytics Leaders Report Wasting Funds on Bad Data,” insideBIGDATA, 21 November 2021.
[4] Editorial Team, “Almost Half of Organizations Still Struggle with the Quality of their Data,” insideBIGDATA, 11 October 2021.
[5] Eric Siegel, “Why A.I. is a big fat lie,” Big Think, 23 January 2019.
[6] Staff, “Why data quality is key to successful ML Ops,” Great Expectations Blog, 28 September 2020.
[7] James E. Powell, “CEO Q&A: Data Quality Problems Will Still Haunt Your Analytics Future,” Transforming Data With Intelligence (TDWI), 16 April 2019.
[8] Dennis, op. cit.
[9] George Lawton, “Data quality for big data: Why it’s a must and how to improve it,” TechTarget, 27 April 2021.
[10] Bill Schmarzo, “Evolution of intelligent data pipelines,” MIT Technology Review, 6 December 2021.
[11] Ujwal Kayande, Enrico Rizzon, and Mohit Khandelwal, “The impact of analytics in 2020,” Kearney, 2021.
