Is the Future of Big Data Synthetic?

Stephen DeAngelis

July 5, 2022

We live in a Digital Age in which data has been called the lifeblood of an organization and its most valuable asset. For example, Ivan Kot, the Customer Acquisition Director and IT Solution Manager at Itransition, writes, “Data has established itself as one of the most important elements in both business and science.”[1] Similarly, Yossi Sheffi (@YossiSheffi), Director of the MIT Center for Transportation & Logistics, asserts, “The well-worn adage that a company’s most valuable asset is its people needs an update. Today, it’s not people but data that tops the asset value list for companies.”[2] In recent years, however, the collection, storage, and analysis of data have bumped hard against concerns over privacy and a rash of cybersecurity breaches. With the threat of significant financial penalties hanging over the heads of companies whose breached databases expose personal client data, many of them are now examining the possibility of using synthetic data.

What is Synthetic Data?

Kot explains, “Synthetic data is exactly what it sounds like. It’s data that has the same mathematical and statistical properties as authentic data but doesn’t put user privacy at stake. In other words, such data can be used to accurately train machine learning models and make statistics-based conclusions without revealing personally identifiable information.” Where does this synthetic data come from? Ironically, notes Kot, “Synthetic data is generated by an AI algorithm, which has been trained on a real data set.” Journalist Jonathan Vanian (@JonathanVanian) reports, “Using A.I. to create synthetic data for improving machine-learning models is hot.”[3]

Vanian also points out that protecting sensitive, personal data isn’t the only reason organizations use synthetic data. He reports that equipment manufacturer John Deere creates “synthetic images of plants in different weather conditions to improve its computer-vision systems that will eventually be used on tractors to spot weeds and spray weed killer on them, among other uses. Doing so saves the company the time and expense of manually photographing thousands of plants in every kind of lighting and surroundings, the company says. Without synthetic data its technology could more likely confuse weeds for crops, or vice versa.”

Mark van Rijmenam (@VanRijmenam), founder of Datafloq, explains that synthetic images can be created with a technique called Generative Adversarial Networks (GANs). He writes, “GANs have shown great promise by yielding some impressive results in generating new data from existing datasets. … [This] machine learning technique enables computer programs to create and modify data in many forms. They can be used to generate realistic-looking faces, for example, or to alter the audio of a recorded speech.”[4] As he implies, synthetic data isn’t generated out of thin air. He explains, “To get good results with most AI systems, there is a need for a vast dataset that the AI model can learn from.” That holds true for any technique used to generate synthetic data.

According to tech writer Maria Korolov (@MariaKorolov), generating synthetic data isn’t always a complicated matter. She writes, “Sometimes, generating synthetic data can be very simple. A list of names, for example, can be generated by combining a randomly chosen first name from a list of first names and a last name from a list of last names. Zip codes can be randomly picked from a list of Zip codes. That might be enough for some applications. For other purposes, however, the list may need to be balanced so that, say, synthetic spending data correlates to the usual spending patterns in those Zip Codes.”[5] Gil Elbaz (@RealGilbaz), CTO and Co-founder of Datagen, predicts the 2020s will be characterized as the “Decade of Synthetic Data.”[6] He acknowledges, however, “It’s still a very young field. In fact, over 71% of the 21 companies in G2’s Synthetic Data software category were founded in just the past 4 years.”
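
To make Korolov's simple case concrete, here is a minimal Python sketch of the approach she describes: names are built from randomly chosen first and last names, Zip codes are picked from a list, and spending is drawn so that it tracks a per-Zip average. Everything specific in it is an assumption for illustration only: the name and Zip code lists, the average spending figures, the fixed spread of 50, and the function names `make_record` and `make_records` are all hypothetical and do not come from any of the sources quoted here.

```python
import random

# Hypothetical sample lists; a real project would load much larger ones.
FIRST_NAMES = ["Alice", "Bob", "Carla", "Deepak"]
LAST_NAMES = ["Nguyen", "Smith", "Okafor", "Garcia"]
ZIP_CODES = ["10001", "60614", "94110"]

# Assumed (made-up) average monthly spending per Zip code, used so that
# synthetic spending correlates with the Zip code it is paired with.
MEAN_SPEND_BY_ZIP = {"10001": 420.0, "60614": 310.0, "94110": 380.0}


def make_record(rng):
    """Build one synthetic customer record from the lists above."""
    zip_code = rng.choice(ZIP_CODES)
    # Draw spending around the Zip code's assumed average; clamp at zero
    # so a rare low draw never produces negative spending.
    spend = max(0.0, rng.gauss(MEAN_SPEND_BY_ZIP[zip_code], 50.0))
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "zip": zip_code,
        "spend": round(spend, 2),
    }


def make_records(n, seed=0):
    """Return n synthetic records; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [make_record(rng) for _ in range(n)]


if __name__ == "__main__":
    for record in make_records(5, seed=42):
        print(record)
```

Seeding the generator is a deliberate choice: it lets the same synthetic data set be regenerated for repeatable tests. Note that a sketch like this only mimics one relationship (spending by Zip code); it does not reproduce the broader statistical properties of a real data set the way a trained generative model is meant to.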

Pros and Cons of Using Synthetic Data

Pros

Pundits almost unanimously agree that to compete in today’s business environment companies must be data-driven. That places data at the very heart of every business. At the same time, this reliance on data generally requires the capabilities embedded in cognitive technology solutions (aka artificial intelligence). Korolov notes, “Artificial data has many uses in enterprise AI strategies.” Synthetic data can be used in the following ways:

• For training models when real-world data is lacking.
• To fill gaps in training data.
• To balance out a data set.
• To speed up model development.
• To simulate the future or alternate futures.
• To simulate “black swan” events.
• To simulate the metaverse.
• To generate marketing imagery.
• For software testing.
• To create digital twins.
• To protect personal information in medical and financial data.
• For sales and marketing.
• To test AI systems for bias.

Journalist Isaac Sacolick (@nyike) concludes, “If you are developing algorithms or applications with high-dimensionality data inputs and critical quality and safety factors, then synthetic data generation provides a mechanism for cost-effectively creating large data sets.”[7] Data architecture consultant Hanz Qureshi (@hanzalaqureshi_) sums up the benefits of synthetic data this way, “If data is synthetic, it means: It does not need to be compliant with GDPR and other regulations; it can be made in abundance for a variety of conditions and drivers; data can be created for unencountered conditions; data can be well-cataloged; [and] data creation is highly cost-effective.”[8]

Cons

Although the benefits of using synthetic data are numerous, challenges remain. Elbaz explains, “Although the way is already being paved for synthetic data, it would be misleading to suggest it doesn’t have obstacles of its own. Chief among those barriers are the three key resources — time, money, and talent. Creating a synthetic data organization is a major undertaking, requiring not only a large amount of capital, but also a large, multidisciplinary team with expertise in areas that have just barely moved beyond the experimental confines of academia and solid integrations between them.” Kot adds, “While synthetic data provides many benefits, using it correctly is rather complex. It’s especially difficult to ensure that it is as reliable to use as real-world data. Currently, when it comes to complicated data sets with a large number of different variables, it’s quite possible to make a synthetic data set that doesn’t properly represent real-world conditions. This can lead to false insight generation and, consequentially, to erroneous decision-making.”

Even more concerning, Kot adds, is the possibility that sensitive information isn’t protected when using synthetic data. He explains, “It can still be possible to link synthetic data to real people, especially if replication hasn’t been properly done. This can be a lavish opportunity for wrongdoers, as synthetic data sets will most likely relate to highly sensitive personal information. Currently, it’s rather unlikely for such a scenario to unfold, but in a future where synthetic data is the norm, it can become a real concern.”

Concluding Thoughts

It may be a bit premature to declare this the Decade of Synthetic Data. Nevertheless, Elbaz makes a great point when he notes that using the right data is often more important than selecting the best model. He explains, “Rather than focus hundreds of hours on fine-tuning one’s AI algorithm or model, researchers have realized that they can boost AI performance much more effectively by focusing on improving their training data. In a relatively short period of time, this complete reversal has gained widespread acceptance across the research and enterprise communities. And it isn’t because scientists love to be wrong. It’s because the data on data-centrism is undeniable — it works. And it works far better than modeling in practically every application imaginable.” As synthetic data improves, better results will follow.

Footnotes
[1] Ivan Kot, “The Pros and Cons of Synthetic Data,” Dataversity, 27 December 2021.
[2] Yossi Sheffi, “What is a Company’s Most Valuable Asset? Not People,” LinkedIn, 19 December 2018.
[3] Jonathan Vanian, “Why better A.I. may depend on fake data,” Fortune, 4 January 2022.
[4] Mark van Rijmenam, “Using GANs with Limited Data: How Synthetic Content Generation with AI Can Impact your Business,” Datafloq, 19 November 2021.
[5] Maria Korolov, “What is synthetic data? Generated data to help your AI strategy,” CIO, 15 March 2022.
[6] Gil Elbaz, “The Decade of Synthetic Data is Underway,” insideBIGDATA, 7 March 2022.
[7] Isaac Sacolick, “Use synthetic data for continuous testing and machine learning,” InfoWorld, 7 February 2022.
[8] Hanz Qureshi, “Why Synthetic Data Still Has a Data Quality Problem,” Dataversity, 4 February 2022.
