The lifeblood of the Digital Age is data. Between people, businesses, and machines, record amounts of data are generated every day. So why do we need synthetic data? Fernando Lucini (@fernandolucini), a global data science and machine learning engineering lead at Accenture Applied Intelligence, explains, “Data is the essential fuel driving organizations’ advanced analytics and machine learning initiatives, but between privacy concerns and process issues, it’s not always easy for researchers to get their hands on what they need.”[1] Sigal Shaked, CTO and Co-Founder of Datomize, a business that promotes the use of synthetic data, agrees with Lucini’s assessment. She writes, “Organizations often lack enough data to run AI/ML models used to generate customer insights, optimize operations, detect fraud, and more. Data collection is often expensive and time-consuming, and privacy regulations can prohibit organizations from using the data they collect.”[2] Synthetic data can help overcome some of those challenges.
As Lucini and Shaked note, one of the primary drivers behind the use of synthetic data is privacy. Science writer Anjana Ahuja (@anjahuja) reports, “In an algorithm-driven world where data is king, one misstep can lead to a royal mess. Netflix discovered this in 2009 when it released anonymized movie reviews penned by subscribers. By cross-matching those snippets with reviews on another website, data sleuths revealed they could identify individual subscribers and what they had been watching. A gay customer sued for breach of privacy; Netflix settled. That episode is still cited today by academics seeking ways of sifting useful information from data without outing the individuals who provide it. Where anonymization failed, synthetic data might yet succeed.”[3]
What is Synthetic Data?
Most data is created by real-world activities. That’s not the case for synthetic data. Journalist Nicole Laskowski (@TT_Nicole) explains, “Synthetic data is information that’s artificially manufactured rather than generated by real-world events. Synthetic data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train machine learning models.”[4] That explanation begs the question: Doesn’t synthetic data produce fake results? According to Ahuja, natural results are obtained using synthetic data when that data draws on real-world data. She writes, “[Synthetic data] is most often created by funneling real-world data through a noise-adding algorithm to construct a new data set. The resulting data set captures the statistical features of the original information without being a giveaway replica. Its usefulness hinges on a principle known as differential privacy: that anybody mining synthetic data could make the same statistical inferences as they would from the true data — without being able to identify individual contributions.”
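To make the idea of "funneling real-world data through a noise-adding algorithm" more concrete, here is a minimal Python sketch of one simple approach: count how often each category appears in the real data, add Laplace noise to each count, and then sample new records from the noisy counts. It is an illustration only, not the method used by any author or vendor quoted here. The function name `synthesize_categories`, the example categories, and the `epsilon` value are all assumptions made for this sketch, and a production system would need a hardened noise sampler and far richer modeling than a single column.

```python
import random
from collections import Counter


def synthesize_categories(records, categories, epsilon, n_out, seed=None):
    """Sample synthetic categorical records from a noisy histogram.

    Assumptions of this sketch: each person contributes exactly one record,
    and `categories` is fixed in advance rather than derived from the data.
    Under those assumptions each count changes by at most 1 when one person
    is added or removed, so Laplace noise with scale 1/epsilon is the
    standard textbook calibration.
    """
    rng = random.Random(seed)
    counts = Counter(records)

    noisy_counts = []
    for category in categories:
        # The difference of two exponential draws with rate epsilon is a
        # Laplace draw with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        # Clamp negatives to zero so the values can be used as weights.
        noisy_counts.append(max(counts[category] + noise, 0.0))

    # Fall back to uniform weights if every noisy count was clamped to zero.
    if sum(noisy_counts) == 0:
        noisy_counts = [1.0] * len(categories)

    # New records are drawn from the noisy histogram, never copied from
    # individual rows of the real data.
    return rng.choices(categories, weights=noisy_counts, k=n_out)


# Hypothetical example: 900 "retail" and 100 "business" accounts.
real_records = ["retail"] * 900 + ["business"] * 100
account_types = ["retail", "business", "trust"]
synthetic = synthesize_categories(real_records, account_types,
                                  epsilon=1.0, n_out=1000, seed=7)
print(Counter(synthetic))
```

A smaller `epsilon` adds more noise, which strengthens the privacy protection but makes the synthetic proportions drift further from the real ones. That trade-off is the practical meaning of capturing "the statistical features of the original information without being a giveaway replica."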
In addition to answering questions about the nature of synthetic data and how it’s generated, Shaked delves deeper into synthetic data by posing and answering several additional questions. They are: Where is synthetic data used? Which problems does synthetic data solve? How is synthetic data preferable to masking or other options? And, what is the future of synthetic data?
Where is synthetic data used?
Shaked notes, “AI/ML models are becoming the norm across many industries, but they are the most pervasive in the financial community. By ingesting raw information in large data sets, understanding patterns and correlations, and drawing inferences, machine learning insights improve trading performance, streamline processes, reduce risk, and improve customer service. They are in high demand for specialized applications such as KYC (know your customer), NBO (next best offer), and risk management.” As stricter privacy laws start taking effect around the world, more economic sectors are likely to use synthetic data to extract actionable insights from the data they collect. Lucini asserts, “The technology has potential across a range of industries. In financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use synthetic data to help them identify and eliminate bias in how they treat customers — without contravening data privacy regulations. And retailers are seeing the potential for new revenue streams derived from selling synthetic data on customers’ purchasing behavior without revealing personal information.”
Which problems does synthetic data solve?
Laskowski notes, “The benefits of using synthetic data include reducing constraints when using sensitive or regulated data, tailoring the data needs to certain conditions that cannot be obtained with authentic data and generating datasets for software testing and quality assurance purposes for DevOps teams.” Shaked adds, “AI/ML models are starved for data. Linear algorithms need hundreds of examples per class, while more complex algorithms need tens of thousands to millions of data sets. … Even if the data is safe and representative of every segment of the population, it can still be unusable because it’s incomplete, irrelevant, or out of date. Many enterprises have data inconsistencies because data resides in silos in different regions, business units, and geographies.” Ahuja believes fraud detection is one problem for which synthetic data is particularly useful. She explains, “Uncovering fraud can be challenging because regulations restrict how information can be shared, even within banks. Synthetic data can help to unveil useful patterns, while masking individual incidents.”
How is synthetic data preferable to masking or other options?
As Ahuja pointed out, masking doesn’t always work. Shaked adds, “Anonymization techniques, like data generalization, pseudo-anonymization, data masking, or perturbation blur the data, making it less accurate for analysis since the data loses important characteristics. In addition, hackers can easily reconstruct the original data by using external information. On the other hand, synthetic data is safer because it can’t be reversed back to an original record. The synthetic data is constructed following the same characteristics as the original data without revealing customer identities.”
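The risk Shaked describes, that masked data can be reconstructed "by using external information," is often called a linkage attack. The Python sketch below shows the basic mechanics under simplified assumptions: a released dataset has names replaced by pseudonymous IDs but keeps a ZIP code and birth year, and an outside source lists the same two attributes next to real names. All records, field names, and the function `link_records` are hypothetical and exist only for illustration.

```python
def link_records(released, public, keys):
    """Re-identify pseudonymized records by joining on shared attributes.

    A released record is treated as re-identified only when exactly one
    public record has the same values for every attribute in `keys`.
    """
    # Index the public records by their combination of key attributes.
    index = {}
    for person in public:
        signature = tuple(person[k] for k in keys)
        index.setdefault(signature, []).append(person)

    matches = {}
    for record in released:
        signature = tuple(record[k] for k in keys)
        candidates = index.get(signature, [])
        if len(candidates) == 1:  # a unique match reveals the identity
            matches[record["pseudo_id"]] = candidates[0]["name"]
    return matches


# Hypothetical "anonymized" release: names removed, other attributes kept.
released = [
    {"pseudo_id": "u1", "zip": "10001", "birth_year": 1980, "viewed": "Title A"},
    {"pseudo_id": "u2", "zip": "10002", "birth_year": 1975, "viewed": "Title B"},
    {"pseudo_id": "u3", "zip": "10003", "birth_year": 1990, "viewed": "Title C"},
]

# Hypothetical external source, such as public profiles.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1975},
    {"name": "Carol Example", "zip": "10002", "birth_year": 1975},
]

reidentified = link_records(released, public, keys=("zip", "birth_year"))
print(reidentified)
```

In this toy data only the record with a unique combination of ZIP code and birth year is exposed; the record that matches two people stays ambiguous, and the one with no outside match stays hidden. Synthetic records are not meant to correspond to any single real person, which is the property Shaked points to when she calls synthetic data safer than masking.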
What is the future of synthetic data?
Ahuja notes, “The idea of synthetic data was first floated in the 1990s, but the rise in machine learning and computing power, coupled with stricter regulations around data management, now makes it a technology to watch.” She also observes, “A [2019] report by the UK’s Office for National Statistics said it offered a ‘safer, easier and faster way to share data between government, academia and the private sector’.” Since her company works with synthetic data, it comes as no surprise that Shaked believes the future of synthetic data is bright. She writes, “I believe that obtaining the data you need will be simple in the future. Data synthesis will be as easy as copy-paste. Generating data for innovation will be embedded in the development process, taking place behind the scenes. Obtaining data will no longer be an obstacle for AI/ML models but will be a plentiful resource, enabling them to generate powerful insights.”
Concluding Thoughts
Although the use of synthetic data looks promising, Lucini adds a note of caution. He writes, “While the benefits of synthetic data are compelling, realizing them can be difficult. Generating synthetic data is an extremely complex process, and to do it right, an organization needs to do more than just plug in an AI tool to analyze its data sets. The task requires people with specialized skills and truly advanced knowledge of AI. A company also needs very specific, sophisticated frameworks and metrics that enable it to validate that it created what it set out to create. This is where things become especially difficult.” Laskowski adds, “Drawbacks [of using synthetic data] include inconsistencies when trying to replicate the complexity found within the original dataset and the inability to replace authentic data outright, as accurate authentic data is still required to produce useful synthetic examples of the information.” If natural results (i.e., actionable insights) are to be obtained from synthetic data, the process begins with data generated by real-world events. As the algorithms that produce synthetic data improve, the results will become more meaningful and useful. In a perfect world, Ahuja notes, synthetic data would not be “fake but idealized, used to hone perfectly just and fair algorithms in a digital paradise where data is still king but ruling as benevolent monarch rather than prejudiced patriarch.” We’re not there yet.
Footnotes
[1] Fernando Lucini, “The Real Deal About Synthetic Data,” MIT Sloan Management Review, 20 October 2021.
[2] Sigal Shaked, “Six Questions About Synthetic Data,” Dataversity, 24 March 2021.
[3] Anjana Ahuja, “The promise of synthetic data,” Financial Times, 4 February 2020.
[4] Nicole Laskowski, “Synthetic Data,” TechTarget, February 2018.