Imagine if it were possible to produce infinite amounts of the world’s most valuable resource, cheaply and quickly. What dramatic economic transformations and opportunities would result?
This is a reality today. It is called synthetic data.
Synthetic data is not a new idea, but it is now approaching a critical inflection point in terms of real-world impact. It is poised to upend the entire value chain and technology stack for artificial intelligence, with immense economic implications.
Data is the lifeblood of modern artificial intelligence. Getting the right data is both the most important and the most challenging part of building powerful AI. Collecting quality data from the real world is complicated, expensive and time-consuming. This is where synthetic data comes in.
Synthetic data is an elegantly simple concept—one of those ideas that seems almost too good to be true. In a nutshell, synthetic data technology enables practitioners to simply digitally generate the data that they need, on demand, in whatever volume they require, tailored to their precise specifications.
According to a widely referenced Gartner study, 60% of all data used in the development of AI will be synthetic rather than real by 2024.
Take a moment to digest this. This is a striking prediction.
Data is the foundation of the modern economy. It is, in the words of The Economist, “the world’s most valuable resource.” And within a few short years, the majority of the data used for AI may come from a disruptive new source—one that few companies today understand or even know about.
Needless to say, massive business opportunities will result.
“We can simply say that the total addressable market of synthetic data and the total addressable market of data will converge,” said Ofir Zuk, CEO/cofounder of synthetic data startup Datagen.
The rise of synthetic data will completely transform the economics, ownership, strategic dynamics, even (geo)politics of data. It is a technology worth paying attention to.
From Autonomous Vehicles to Human Faces
While the concept of synthetic data has been around for decades, it was in the autonomous vehicle sector that the technology first found serious commercial adoption starting in the mid-2010s.
It is no surprise that synthetic data got its start in the world of autonomous vehicles. To begin with, because the AV sector has attracted more machine learning talent and investment dollars than perhaps any other commercial application of AI, it is often the catalyst for foundational innovations like synthetic data.
Synthetic data and autonomous vehicles are a particularly natural fit for one another given the challenges and importance of “edge cases” in the world of AVs. Collecting real-world driving data for every conceivable scenario an autonomous vehicle might encounter on the road is simply not possible. Given how unpredictable and unbounded the world is, it would take literally hundreds of years of real-world driving to collect all the data required to build a truly safe autonomous vehicle.
So instead, AV companies developed sophisticated simulation engines to synthetically generate the requisite volume of data and efficiently expose their AI systems to the “long tail” of driving scenarios. These simulated worlds make it possible to automatically produce thousands or millions of permutations of any imaginable driving scenario—e.g., changing the locations of other cars, adding or removing pedestrians, increasing or decreasing vehicle speeds, adjusting the weather, and so on.
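To make the idea of scenario permutations concrete, here is a minimal sketch in Python. The parameter names and values are hypothetical and chosen only for illustration; a production simulator would expose far more dimensions and would render sensor data for each scenario rather than just listing it.

```python
from itertools import product

# Hypothetical scenario parameters (assumptions, not any vendor's real schema).
SCENARIO_SPACE = {
    "other_cars": [0, 5, 20],
    "pedestrians": [0, 2, 10],
    "ego_speed_kph": [30, 60, 100],
    "weather": ["clear", "rain", "fog", "snow"],
}

def enumerate_scenarios(space):
    """Yield one dict per combination of parameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

scenarios = list(enumerate_scenarios(SCENARIO_SPACE))
print(len(scenarios))  # 3 * 3 * 3 * 4 combinations
print(scenarios[0])
```

Even this toy space of four parameters yields 108 distinct scenarios, and each additional parameter multiplies the count, which is how simulated worlds reach thousands or millions of variations.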
For years now, the leading autonomous vehicle players—Waymo, Cruise, Aurora, Zoox—have all invested heavily in synthetic data and simulation as a core part of their technology stack. In 2016, for instance, Waymo generated 2.5 billion miles of simulated driving data to train its self-driving system (compared to 3 million miles of driving data collected from the real world). By 2019, that figure had reached 10 billion simulated miles.
As Andreessen Horowitz general partner Chris Dixon put it back in 2017: “Right now, you can almost measure the sophistication of an autonomy team—a drone team, a car team—by how seriously they take simulation.”
The first batch of synthetic data startups that emerged thus targeted the autonomous vehicle end market. This included companies like Applied Intuition (most recently valued at $3.6 billion), Parallel Domain and Cognata.
But it didn’t take long for AI entrepreneurs to recognize that the synthetic data capabilities that had been developed for the autonomous vehicle industry could be generalized and applied to a host of other computer vision applications.
From robotics to physical security, from geospatial imagery to manufacturing, computer vision has found a wide range of valuable applications throughout the economy in recent years. And for all of these use cases, building AI models requires massive volumes of labeled image data.
Synthetic data represents a powerful solution here.
Using synthetic data methods, companies can acquire training data far more quickly and cheaply than the alternative—laboriously collecting that data from the real world. Imagine how much easier it is to artificially generate 100,000 images of, say, smartphones on an assembly line than it is to collect those images in the real world one by one.
And importantly, real-world image data must be labeled by hand before it can be used to train AI models—an expensive, time-consuming, error-prone process. A key advantage of synthetic data is that no manual data labeling is needed: because the images are digitally tailor-made from scratch in the first place, they automatically come with “pixel-perfect” labels.
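The reason labels come for free is easy to see in a toy example. The sketch below is an illustration under simple assumptions, not any vendor's rendering pipeline: it draws one bright rectangle on a dark background and, because the program itself chose where the object goes, it can emit the segmentation mask and bounding box at the same moment it writes the pixels. The function name and the grayscale values are hypothetical.

```python
import random

def render_synthetic_image(width, height, rng):
    """Return (image, mask, bbox) for one randomly placed rectangle.

    image: 2D list of grayscale values
    mask:  2D list with 1 where the object is, 0 elsewhere
    bbox:  (x0, y0, x1, y1), with x1 and y1 exclusive
    """
    obj_w = rng.randint(2, width // 2)
    obj_h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - obj_w)
    y0 = rng.randint(0, height - obj_h)
    x1, y1 = x0 + obj_w, y0 + obj_h

    background, foreground = 30, 220  # arbitrary illustrative intensities
    image = [[background] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    for y in range(y0, y1):
        for x in range(x0, x1):
            image[y][x] = foreground  # draw the object
            mask[y][x] = 1            # record its label in the same step
    return image, mask, (x0, y0, x1, y1)

rng = random.Random(0)
image, mask, bbox = render_synthetic_image(16, 12, rng)
labeled_pixels = sum(sum(row) for row in mask)
print(bbox, labeled_pixels)
```

The mask agrees with the image by construction, which is what "pixel-perfect" means here: there is no separate annotation step in which a human could make a mistake.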
How, exactly, does synthetic data for computer vision work? How is it possible to artificially generate such high-fidelity, photorealistic image data?
A key AI technology at the heart of synthetic data is known as generative adversarial networks, or GANs.
GANs were invented by AI pioneer Ian Goodfellow in 2014 and have been an active area of research and innovation since then. Goodfellow’s core conceptual breakthrough was to architect GANs with two separate neural networks—and then pit them against one another.
Starting with a given dataset (say, a collection of photos of human faces), the first neural network (called the “generator”) begins generating new images that, in terms of pixels, are mathematically similar to the existing images. Meanwhile, the second neural network (the “discriminator”) is fed photos without being told whether they are from the original dataset or from the generator’s output; its task is to identify which photos have been synthetically generated.
As the two networks iteratively work against one another—the generator trying to fool the discriminator, the discriminator trying to suss out the generator’s creations—they hone one another’s capabilities. Eventually the discriminator’s classification success rate falls to 50%, no better than random guessing, meaning that the synthetically generated photos have become indistinguishable from the originals.
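For readers who want the formal version, this contest is usually written as a minimax objective. The notation below follows the standard formulation from the GAN literature rather than anything stated in this article: G is the generator, D the discriminator, p_data the distribution of real examples, p_z the random noise fed to the generator, and p_g the distribution of generated samples.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}
\quad\Longrightarrow\quad
D^*(x) = \tfrac{1}{2} \ \text{ when } p_g = p_{\text{data}}
```

The second line is where the 50% figure comes from: once the generator's distribution matches the real one, even the best possible discriminator can only output one half for every input, which is a coin flip.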
In 2016, AI great Yann LeCun called GANs “the most interesting idea in the last ten years in machine learning.”
Two other important research advances driving recent momentum in visual synthetic data are diffusion models and neural radiance fields (NeRF).
Originally inspired by concepts from thermodynamics, diffusion models learn by corrupting their training data with incrementally added noise and then figuring out how to reverse this noising process to recover the original image. Once trained, diffusion models can then apply these denoising methods to synthesize novel “clean” data from random input.
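The noising half of this process is simple enough to sketch directly. The code below is a minimal illustration, assuming the closed-form forward process and linear noise schedule commonly used in the diffusion literature; the schedule values and function names are assumptions, not details given in this article. The learned denoising network, which is the hard part, is deliberately omitted.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed default values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar_t, rng):
    """Sample a noised x_t from clean x_0; also return the noise used."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return xt, noise

rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [1.0, -1.0, 0.5, 0.0]          # stand-in for image pixels
x_early, _ = add_noise(x0, alpha_bars[0], rng)    # almost unchanged
x_late, _ = add_noise(x0, alpha_bars[-1], rng)    # almost pure noise
print(x_early)
print(x_late)
```

During training, a model is shown the noised sample and asked to predict the noise that was added. Generation then runs the process in reverse, starting from pure random noise and removing a little of it at each step.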
Diffusion models have seen a surge in popularity over the past year, including serving as the technological backbone of DALL-E 2, OpenAI’s much-discussed new text-to-image model. With some meaningful advantages over GANs, expect to see diffusion models play an increasingly prominent role in the world of generative AI moving forward.
NeRF, meanwhile, is a powerful new method to quickly and accurately turn two-dimensional images into complex three-dimensional scenes, which can then be manipulated and navigated to produce diverse, high-fidelity synthetic data.
Two leading startups offering synthetic data solutions for computer vision are Datagen (which recently announced a $50 million Series B) and Synthesis AI (which recently announced a $17 million Series A). Both companies specialize in human data, in particular human faces; their platforms enable users to programmatically customize facial datasets across dimensions including head poses, facial expressions, ethnicities, gaze directions and hair styles.
AI.Reverie, an early mover in this category, was scooped up last year by Facebook—a sign of big tech’s growing interest in synthetic data. Earlier-stage startups include Rendered.ai, Bifrost and Mirage.
Coming full circle, while autonomous vehicles provided the original impetus for the growth of synthetic data several years ago, to this day the autonomous vehicle sector continues to push forward the state of the art in the field.
One of the most intriguing new startup entrants in the autonomous vehicle category, Waabi, has taken simulation technology to the next level. Founded by AI luminary Raquel Urtasun, who previously ran Uber’s AV research efforts, Waabi came out of stealth last year with a star-studded team and over $80 million in funding.
Waabi’s ambition is to leapfrog the more established AV players by harnessing next-generation AI to build a new type of autonomy stack that avoids the shortcomings of more legacy approaches. At the center of that stack is synthetic data.
In a break from the rest of the AV field, Waabi does not invest heavily in deploying cars on real-world roads to collect driving data. Instead, audaciously, Waabi is seeking to train its autonomous system primarily via virtual simulation. In February the company publicly debuted its cutting-edge simulation platform, named Waabi World.
“At Waabi, we go one step further in generating synthetic data,” said Urtasun. “Not only can we simulate the vehicle’s sensors with unprecedented fidelity in near real-time, but we do so in a closed-loop manner such that the environment reacts to us and we react to it. This is very important for robotics systems such as self-driving vehicles as we not only need to learn to perceive the world but also to act safely on it.”
The Primacy of Language
While synthetic data will be a game-changer for computer vision, the technology will unleash even more transformation and opportunity in another area: language.
The vast potential for text-based synthetic data reflects the basic reality that language is ubiquitous in human affairs; it is at the core of essentially every important business activity. Dramatic recent advances in natural language processing (NLP) are opening up virtually unbounded opportunities for value creation across the economy, as previously explored in this column. Synthetic data has a key role to play here.
A couple of concrete examples will help illustrate the possibilities.
Anthem, one of the largest health insurance companies in the world, uses its troves of patient medical records and claims data to power AI applications like automated fraud detection and personalized patient care.
Last month, Anthem announced that it is partnering with Google Cloud to generate massive volumes of synthetic text data in order to improve and scale these AI use cases. This synthetic data corpus will include, for instance, artificially generated medical histories, healthcare claims and related medical data that preserve the structure and “signal” of real patient data.
Among other benefits, synthetic data directly addresses the data privacy concerns that for years have held back the deployment of AI in healthcare. Training AI models on real patient data presents thorny privacy issues, but those issues largely recede when the data is synthetic.
“More and more…synthetic data is going to overtake and be the way people do AI in the future,” said Anthem’s Chief Information Officer Anil Bhatt.
Another recent example hints at even more transformative possibilities.
Late last year Illumina, the world’s leading genetic sequencing company, announced that it was partnering with Bay Area startup Gretel.ai to create synthetic genomic datasets.
Genomic data is one of the most complex, multi-dimensional, information-rich types of data in the world. Quite literally, it contains the secrets of life—the instructions for how to build an organism. Just over 3 billion base-pairs in length, every human’s unique DNA sequence defines much about who they are, from their height to their eye color to their risk of heart disease or substance abuse. (While not natural language, genomic sequences are textual data; every individual’s DNA sequence can be encoded via a simple 4-letter “alphabet”.)
Analyzing the human genome with cutting-edge AI is enabling researchers to develop a deeper understanding of disease, health, and how life itself works. But this research has been bottlenecked by the limited availability of genomic data. Stringent privacy regulations and data-sharing restrictions surrounding human genetic data impede researchers’ ability to work with genomic datasets at scale.
Synthetic data offers a potentially revolutionary solution: it can replicate the characteristics and signal of real genomic datasets while sidestepping these data privacy concerns, since the data is artificially generated and does not correspond to any particular individuals in the real world.
These two examples are just the tip of the iceberg when it comes to the wide range of language-based opportunities unlocked by synthetic data.
A handful of promising startups have emerged in recent years to pursue these opportunities.
The most prominent startup in this category is Gretel.ai, mentioned above, which has raised over $65 million to date from Greylock and others.
Gretel has seen strong market demand for its technology from blue-chip customers across industries, from healthcare to financial services to gaming to e-commerce.
“It’s amazing to see customers start to adopt synthetic data at such a rapid pace,” said Gretel.ai CEO/cofounder Ali Golshan. “The awareness and appetite for synthetic data in the enterprise is growing incredibly quickly, even compared to 12 or 18 months ago. Our customers continue to surprise us with innovative new ways to apply our technology.”
Another intriguing early-stage player in this space is DataCebo. DataCebo was founded by a group of MIT faculty and their students who had previously created Synthetic Data Vault (SDV), the largest open-source ecosystem of models, data, benchmarks, and other tools for synthetic data. DataCebo and Synthetic Data Vault focus on structured (i.e., tabular or relational) text datasets—that is, text data that is organized in tables or databases.
“The most important dynamic to understand with this technology is the tradeoff between fidelity and privacy,” said DataCebo cofounder Kalyan Veeramachaneni. “The core of what the DataCebo platform offers is a finely-tuned knob that enables customers to ramp up the privacy guarantees around the synthetic data that they are generating, but at the cost of fidelity, or vice versa.”
Tonic.ai is another buzzy startup offering tools for synthetically generated textual data. Tonic’s primary use case is synthetic data for software testing and development, rather than for building machine learning models.
One last startup worth noting is Syntegra, which focuses on synthetic data specifically for healthcare, with use cases spanning healthcare providers, health insurers and pharmaceutical companies. Synthetic data’s long-term impact may be greater in healthcare than in any other field, given the market size and the thorny privacy challenges of real-world patient data.
It is worth noting that, for the most part, the companies and examples discussed here use classical statistical methods or traditional machine learning to generate synthetic data, with a focus on structured text. But over the past few years, the world of language AI has been revolutionized by the introduction of the transformer architecture and the emerging paradigm of massive “foundation models” like OpenAI’s GPT-3.
An opportunity exists to build next-generation synthetic data technology by harnessing large language models (LLMs) to produce unstructured text (or multimodal) data corpora of previously unimaginable realism, originality, sophistication and diversity.
“Recent advances in large language models have brought us machine-generated data that is often indistinguishable from human-written text,” said Dani Yogatama, a senior staff research scientist at DeepMind who focuses on generative language models. “This new type of synthetic data has been successfully applied to build a wide range of AI products, from simple text classifiers to question-answering systems to machine translation engines to conversational agents. Democratizing this technology is going to have a transformative impact on how we develop production AI models.”
The Sim-to-Real Gap
Taking a step back, the fundamental conceptual challenge in this field is that synthetically generated data must be similar enough to real data to be useful for whatever purpose the data is serving. This is the first question that most people have when they learn about synthetic data: can it really be accurate enough to substitute for real data?
A synthetic dataset’s degree of similarity to real data is referred to as its fidelity. It is important for us to ask: how high-fidelity does synthetic data need to be in order to be useful? Have we gotten there yet? How can we measure and quantify fidelity?
Recent advances in AI have dramatically boosted the fidelity of synthetic data. For a wide range of applications across both computer vision and natural language processing, today’s synthetic data technology is advanced enough that it can be deployed in production settings. But there is more work to do here.
In computer vision, the “sim-to-real gap”, as it is colloquially known, is narrowing quickly thanks to ongoing deep learning innovations like neural radiance fields (NeRF). The release of developer platforms like Nvidia’s Omniverse, a cutting-edge 3D graphics simulation platform, plays an important role here by making state-of-the-art synthetic data capabilities widely accessible to developers.
The most direct way to measure the efficacy of a given synthetic dataset is simply to swap it in for real data and see how an AI model performs. For instance, computer vision researchers might train one classification model on synthetic data, train a second classification model on real data, deploy both models on the same previously unseen test dataset, and compare the two models’ performance.
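That swap-in experiment can be sketched in a few lines. Everything here is a toy under stated assumptions: the "real" data is two Gaussian clusters, the "synthetic" data comes from a deliberately imperfect imitation of them (slightly shifted centers and a wider spread), and the model is a nearest-centroid classifier rather than a neural network. All names and numbers are hypothetical.

```python
import random

def sample_blobs(n_per_class, centers, spread, rng):
    """Draw labeled 2D points from one Gaussian blob per class."""
    data = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            point = (rng.gauss(cx, spread), rng.gauss(cy, spread))
            data.append((point, label))
    return data

def fit_centroids(data):
    """Nearest-centroid 'model': the mean point of each class."""
    sums = {}
    for (x, y), label in data:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    x, y = point
    return min(
        centroids,
        key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
    )

def accuracy(centroids, data):
    correct = sum(predict(centroids, p) == label for p, label in data)
    return correct / len(data)

rng = random.Random(42)
real_centers = [(0.0, 0.0), (3.0, 3.0)]
synthetic_centers = [(0.3, -0.2), (2.8, 3.1)]  # assumed imperfect generator

real_train = sample_blobs(500, real_centers, 1.0, rng)
synthetic_train = sample_blobs(500, synthetic_centers, 1.1, rng)
real_test = sample_blobs(500, real_centers, 1.0, rng)  # held-out real data

acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic_train), real_test)
print(f"trained on real:      {acc_real:.3f}")
print(f"trained on synthetic: {acc_synth:.3f}")
```

The key design choice is that both models are scored on the same held-out real data. The gap between the two accuracies is a practical, task-specific measure of how much the synthetic data costs, if anything.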
In practice, the use of synthetic data in computer vision need not be, and generally is not, this binary. Rather than using only real data or only synthetic data, researchers can drive significant performance improvements by combining real data and synthetic data in their training datasets, enabling the AI to learn from both and boosting the overall size of the training corpus.
It is also worth noting that synthetic datasets sometimes actually outperform real-world data. How is this possible?
The fact that data was collected from the real world does not guarantee that it is 100% accurate and high-quality. For one thing, real-world image data generally must be labeled by hand by a human before it can be used to train an AI model; this data labeling can be inaccurate or incomplete, degrading the AI’s performance. Synthetic data, on the other hand, automatically comes with perfect data labels. Moreover, synthetic datasets can be larger and more diverse than their real-world counterparts (that’s the whole point, after all), which can translate into superior AI performance.
For text data, industry practitioners have begun to develop metrics to quantify and benchmark the fidelity of synthetic data.
Gretel.ai, for instance, grades its synthetic datasets on three different statistically rigorous metrics—Field Correlation Stability, Deep Structure Stability, and Field Distribution Stability—which it aggregates to produce an overall Synthetic Data Quality Score between 0 and 100. Put simply, this overall figure represents “a confidence score as to whether scientific conclusions drawn from the synthetic dataset would be the same if one were to have used the original dataset instead.”
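The formulas behind these metrics are not given here, so the following is only a rough sketch of what a distribution-comparison score could look like. It compares how often each category appears in a real column versus a synthetic one using Jensen-Shannon divergence, then maps the result onto a 0 to 100 scale. Both the choice of divergence and the mapping are assumptions made for illustration and should not be read as Gretel's method.

```python
import math
from collections import Counter

def category_distribution(values, categories):
    """Relative frequency of each category, in a fixed order."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def field_score(real_values, synthetic_values):
    """Hypothetical 0-100 score; higher means closer distributions."""
    categories = sorted(set(real_values) | set(synthetic_values))
    p = category_distribution(real_values, categories)
    q = category_distribution(synthetic_values, categories)
    return 100.0 * (1.0 - js_divergence(p, q))

real = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
close = ["A"] * 48 + ["B"] * 33 + ["C"] * 19   # faithful synthetic column
far = ["A"] * 10 + ["B"] * 10 + ["C"] * 80     # poor synthetic column

print(round(field_score(real, close), 1))
print(round(field_score(real, far), 1))
```

A score like this covers only one column at a time. Capturing relationships between columns, which is what the correlation and deep-structure metrics are aimed at, requires additional measures.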
Gretel’s synthetic data generally performs quite well: AI models trained on it typically come within a few percentage points of the accuracy of models trained on real-world data, and are sometimes even more accurate.
Fellow synthetic data startup Syntegra has likewise proposed thoughtful analytical frameworks for evaluating synthetic data fidelity in the healthcare context.
For text data, a basic tradeoff exists between fidelity and privacy: as the synthetic data is made increasingly similar to the real-world data on which it is based, the risk correspondingly increases that the original real-world data can be reconstructed from the synthetic data. If that original real-world data is sensitive—medical records or financial transactions, say—this is a problem. A core challenge for synthetic text data, therefore, is not just to maximize fidelity in a vacuum, but rather to maximize fidelity while preserving privacy.
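One way to see this tradeoff in miniature is through the Laplace mechanism from differential privacy, used here purely as an illustration; it is not presented as the approach of any company mentioned in this article. The sketch releases a noisy average of a sensitive column, and the privacy parameter epsilon plays the role of the knob: a smaller epsilon means stronger privacy and a less accurate result. The data, bounds and function names are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Mean of bounded values with Laplace noise calibrated to epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Changing one record moves the mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(values, lower, upper, epsilon, trials, rng):
    """Average distance between the private mean and the true mean."""
    true_mean = sum(values) / len(values)
    errors = [
        abs(private_mean(values, lower, upper, epsilon, rng) - true_mean)
        for _ in range(trials)
    ]
    return sum(errors) / trials

rng = random.Random(7)
ages = [rng.randint(18, 90) for _ in range(100)]  # stand-in sensitive column

strict_error = mean_abs_error(ages, 0, 100, epsilon=0.1, trials=200, rng=rng)
loose_error = mean_abs_error(ages, 0, 100, epsilon=10.0, trials=200, rng=rng)
print(f"epsilon=0.1  (more private): error ~ {strict_error:.2f}")
print(f"epsilon=10.0 (less private): error ~ {loose_error:.2f}")
```

A generator built on statistics like this one inherits the same dial: tighten the privacy setting and the synthetic data drifts further from the original, loosen it and the data becomes more faithful but more revealing.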
The Road Ahead
The graph below speaks volumes. Synthetic data will completely overshadow real data in AI models by 2030, according to Gartner.
As synthetic data becomes increasingly pervasive in the months and years ahead, it will have a disruptive impact across industries. It will transform the economics of data.
By making quality training data vastly more accessible and affordable, synthetic data will undercut the strength of proprietary data assets as a durable competitive advantage.
Historically, no matter the industry, the most important first question to ask in order to understand the strategic dynamics and opportunities for AI has been: who has the data? One of the main reasons that tech giants like Google, Facebook and Amazon have achieved such market dominance in recent years is their unrivaled volumes of customer data.
Synthetic data will change this. By democratizing access to data at scale, it will help level the playing field, enabling smaller upstarts to compete with more established players that they otherwise might have had no chance of challenging.
To return to the example of autonomous vehicles: Google (Waymo) has invested billions of dollars and over a decade of effort to collect many millions of miles of real-world driving data. It is unlikely that any competitor will be able to catch up to them on this front. But if production-grade self-driving systems can be built almost entirely with synthetic training data, then Google’s formidable data advantage fades in relevance, and young startups like Waabi have a legitimate opportunity to compete.
The net effect of the rise of synthetic data will be to empower a whole new generation of AI upstarts and unleash a wave of AI innovation by lowering the data barriers to building AI-first products.
An interesting related impact of the proliferation of synthetic data will be to diminish the need for and the importance of data labeling, since synthetically generated data does not need to be labeled by hand.
Data labeling has always been a kludgy, inelegant part of the modern machine learning pipeline. Intuitively, truly intelligent agents (like human beings) should not need to have labels manually attached to every object they observe in order to recognize them.
But because labeled data is necessary under today’s AI paradigm, data labeling has itself become a massive industry; many companies spend tens or hundreds of millions of dollars each year just to get their data labeled. Scale AI, the leading provider of data labeling services, was valued at $7.3 billion last year amid eye-popping revenue growth. An entire ecosystem of smaller data labeling startups has likewise emerged.
Synthetic data will threaten these companies’ livelihoods. Seeming to recognize this, Scale AI is now aiming to get into the synthetic data game itself, launching a synthetic data platform earlier this year called Scale Synthetic. (Clay Christensen adherents might recognize elements of his famous “innovator’s dilemma” here.)
Synthetic data technology will reshape the world of AI in the years ahead, scrambling competitive landscapes and redefining technology stacks. It will turbocharge the spread of AI across society by democratizing access to data. It will serve as a key catalyst for our AI-driven future. Data-savvy individuals, teams and organizations should take heed.
Note: The author is a Partner at Radical Ventures, which is an investor in Waabi.