Justia - February 9, 2026

Peter Lee - Better than the Real Thing? The Promises and Perils of Synthetic Data - Feb 9, 2026

Verdict - Legal Analysis and Commentary from Justia

Better than the Real Thing? The Promises and Perils of Synthetic Data

Peter Lee

Feb 9, 2026

The rapid construction of massive data centers by Meta, OpenAI, Google, Microsoft, and Amazon reflects a fundamental feature of the AI economy: it runs on data. The Trump administration’s aggressive move to establish global AI dominance has underscored the importance of understanding the drivers of more powerful AI models—including massive amounts of data. While the federal government has sought to accelerate AI development, some states have taken a more cautious approach, as illustrated by a California law that went into effect last month compelling the disclosure of data that powers generative AI models. Whether as a lever for advancing or regulating this important technology, the data that fuels AI is a matter of significant public and private concern.

Data is critical to training machine learning (ML) models, which discern patterns in enormous datasets to generate content, make decisions, and render predictions. Huge amounts of so-called training data are necessary to teach ML models to perform an ever-increasing range of functions, such as generating images, identifying tumors in MRI scans, and driving autonomous vehicles. In general, the more data used to train a model, the better it performs.

Historically, developers have relied on real-world data to train ML models. However, the need to collect vast amounts of training data leads to several technical and legal challenges. Increasingly, AI firms are turning to synthetic data to train ML models, a development that offers significant promises as well as perils.

The Challenges of Real-World Training Data

Acquiring enough data to train ML models is difficult and expensive. The amount of data needed to train frontier models is staggering, often requiring trillions of tokens (words or parts of words). While the world produces a seemingly limitless amount of data, huge swaths of data are becoming inaccessible for training ML models. Historically, the internet has been a primary source of training data, especially for large language models (LLMs) that can mimic human expression. However, websites have begun blocking web crawlers used by OpenAI, Anthropic, Google, and others from scraping enormous amounts of data. Quite simply, much of the data used to train ML models “is drying up.”

To train ML models effectively, both the quantity and quality of data are important. However, reality is messy. Even if an AI firm can amass an enormous amount of real-world training data, such datasets are often incomplete or filled with errors. Furthermore, for at least some kinds of ML, training data must be labeled so that, for instance, a computer vision model can associate the image of a horse with the word “horse.” Consequently, AI firms spend significant resources to “clean” and prepare real-world data prior to training ML models.

Beyond the technical challenges of acquiring huge amounts of high-quality data, relying on real-world training data gives rise to several potential legal problems.

First, massive data collection by AI companies threatens privacy. As Lina Khan, former chair of the Federal Trade Commission, stated, AI models can be “trained on private emails, chats and sensitive data, ultimately exposing personal details and violating user privacy.” Massive web crawls to obtain training data can sweep up personal information obtained without consent.

Second, utilizing huge amounts of real-world data can introduce bias in ML models. As Khan further observed, “Because they may be fed information riddled with errors and bias, [AI] technologies risk automating discrimination—unfairly locking out people from jobs, housing or key services.” For instance, Amazon’s AI-based system for screening CVs from job applicants developed an anti-female bias because its dataset was overwhelmingly filled with CVs from male job applicants. When prompted to generate an image of a judge, only 3% of results from Stable Diffusion’s AI model depicted a female judge, when in fact 34% of U.S. judges are female. Such biased results arose from the overwhelming proportion of pictures of male judges in Stable Diffusion’s training data.

Finally, training ML models on real-world data may infringe copyrights on a massive scale. As mentioned, the internet is a primary source of training data for many AI models. Given the extremely low threshold for copyright protection, virtually all content on the web—from blog posts to New York Times articles—is copyrighted. AI firms that collect this data to train ML models may be infringing on billions of copyrights. Copyright owners have brought numerous lawsuits against AI firms, with potentially industry-altering liability hanging in the balance. While some courts have ruled that training AI models on copyrighted content constitutes fair use (at least in some contexts), litigation is ongoing, and significant uncertainty remains.

Synthetic Data: Artificial Data Training Artificial Intelligence

Given the technical and legal challenges of real-world data, AI developers are increasingly turning to a resource that sounds like (computer) science fiction: synthetic data. In general, synthetic data is artificially created text, numerical values, images, sounds, and other content used to train ML models.

While all synthetic data is in some ways based on real-world data, different kinds of synthetic data differ with respect to how “synthetic” they are. At one end of the spectrum lies data augmentation, where data scientists modify or tweak existing data. For instance, a data scientist may take a real-world photo of a kitten extending its right paw and synthesize a mirror image looking like the kitten is extending its left paw. Such synthetic data is highly proximate to real-world data.
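The mirror-image example can be made concrete with a minimal sketch. It assumes a photo is stored as a grid of pixel values, one list per row; the function name `mirror_image` and the tiny 2x3 "photo" are hypothetical and chosen only for illustration.

```python
def mirror_image(pixels):
    """Return a horizontally flipped copy of a row-major pixel grid."""
    return [list(reversed(row)) for row in pixels]


# Hypothetical 2x3 "photo": the 1 marks a paw extended on the right side.
photo = [
    [0, 0, 1],
    [0, 0, 0],
]

# The augmented training set holds the real photo plus its synthetic mirror.
augmented = [photo, mirror_image(photo)]
```

The flipped copy contains no new information about the world, which is why augmentation sits at the "least synthetic" end of the spectrum.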

Farther along the spectrum, model-based synthetic data arises from an ML model that has learned deep patterns in a dataset and can generate new outputs based on those patterns. One kind of model-based synthetic data is reflected in generative adversarial networks, the technology underlying deepfakes. In this fashion, AI models can generate a synthetic image of, say, Tom Cruise, which could then serve as training data for another model.

At the far end of the spectrum, simulators create entirely new virtual worlds and new universes of synthetic data. For example, when an autonomous vehicle company trains its model in a driving simulator, every virtual turn, acceleration, stop, and collision generates synthetic data to further refine the model.

While data scientists could train a model entirely on synthetic data, more often they use synthetic data to augment or extend an existing real-world dataset. For example, autonomous vehicle companies combine millions of miles driven on actual roads with billions of miles driven in simulators to train their ML navigation models.

It is also important to distinguish between what could be called “conscientiously designed” and “inadvertent” synthetic data. The former, as the name suggests, refers to synthetic data that data scientists deliberately design to train ML models. As discussed further below, such data may fill the gaps of real-world datasets to improve the training of a model. “Inadvertent” synthetic data, on the other hand, represents artificially generated text, images, numerical values, and other content that was not designed to train ML models but inadvertently ends up doing so. For instance, as generative AI models create more and more content on the web—including AI slop—that synthetic content ends up serving as the training data for the next generation of ML models.

Whatever its provenance, synthetic data will play a key role in training the ML models of tomorrow. Based on past projections, a majority of the data used to train today’s ML models is likely already synthetic. Sam Altman, CEO of OpenAI, once stated that he is “pretty confident that soon all data will be synthetic data.”

Is Synthetic Data Better than Reality?

The great promise of synthetic data is that it can help overcome many of the technical and legal limitations of real-world data used to train ML models.

From a technical standpoint, synthetic data offers the prospect of virtually unlimited, high-quality, and labeled training data. Among other benefits, synthetic datasets can include adequate instances of “edge cases” (statistically improbable but possible events) to enhance an algorithm’s learning. For instance, a driving simulator can include a (synthetic) truck with a piano falling out of its back to train an ML model that navigates autonomous vehicles. Such an edge case may never arise in millions of miles of real-world driving, but it is useful for the model to be familiar with such a contingency. Synthetic data can help drive the development of more powerful ML models, thus leading to more elaborate generated images, more accurate diagnoses of disease, and safer autonomous vehicles.

Synthetic data can also mitigate several of the legal difficulties of real-world training data. First, synthetic data can allay privacy concerns by using fabricated names, demographic characteristics, and other content—rather than actual personal information—to train models. For instance, startup Syntegra and the National Institutes of Health are developing a synthetic dataset of COVID-19 patient records that shares the statistical properties of real-world data but does not reveal any personal information. Researchers could use this synthetic data to train ML models, thus yielding important insights about the disease and its spread.

Additionally, synthetic data can mitigate bias in ML models. If a real-world dataset features statistically known deviations from ground-truth reality, augmenting that dataset with synthetic data can help reduce that bias. Recall that when prompted to produce an image of a judge, Stable Diffusion generated an image of a female judge only 3% of the time, even though 34% of U.S. judges are female. Augmenting that real-world dataset with synthetically generated images of female judges could help reduce this bias. In some ways, synthetic data may be more representative of reality than real-world data.
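A rough back-of-the-envelope sketch shows what such rebalancing involves. It assumes, purely for illustration, a hypothetical training set of 1,000 judge images in which 3% depict women; the 3% figure above describes the model's outputs, not a documented share of its training data, and the function name is likewise hypothetical. If f of N images depict women and s synthetic images of women are added, the new share is (f + s) / (N + s), so reaching a target share t requires s = (tN - f) / (1 - t).

```python
import math
from fractions import Fraction


def synthetic_needed(female, total, target):
    """Smallest number of synthetic female images s such that
    (female + s) / (total + s) >= target."""
    if Fraction(female, total) >= target:
        return 0
    return math.ceil((target * total - female) / (1 - target))


# Hypothetical dataset: 30 of 1,000 images (3%) depict female judges.
target_share = Fraction(34, 100)  # 34% of U.S. judges are female
extra = synthetic_needed(30, 1000, target_share)
new_share = Fraction(30 + extra, 1000 + extra)
```

Under these assumptions, nearly half as many synthetic images as the entire original dataset would be needed, which hints at how heavily a rebalanced dataset can come to depend on the quality of its synthetic portion.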

Finally, synthetic data can help avoid massive copyright infringement. AI companies could train their models on synthetically generated text, images, and sounds instead of copyrighted works. OpenAI CEO Sam Altman has praised synthetic data as a means of avoiding copyright infringement when training huge ML models. Relatedly, content wholly generated by AI is not copyrightable for failure to satisfy the authorship requirement of copyrightability. As such, most outputs of generative AI are not copyrightable. Subject to complications discussed below, training ML models on such synthetic content would not constitute copyright infringement.

The Perils of Synthetic Data

While synthetic data addresses several technical and legal challenges of real-world training data, it is far from a panacea. Indeed, poorly designed and deployed synthetic data can amplify rather than mitigate current harms.

Of course, low-quality synthetic data can be highly problematic. Training ML models on inaccurate or unrepresentative data can undermine the increasingly important functions that such models perform. For instance, IBM’s Watson Health gave incorrect cancer treatment advice because it was trained on erroneous synthetic data.

More generally, researchers have shown that ML models that recursively train on synthetic data can experience “model collapse.” In an iterative fashion, generative AI models can create synthetic content, which then serves as their own training data. This cycle of ingesting and producing synthetic data can lead models to become irretrievably divorced from reality. The phenomenon may already be occurring because of the enormous amount of “inadvertent” synthetic data floating on the web. As AI-generated content proliferates online and then becomes the training data for those same models, AI systems can degenerate markedly.
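A toy simulation can convey the intuition, though it is only an analogy and not the method used by the researchers who documented model collapse. It assumes a "model" that does nothing more than reproduce items from its training corpus at random, with each generation trained solely on the previous generation's output; all names and numbers are hypothetical.

```python
import random


def next_generation(corpus, rng):
    """'Train' on a corpus, then 'generate' a same-sized corpus by sampling
    from it with replacement (a stand-in for a generative model)."""
    return rng.choices(corpus, k=len(corpus))


def simulate(corpus, generations, seed=0):
    """Track how many distinct items survive each round of self-training."""
    rng = random.Random(seed)
    distinct = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        distinct.append(len(set(corpus)))
    return distinct


# Hypothetical real-world corpus: 100 distinct "facts".
real_corpus = list(range(100))
history = simulate(real_corpus, generations=50)
```

In this sketch, an item that fails to be sampled in one generation can never reappear, so the count of distinct items recorded in `history` can only hold steady or shrink. Rare content disappears first, which loosely mirrors how recursive training can erode the less common parts of the original data.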

Synthetic data also poses specific risks for the three legal challenges discussed above. For instance, synthetic data that is too similar to reality can reveal personal information, thus compromising individual privacy. As synthetic data becomes less distinguishable from real-world data, it may become easier to reconstruct the underlying data. Such synthetic datasets could “leak” an unacceptable level of identifying information.

Furthermore, attempts to use synthetic data to mitigate bias face significant complications. Correcting for one bias can introduce others. For instance, adding synthetically generated photographs of female judges to a dataset may help with gender representation, but those synthetic images can exacerbate biases regarding other demographic characteristics.

Additionally, the prospect of avoiding copyright infringement by training ML models on synthetic data also faces challenges. As mentioned, content wholly generated by generative AI is not copyrightable. However, other types of synthetic data not arising from generative AI could, in theory, be copyrighted, thus raising the possibility of copyright infringement if data scientists used such data to train ML models. Moreover, content created by generative AI that is then further refined or manipulated by humans may satisfy the authorship requirement of copyrightability and be subject to intellectual property rights.

Even if synthetic data is not copyrighted, it may infringe others’ copyrights. For example, consider a text-to-image generator that produces images that are substantially similar to real-world (copyrighted) photographs. If an ML model trains on such infringing synthetic data, the model may be infringing copyrights as well.

A final consideration implicates not just synthetic data but the promises and perils of AI more generally. While low-quality synthetic data is problematic, high-quality synthetic data also poses important risks that policymakers should consider. Virtually infinite volumes of high-quality synthetic data can vastly expand the analytic and predictive power of ML models, for good or for ill. This article has highlighted many of the beneficial uses of AI in society, such as extending human creativity, diagnosing diseases, and navigating autonomous vehicles. However, more powerful ML models trained on huge amounts of synthetic data could also produce more dangerous deepfakes, disinformation, and cybersecurity threats. While synthetic data enjoys many advantages over real-world data, responsible development and deployment of this resource is critical. As the federal government and states seek to both promote and regulate AI, the rise of synthetic data increases the urgency of trenchant policy questions about the appropriate role and limits of this transformative technology in society.

Peter Lee is the Martin Luther King Jr. Professor of Law and Director of the Center for Innovation, Law, and Society, at UC Davis School of Law.