The promise of "Data for All" meets concerns about quality, quantity and bias. Susannah Matschke, Head of Data & AI for Sopra Steria Next in the UK, explores whether synthetic data represents AI's salvation or its next challenge.
Generative artificial intelligence has an insatiable appetite for data. As models grow larger and more sophisticated, their hunger for training data increases exponentially. Yet organisations worldwide face the same persistent challenge: insufficient high-quality, diverse datasets to feed their AI systems effectively. Enter synthetic data: artificially generated information that mimics real-world patterns without compromising individual privacy or requiring expensive data collection processes.
But as synthetic data emerges as a potential solution to AI's data shortage, questions arise about whether we're solving one problem only to create others. Can artificially generated data truly deliver on its promise of "Data for All", or are we setting ourselves up for what Canadian technology critic Cory Doctorow calls a "coprophagic AI crisis": systems feeding on their own digital waste?
The data dilemma: scarcity in an age of abundance
The modern AI landscape presents a fundamental paradox. While we generate more data than ever before, accessing quality, representative datasets for AI training remains challenging. Regulatory constraints around privacy, the high costs of data collection, and the scarcity of edge cases in real-world datasets create bottlenecks that limit AI development.
"Synthetic data has found its place in AI because enormous amounts of data are needed to train algorithms," explains Susannah Matschke, head of data & AI for Sopra Steria Next in the UK. "When there isn't enough data or the quality isn't up to par for building the desired models, that's where synthetic data comes in."
This scarcity becomes particularly acute in regulated industries like healthcare and finance, where data sensitivity makes sharing and access extremely difficult. Imagine a pharmaceutical company racing to develop a breakthrough treatment. They have groundbreaking algorithms ready to identify promising drug compounds, but there's a catch: accessing real patient data is restricted by privacy regulations, and what's available represents only a narrow demographic slice. The clock is ticking, lives hang in the balance, and traditional data collection could take years. Synthetic data offers a pathway to unlock AI development in these critical sectors while maintaining privacy and compliance standards.
The "Data for All" promise
Synthetic data represents more than just a technical workaround; it embodies a democratic vision for AI development. By generating artificial datasets that preserve statistical properties while eliminating personal identifiers, organisations can share valuable training data without privacy concerns.
"Synthetic data is excellent because it eliminates the risk of using someone's personal data," Matschke notes. "This is a game changer in AI development, as often the biggest obstacle is access to quality and diverse data, especially in sectors where privacy regulations or costs make data difficult to obtain."
The technology excels at modelling unusual scenarios that are difficult to find in real-world data. For autonomous vehicles, synthetic data can generate thousands of edge cases (from extreme weather conditions to unusual pedestrian behaviour) that would be impossible or dangerous to collect in reality.
From a sustainability perspective, synthetic data offers compelling advantages. "Collecting, storing, and processing real-world data can be very energy-intensive," Matschke explains. "With synthetic data, you can generate exactly what you need when you need it, which reduces computational and storage costs and generally translates to a lower environmental impact."
Technical innovation and quality control
The technical process of generating synthetic data requires careful calibration to ensure realism and utility. "You need real data to understand what the data should look like, what the value ranges are," Matschke explains. "For a simple example like people's ages, you want it to start at zero and go up to 100-110 years, not generate ages of 200 or 500 years."
Understanding these boundaries and distributions becomes crucial for practical applications. "If you look at the ages of credit card applicants, they're generally between 18 and 50 years old; you don't want to generate five-year-old children applying for credit cards," she adds.
This human-in-the-loop approach allows organisations to maintain control over the generation process, influencing outputs and establishing limits to avoid bias. The result is data that maintains statistical integrity while serving specific training needs.
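The kind of range constraint Matschke describes can be sketched in a few lines. The example below is a hypothetical illustration, not code from any real system: it rejection-samples applicant ages from a simple fitted distribution and discards anything outside the plausible bounds a human would set from real data (the mean, spread, and 18-75 range here are all assumed values).

```python
import random

def synth_ages(n, mean=35, sd=12, lo=18, hi=75):
    """Generate n synthetic applicant ages, rejecting any value
    outside the plausible range set from real data (18-75 here)."""
    ages = []
    while len(ages) < n:
        age = round(random.gauss(mean, sd))
        if lo <= age <= hi:  # human-set guardrail: no 5-year-old applicants
            ages.append(age)
    return ages

sample = synth_ages(1000)
print(min(sample), max(sample))  # always within the permitted bounds
```

Real generators use far richer models (and must preserve correlations between columns, not just single-column ranges), but the principle is the same: the human in the loop defines the limits, and the generator is only free within them.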
What happens when AI feeds on AI?
Despite these advantages, synthetic data faces significant challenges that echo concerns raised by Doctorow. In his essay "The Coprophagic AI Crisis," Doctorow warns of a future where AI models increasingly train on data generated by other AI systems, creating what researchers call "model collapse."
"There's a legitimate concern of 'model collapse,'" Matschke acknowledges. "When you use one model to generate data intended to train another model, you create a circular scenario where the errors, limitations, and biases of one AI repeat and amplify. Over time, you can lose the nuance, precision, and diversity that make data unique and important."
Doctorow's analysis reveals the mathematical dangers of this recursive training. As he notes, research shows that "training an AI on the output of another AI makes it exponentially worse." The proliferation of AI-generated content across the internet threatens to contaminate future training datasets with increasingly degraded information.
This contamination risk extends beyond synthetic data generation to the broader challenge of maintaining data quality in an AI-saturated information environment. As Doctorow observes, "the quantum of human-generated 'content' in any internet core sample is dwindling to homeopathic levels."
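The mechanism behind model collapse can be shown with a toy simulation (a hypothetical sketch, not drawn from the article or Doctorow's essay): a "model" that simply re-estimates token frequencies from its predecessor's output, then generates the next generation's training data. Any token that happens to be sampled zero times gets probability zero and can never reappear, so the distribution's long tail, its nuance and diversity, erodes generation by generation.

```python
import random
from collections import Counter

def refit_and_sample(counts, n):
    """'Train' a model by estimating token frequencies from the data,
    then generate n synthetic tokens from it. A token sampled zero
    times gets probability zero and is lost for good."""
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return Counter(random.choices(tokens, weights=weights, k=n))

random.seed(1)
# "Real" data: a long-tailed vocabulary of 200 tokens
real = Counter({f"tok{i}": max(1, 300 // i) for i in range(1, 201)})

data = real
for generation in range(15):  # each model trains on the previous one's output
    data = refit_and_sample(data, 500)

print(len(real), "->", len(data))  # distinct tokens shrink across generations
```

This is a deliberately crude stand-in for a language model, but it makes the circularity concrete: nothing in the loop can reintroduce what an earlier generation failed to reproduce, which is why Matschke's recommendation to keep retraining on real-world data matters.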
Governance and best practices
Addressing these challenges requires robust governance frameworks that balance innovation with quality control. "Governance is essential in AI," Matschke emphasises. "Organisations need clear documentation on synthetic data usage and generation, with regular audits to detect bias. Strong internal oversight is needed, whether through ethics committees or independent review committees."
The solution lies not in avoiding synthetic data but in implementing it thoughtfully. "The solution is to use synthetic data to complement real data, not completely replace it, and to regularly retrain these models with real-world data," Matschke explains.
This hybrid approach acknowledges synthetic data's limitations while leveraging its strengths. In high-stakes environments like healthcare or transportation, synthetic data should support rather than replace real-world datasets. "For diagnostic medical tools or autonomous vehicles, even a small deviation from the real scenario could cause enormous impact. This should really be used to support, not replace real data," she says.
A sustainable path
Looking ahead, the role of synthetic data in AI development appears both promising and nuanced. Rather than representing a complete solution to AI's data challenges, synthetic data emerges as a powerful tool within a broader ecosystem of responsible AI development.
"I don't see this replacing real data," Matschke concludes. "I think it will become a central part of the AI development process, perhaps early in the process. We're moving toward a hybrid model where synthetic data fills gaps in real data or handles rarer, riskier, or privacy-constrained scenarios."
The key lies in maintaining what Matschke calls "due diligence" on data integration: ensuring that datasets are representative, unbiased, and ethical. This requires ongoing collaboration between technologists, ethicists, and domain experts to establish standards that prevent the coprophagic scenarios Doctorow warns against.