Electronic Records
Synthetic data breaks biggest bottleneck: access to high-quality data
February 26, 2026
The coronavirus pandemic, first detected in 2019, was a worldwide wake-up call to epidemiologists and clinicians. In Canada alone, more than 60,000 people had died from COVID-19-related illnesses by September 2024.
Seniors were particularly vulnerable; so were those with chronic lung or heart conditions and the immunocompromised. Jobs and living circumstances that brought people into frequent contact with the broader population also increased the risk of exposure.
In COVID’s wake, Canadian researchers prepared for the next pandemic by creating “synthetic populations” at the health region level – the level at which most public health decisions are made – using open data stores.
“The idea was to create a synthetic ‘Canadian world’ that reflects real population structure,” said Dr. Khaled El Emam, a professor of epidemiology and public health at the University of Ottawa and director of the Ottawa Medical AI Research Institute (OMARI).
These synthetic Canadians were assigned medical characteristics, daily routines, locations and other attributes. The model allowed researchers to simulate pandemic preparedness scenarios without using personally identifiable data.
“It’s a good illustration of how synthetic data can support large-scale public health modeling and planning, while avoiding many of the access barriers and privacy risks associated with using real health records,” Dr. El Emam said.
What is synthetic data? Synthetic data – data sets generated to mimic the properties, patterns and structures of real-world counterparts – is becoming increasingly valuable in healthcare settings. That’s because doctors and data scientists sometimes can’t use actual datasets due to privacy regulations, or because the research cohort is too small to be usable.
“In healthcare, it’s very common to hire data scientists and developers to innovate, only for them to spend months waiting for data access,” said Dr. El Emam. “Synthetic data can often be made available much more quickly, which lets teams build and test pipelines, develop and validate models, and explore data early.”
Synthetic data is not simply “made up”; it’s generated by training a model on original datasets to produce new data that mimics their behaviour. Those original datasets could come from clinical trials; health system data from routine care; billing, payment and claims systems; research data held by institutions or governments, such as the open data used to generate the “synthetic Canadian world”; and more.
By contrast, simulated data is created using human-specified statistical attributes, said Lorne Rothman, principal data scientist and healthcare lead for SAS Canada, the Canadian arm of data and AI leader SAS. “A researcher might write a program to generate data that contains particular patterns, specific averages, spreads, and distributions of measures.”
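Rothman’s description of simulated data can be made concrete with a short sketch. Every average, spread and prevalence below is specified by the researcher rather than learned from real records; all values are illustrative assumptions, not real clinical parameters.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1_000  # hypothetical cohort size

# Every parameter here is specified by the researcher, not learned
# from real records: a mean age, a spread, an assumed prevalence.
ages = rng.normal(loc=52.0, scale=14.0, size=n).clip(18, 95)
# Blood pressure written as a function of age to build in a pattern.
systolic_bp = rng.normal(loc=121.0 + 0.4 * (ages - 52.0), scale=12.0)
has_diabetes = rng.random(n) < 0.09  # assumed 9% prevalence
```

Because nothing is learned from data, the result is only as realistic as the researcher’s assumptions.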
To create synthetic data, machine learning algorithms analyze samples of real-world data, producing new data that mirrors patterns in the original data set.
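As a minimal illustration of that process, the sketch below “trains” on a stand-in dataset by estimating its joint statistics, then samples new records from the learned model. Real generators use far richer models (GANs, variational autoencoders, sequential tree ensembles); a multivariate Gaussian is used here only to keep the example self-contained, and the “real” data is itself randomly generated for the demo.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in "real" data (itself randomly generated for this demo):
# columns might represent age, BMI and systolic blood pressure.
real = rng.multivariate_normal(
    mean=[50.0, 27.0, 120.0],
    cov=[[200.0, 10.0, 60.0],
         [10.0, 20.0, 15.0],
         [60.0, 15.0, 150.0]],
    size=5_000,
)

# "Training": learn the joint statistics of the real data.
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# "Generation": sample brand-new records from the learned model.
# No synthetic row corresponds to any individual in the real data,
# yet the columns keep the same means and correlations.
synthetic = rng.multivariate_normal(mu, sigma, size=5_000)
```

The synthetic table can stand in for the real one wherever only the aggregate patterns matter.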
“The most useful synthetic data typically comes from high-quality real data with strong statistical fidelity, diversity, longitudinal depth and balanced representation of variables – demographics, diagnoses, treatments,” said Dr. Margarita Mersiyanova, health and life sciences industry consultant with SAS. Synthetic datasets are most often generated using statistical and probabilistic methods, machine learning generators, or a hybrid of the two, she said.
Building trust: Synthetic data can be particularly valuable for specific use cases, Rothman said. For example, the cost to collect or purchase real data might be prohibitive; existing real data might require time-intensive, error-prone labeling; existing data might be biased or lack adequate representation of groups or segments; or next to no data might exist for some scenarios.
But using synthetic data comes with risks. Mersiyanova points to a problem common to generative AI technologies: hallucinations, the false, fabricated content that generative models can produce. Such models can also amplify biases present in the original data set.
“If we don’t have good underlying data to start with, it will impact any data that is synthetically generated,” Dr. El Emam said. “This is a broader issue in healthcare, where data collection tools are primarily designed for administration, and data is spread across various systems.
“Before any generative AI-assisted tool touches care delivery, you need guardrails that focus on safety, accountability and trust.”
And, as always with healthcare, privacy is the elephant in the room.
On the one hand, synthetic data can be a valuable tool for de-identification. In Ontario, regulatory guidance specifically acknowledges how it can reduce re-identification risk in structured data.
On the other hand, there’s a potential risk of “over-fitting,” which occurs when a generative model fits its training data so closely that it reproduces real records, leading to identity disclosure. Proper training practices and post-generation testing can mitigate that risk, Dr. El Emam explained, but generated data can never be a zero-risk proposition.
“Zero risk is not achievable or expected in practice, and if you transform data so aggressively that you destroy its utility, there’s no point in transforming it at all,” he said.
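The kind of post-generation testing described above can start with something as simple as a distance-to-closest-record check: does any synthetic row sit suspiciously close to a real training row? The function below is a minimal, hypothetical version of such a test; production privacy evaluations use richer membership- and attribute-disclosure metrics.

```python
import numpy as np

def memorization_check(real, synthetic, threshold=1e-6):
    """Count synthetic rows that (near-)duplicate a real training row.

    A minimal distance-to-closest-record test: for each synthetic row,
    find the nearest real row and flag it if the distance is ~zero.
    """
    flagged = 0
    for row in synthetic:
        nearest = np.min(np.linalg.norm(real - row, axis=1))
        if nearest < threshold:
            flagged += 1
    return flagged

rng = np.random.default_rng(seed=1)
real = rng.normal(size=(200, 3))    # stand-in for real records
clean = rng.normal(size=(100, 3))   # independently generated rows
leaky = clean.copy()
leaky[0] = real[0]                  # an over-fitted model "memorized" a record
```

Here `memorization_check(real, clean)` finds nothing, while the copied row in `leaky` is flagged.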
Rothman describes the creation and use of synthetic data as a “privacy-accuracy balancing act.”
“There’s a trade-off. The more realistic a synthetic data set is, the less privacy it preserves,” said Rothman. “Adding more ‘noise’ to protect privacy makes the data set less realistic.”
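That trade-off is easy to demonstrate. In the hypothetical sketch below, progressively larger amounts of Laplace noise (a mechanism commonly used in differential privacy, though this sketch is illustrative rather than a differentially private release) are added to a stand-in measurement; each increase buys more privacy protection while pushing the data’s spread further from the real distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
# Stand-in for a real measurement, e.g. systolic blood pressure.
real = rng.normal(loc=120.0, scale=15.0, size=10_000)

spreads = {}
for b in (1.0, 5.0, 50.0):
    # Larger Laplace scale b means stronger privacy protection
    # and a noisier, less realistic column of values.
    noisy = real + rng.laplace(scale=b, size=real.size)
    spreads[b] = noisy.std()
# At b=50 the spread is several times the real spread of ~15,
# illustrating how heavy noise erodes statistical fidelity.
```

Picking `b` is exactly the privacy-accuracy balancing act Rothman describes.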
Human oversight is essential. Generative AI tools should support clinicians, not replace them, said Dr. El Emam. There must be clear rules about how these tools are used, when a human must review or override the output, and who is ultimately responsible for decisions. It also means paying attention to how the tool is actually used in practice, not just how it performs in testing.
“For tools that influence care, it’s not enough to show that a model works in a lab setting. You need evidence that it performs reliably in the real clinical context, and you need to test for known failure modes of generative systems, such as hallucinated outputs, inconsistent behaviour across populations, and performance drift over time,” he said.
Looking ahead: Rothman said there is growing interest in synthetic data, but its use across Canada is still rare.
“Initial interest focuses on the creation of structured tabular data, tables of health information,” he said. “It also shows promise for generating synthetic medical images to support research and diagnostic applications.”
Many of those applications require multiple tables in relational databases. Maintaining those data relationships in synthetic data can be challenging.
Adding to that is the longitudinal nature of much healthcare data. Synthetic data generators must preserve relationships in the data across points in time. Has a patient’s condition improved? Is the disease’s prevalence in various populations changing?
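One simple way to see why temporal structure matters: a generator that matches each visit’s overall distribution but ignores ordering destroys the correlation between successive measurements. The hypothetical sketch below compares lag-1 autocorrelation in a stand-in longitudinal series against a shuffled copy of the same values.

```python
import numpy as np

def lag1_autocorr(x):
    """Correlation between consecutive observations in a series."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

rng = np.random.default_rng(seed=3)
# Stand-in longitudinal series: a patient's lab value drifting
# gradually across 500 visits (a random walk around a baseline).
real_series = 100.0 + np.cumsum(rng.normal(size=500))

# A naive generator that matches the overall distribution but
# ignores visit order is equivalent to shuffling the values:
# the marginals are identical, the temporal structure is gone.
shuffled = rng.permutation(real_series)
```

The real series shows strong visit-to-visit correlation; the shuffled one shows almost none, even though both contain exactly the same values.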
“If a synthetic data set does not look and behave like real-world data, then it won’t be useful, and in the worst cases, it can be counter-productive or even harmful,” Rothman said.