BC Cancer researchers turn to the Microsoft cloud to meet fast-growing requirements
March 29, 2021
A cancer-free world – that’s the ambitious mission of BC Cancer. Based in Vancouver, the agency conducts leading-edge cancer prevention, screening, research, and treatment.
Within the Department of Molecular Oncology at BC Cancer, the Aparicio lab focuses on precision medicine, which is far more effective than treating cancer patients as a homogeneous group. The lab studies individual cancer cells because each type of cancer in each patient is different. Its scientists use single-cell genome sequencing to understand the diversity that exists within a cancer, learn how tumors become resistant to treatment, and create the most targeted treatment possible.
The methodology is ground-breaking – and computationally intensive. The research creates a massive amount of data that the Aparicio lab stored and processed in an expensive on-premises environment. “We had millions of dollars of on-premises servers that we purchased through grant funding,” says Daniel Lai, Senior Bioinformatics Scientist at BC Cancer. “They powered our research for almost a decade, but they had limited lifespans and capabilities. We needed to move to the cloud to continue supporting research at the scale we need.”
Data for machine learning: As the lab continued to generate a huge volume of data, it turned to AI and machine learning to find patterns in its genome research. “By using machine learning algorithms, researchers are able to rapidly create virtual machines that are capable of pinpointing mutations in the haystack that is the human genome,” says Lai.
To store its data and support its computational needs, the Aparicio lab adopted the Microsoft Azure cloud platform three years ago. Data experts at the lab create their own machine learning algorithms that they run in Azure by using Azure Virtual Machines. A 2019 paper published in a Cell Press scientific journal detailed their sequencing methods and algorithms and how the lab solved computational challenges with innovative tools and techniques. “Because we use genomic approaches to study cancers in newer and newer ways, we have to develop our own algorithms,” says Samuel Aparicio, Chair of Breast Cancer Research at BC Cancer.
However, the lab still ran its clusters primarily in an on-premises environment, and more of its hardware was reaching end of service. The lab knew it needed to move its entire genome database to the cloud to get the compute power and scale that it needed. Lai explains just how much data the lab works with: its single-cell sequencing approach creates a thousand input files every two days. In a week, researchers face roughly 4,000 new files, each of which takes the lab’s system a few minutes to an hour to analyze. “The amount of computation we need exploded and grew out of scale very rapidly,” says Lai. “Our 600-core cluster of 40 computers couldn’t keep up anymore.”
Collaborative migration and data sharing: The Aparicio lab began migrating its massive 1-petabyte genome database to the cloud. It received technical support from local company Invero, a member of the Microsoft Partner Network with Gold competencies and the Cloud Summit 2019 Canadian Microsoft Azure Partner of the Year. Invero met with the lab’s research team to understand its data requirements and how the researchers interact with the data. “We wanted to ensure that the Aparicio lab had a method of cataloging its data so the metadata was consistent and staff could easily search it later,” says Craig Slack, Chief Executive Officer and Co-Founder of Invero.
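The cataloging goal Slack describes, consistent metadata that staff can search later, can be illustrated with a minimal sketch. It is not the catalog Invero or the lab built: the required field names (`sample_id`, `library_id`, `path`) and the `Catalog` class are hypothetical, chosen only to show how enforcing a small set of required fields at write time keeps later searches reliable.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the article does not name the lab's schema.
REQUIRED_FIELDS = ("sample_id", "library_id", "path")


@dataclass
class Catalog:
    """Toy in-memory catalog that rejects records with missing metadata."""
    records: list = field(default_factory=list)

    def add(self, **metadata) -> None:
        # Normalize values first so whitespace-only entries count as missing.
        cleaned = {key: str(value).strip() for key, value in metadata.items()}
        missing = [name for name in REQUIRED_FIELDS if not cleaned.get(name)]
        if missing:
            raise ValueError(f"missing required metadata: {missing}")
        self.records.append(cleaned)

    def search(self, **criteria) -> list:
        # Return every record whose fields match all of the given criteria.
        return [
            record for record in self.records
            if all(record.get(key) == value for key, value in criteria.items())
        ]


if __name__ == "__main__":
    catalog = Catalog()
    catalog.add(sample_id="S1", library_id="L1", path="runs/S1/L1.bam")
    catalog.add(sample_id="S1", library_id="L2", path="runs/S1/L2.bam")
    print(catalog.search(sample_id="S1"))
```

Validating at write time, rather than cleaning up later, is the design choice being illustrated: a record that is missing a required field never enters the catalog, so every search runs against uniform metadata.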
The lab also worked with a team of Microsoft cloud solution architects who answered Azure-specific questions for the BC Cancer software developers and offered guidance and best practices. The Microsoft team held onsite training for the Aparicio lab on how best to administer the cloud environment while also helping ensure that researchers could access the data they needed.
The lab’s vision for partnerships also extends to other research institutes. A researcher from BC Cancer recently moved his lab operation to Memorial Sloan Kettering Cancer Center in New York, but he planned to continue collaborating with former colleagues for several years.
By moving its data to Azure, the Aparicio lab created a shared computing environment from which many researchers in the field can benefit. “Thanks to Azure, researchers in different countries can share knowledge and collaborate seamlessly on the same datasets that could lead to treatments,” says Aparicio. “Our field increasingly will need this type of partnership with other institutions and a common computing environment, so we’re trying to lead the way.”
Increased computation, reduced costs: The Aparicio lab now processes the high volume of data it needs, generating about a terabyte (TB) of data a day, or 400–500 TB per year. “We would need to buy a new hard drive every couple days and a new server rack every couple of months just to keep up with this pace,” says Lai. “We’ve solved that completely with Azure because we have effectively limitless computational power.”
The lab’s sequencing files must be processed every night after they’re uploaded. According to Lai, each virtual machine costs from several cents to several dollars per hour to run, which adds up quickly across thousands of machines. So the lab adopted Azure Batch, a job scheduling service that triggers the tens of thousands of jobs that are required to process these files. Batch can run these jobs on low-priority servers, which use datacenter capacity that is otherwise idle and not immediately required.
“On average, using Azure Batch costs about 10 percent of a normal virtual machine,” says Lai. “By using these low-priority instances with Batch, we can process tens of thousands of files every 72 hours and do 10 times the work for the same price compared to standard virtual machines.”
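The arithmetic behind Lai’s comparison can be sketched in a few lines. This is a rough illustration, not the lab’s cost model: the only figure taken from the article is the “about 10 percent” price ratio, while the `NightlyRun` class, the job count, the hours per job, and the hourly rate are hypothetical values picked for round numbers. It also ignores any rework needed if low-priority capacity is reclaimed mid-job.

```python
from dataclasses import dataclass

# From the quote above: low-priority capacity costs "about 10 percent"
# of a standard virtual machine. Treated here as a fixed ratio (assumption).
LOW_PRIORITY_PRICE_RATIO = 0.10


@dataclass(frozen=True)
class NightlyRun:
    """Hypothetical description of one processing run."""
    jobs: int              # number of jobs in the run (illustrative)
    hours_per_job: float   # average VM-hours per job (illustrative)
    standard_rate: float   # standard VM price in dollars per hour (illustrative)

    def standard_cost(self) -> float:
        # Total cost if every job runs on a standard-priced VM.
        return self.jobs * self.hours_per_job * self.standard_rate

    def low_priority_cost(self) -> float:
        # Same work billed at the reduced low-priority ratio.
        return self.standard_cost() * LOW_PRIORITY_PRICE_RATIO


def work_multiplier(price_ratio: float = LOW_PRIORITY_PRICE_RATIO) -> float:
    """How many times more work the same budget buys at the reduced price."""
    return 1.0 / price_ratio


if __name__ == "__main__":
    run = NightlyRun(jobs=10_000, hours_per_job=0.5, standard_rate=1.00)
    print(f"standard:     ${run.standard_cost():,.2f}")
    print(f"low priority: ${run.low_priority_cost():,.2f}")
    print(f"work for the same budget: {work_multiplier():.0f}x")
```

Because the price ratio is the only term that changes, the “10 times the work for the same price” claim follows directly from paying one-tenth as much per VM-hour, whatever the actual job count or hourly rate.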
Scalability and speed: The lab gained the scalability and speed it needed by moving its genome database, running its machine learning in Azure, and using Batch to trigger processing jobs. Ultimately, its improvements in data management and analysis will help inform cancer diagnoses and treatments as research eventually moves into clinical applicability. “With our flexible computing environment in Azure, we now have a way to accelerate and scale our processes so we can learn more by operating with more data,” says Aparicio. “The more patient information we can aggregate, the more power we have to learn about important but subtle effects that are easy to miss without enough data.”
Adds Lai, “Previously, we’d get held up for a year on something that we can now resolve in less than a month. Our lab uses machine learning on Azure to make predictions about tumors with a speed and accuracy that humans can’t do.”