Research & Development
Cloud solution used to discover new viruses
January 26, 2022
VANCOUVER – As scientists and governments grapple with new variants two years into the COVID-19 pandemic, an international research team led by the University of British Columbia’s Cloud Innovation Centre (CIC) has unlocked a trove of data that could help prevent future pandemics.
The project is called Serratus. “Our singular goal was to discover every coronavirus in the world,” said Artem Babaian, the project lead.
“At the start of the pandemic, there were only about 44 different species of coronaviruses identified. I felt a complete record of coronaviruses was incredibly important for fighting the pandemic and ensuring we had the tools to prevent another one from ever happening,” he said.
Babaian, a computational biologist, recruited his friend Jeff Taylor, a recent UBC computer engineering graduate, to execute what he calls an exceptionally simple idea. “We were just going to look at everything [in the Sequence Read Archive] and find every coronavirus possible,” said Babaian.
Simple ideas are not always easy. The public RNA and DNA data libraries held in the U.S. National Center for Biotechnology Information’s Sequence Read Archive (SRA) have historically been incredibly difficult to access in full. The SRA holds over 40 million gigabytes of raw data, 20 million gigabytes of which relate to RNA sequences alone, representing 5.7 million biological samples from all over the world.
This massive data cache was growing daily and outpacing Moore’s Law, the rule of thumb that the number of transistors on a chip, and with it conventional computing power, doubles roughly every two years.
According to Babaian’s back-of-the-napkin calculations, it would take one computer more than 2,000 years to process the data. Even standard high-performance computing would have required more than a year.
“Everyone has been putting their data in this archive for the last 14 years, but there’s so much and it’s growing so quickly that any kind of large-scale analysis seemed like an impossible task,” said Babaian.
In February 2020, AWS, together with the U.S. National Institutes of Health’s STRIDES Initiative, cloned the entire SRA to Amazon S3 (a cloud storage service).
With the database now in the cloud, Babaian approached the CIC – a UBC and Amazon Web Services (AWS) collaboration providing students, staff and faculty access to cloud technology and AWS expertise. They assembled an international research team and got to work building cloud architecture that could quickly search the massive SRA database and find the viruses they were looking for.
“Before AWS, [Serratus] wouldn’t have been possible, it’s really the perfect case study for CIC,” said Coral Kennett, educational sales leader at AWS. “We’re intended to be a place where researchers and public sector organizations can experiment, fail fast and try again. We didn’t know if it was going to work, but that was the point in them bringing the project to us – to find out what we can do together and find a solution.”
By harnessing the power of 22,500 AWS compute instances and AWS’s cloud storage technology, Serratus split data into smaller chunks and distributed them across AWS architecture to be swiftly and cheaply analyzed. The project found 130,000 new RNA viruses, including nine new coronavirus species. At a cost of less than half a penny per library, Babaian and his team analyzed up to one million sequencing libraries per day – or 5.7 million in 11 days, taking breaks to sleep. The total cost came in at $24,000.
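The split-and-distribute approach described above can be illustrated with a toy sketch. It is not the Serratus code: the motif, the function names, the sample reads and the use of local threads in place of cloud instances are all assumptions made for illustration, and a real search aligns reads against reference viral genomes rather than matching a fixed string. The last line simply divides the article's total cost by its library count.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical query motif, standing in for a real viral reference sequence.
MOTIF = "GATTACA"


def split_into_chunks(reads, chunk_size):
    """Split a list of reads into consecutive chunks of at most chunk_size."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def count_hits(chunk):
    """Count the reads in one chunk that contain the motif."""
    return sum(1 for read in chunk if MOTIF in read)


def search_library(reads, chunk_size=2, workers=4):
    """Scatter chunks to workers, then gather and sum the per-chunk counts."""
    chunks = split_into_chunks(reads, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_hits, chunks))


# Made-up reads; three of the five contain the motif.
toy_library = ["ACGTGATTACAGT", "TTTTCCCC", "GATTACA", "ACGT", "CCGATTACATT"]
total_hits = search_library(toy_library)
print(f"reads matching motif: {total_hits}")

# Cost per library implied by the figures quoted in the article.
cost_per_library = 24_000 / 5_700_000
print(f"cost per library: ${cost_per_library:.4f}")
```

Because each chunk is processed independently, adding more workers shortens the wall-clock time without changing the result, which is the property that lets a search like this spread across thousands of machines.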
The team has used these results to build a global viral surveillance system called The Open Virome, a database that anyone in the medical or scientific community can search to learn, within seconds, the origins of an unidentified virus, provided it has been sampled before.
“While we couldn’t prevent [COVID-19],” said Babaian, “we are creating the infrastructure to make next-generation diagnostics better and more sensitive, with the hope that the next outbreak will be caught earlier before it becomes a pandemic.”
Nature published a new paper co-authored by Babaian and Taylor detailing Serratus’ results this week.
With the database and sequences now publicly available, Babaian and his team will work towards integrating the rich evolutionary data into diagnostic tools.
“Serratus is a catalyst – we’re accelerating a change in virus discovery,” said Babaian. “We’re just one team, we can’t annotate all of this data. That’s why we’ve released it as open source to the public, to get it into biologists’ hands so it can be turned into meaningful human knowledge. Data is meaningless if it’s not knowledge.”