Often called the book of life, human DNA has been no easy read for scientists, who face the staggering challenge of figuring out which genetic mutations lead to disease. People carry millions of them in their code, and there has been no efficient way to tell the ones that cause diseases such as cancer from those that simply make ear wax moist.
Now, a research team led by computer engineers at the University of Toronto says it has developed a biological browser, a first-of-its-kind filtering technology that may finally solve the problem.
Like a powerful search engine that mines the web for answers, the new computational system combs the human genome to seek and sort meaningful mutations. Google Inc., along with other companies, has already expressed an interest in it – raising questions about what could, or should, happen with publicly funded technology that is likely to be in demand in a growing world of Big Data.
The technology may transform medical research by pointing the way to the genetic roots of diseases. But not just diseases. The system, which has been named SPANR (short for "splicing-based analysis of variants" and pronounced "spanner"), could also be used to identify traits that make people healthier, smarter and even happier.
"Ours is the first example of a tool that will be able to efficiently figure out what's going on with your genome," said Brendan Frey, the U of T professor of engineering and medicine who led the 10-year project.
At its core is a computational technique known as "machine learning," in which a system is programmed to recognize mutations based on examples researchers have input. In its more complex forms, called "deep learning" or artificial intelligence, the system is designed to detect and decipher patterns in the data. It is the kind of automated reasoning behind the latest voice-, text- and image-recognition engines, popular virtual-assistant apps such as Siri, and now SPANR.
The Toronto system is designed to detect glitches in the vast areas of DNA that regulate genes and have not been extensively studied. But it has also been "trained" with data and algorithms to analyze and rank each mutation in terms of its power to change the way a cell behaves. The higher the ranking, the more likely it is that the mutation leads to disease.
"Computers have been used to read the genome for quite a while, but this is using a computer to interpret and understand the genome," said Prof. Frey, who holds the Canada Research Chair in Biological Computation. "Our system is not perfect, but it works very well."
In a study published online on Thursday in the journal Science Express, the Toronto researchers report that their system accurately confirmed 94 per cent of the known genetic culprits behind well-studied diseases without any information related to the patients or their conditions. It also discovered new genetic mutations linked to colorectal and pancreatic cancers, spinal muscular atrophy (a leading cause of infant mortality), and most dramatically, 39 genes never before linked to autism.
Prof. Frey said the journal rushed to publish news of the system this week because it could bring much needed precision to genetic research, which has often involved collecting and comparing the genomes of sick and healthy people – "tens of thousands of them. But even those numbers haven't been enough to pinpoint patterns or mutations that might be relevant."
Consider how many patterns of text can be created, whole books, with an alphabet of just 26 letters, he said. The genome, meanwhile, is a biochemical alphabet of three billion chemical pairs: "The number of patterns possible in DNA is greater than the number of atoms in the universe."
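To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. It assumes a four-letter DNA alphabet (A, C, G, T) and the commonly cited estimate of roughly 10^80 atoms in the observable universe; neither figure, nor the function names, comes from the Toronto study.

```python
import math

GENOME_LENGTH = 3_000_000_000  # chemical pairs in the genome, per the article
ALPHABET_SIZE = 4              # assumption: the four DNA letters A, C, G, T
ATOMS_LOG10 = 80               # assumption: ~10^80 atoms in the observable universe


def log10_sequence_count(length: int, alphabet: int = ALPHABET_SIZE) -> float:
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet)


def shortest_length_exceeding(log10_target: float, alphabet: int = ALPHABET_SIZE) -> int:
    """Return the smallest length whose sequence count exceeds 10**log10_target."""
    return math.floor(log10_target / math.log10(alphabet)) + 1


# Work in logarithms: the full count is far too large to print.
genome_log10 = log10_sequence_count(GENOME_LENGTH)
crossover = shortest_length_exceeding(ATOMS_LOG10)

print(f"Possible genome-length sequences: about 10^{genome_log10:.3e}")
print(f"Letters needed to exceed 10^{ATOMS_LOG10} patterns: {crossover}")
```

Under those assumptions, a stretch of just 133 letters already has more possible spellings than the atom estimate, and the count for a full three-billion-letter genome is a number with about 1.8 billion digits.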
Manolis Kellis, an expert in computational biology at the Massachusetts Institute of Technology who was not involved in the study, described the Toronto work as a necessary contribution to the field: "We're really learning the cellular circuitry is much more complex than the human mind can grasp."
But, he noted, it would not have been possible without first sequencing many human genomes, since they contain the raw data needed "for training these [computer] models."
Prof. Frey believes machine learning will usher in an age of personalized medicine, when treatments can be tailored to a patient's DNA.
Doctors, he said, could theoretically use the system to quickly produce a list of the significant mutations in any patient. Or, he predicts, within a decade, when many people will have had their DNA sequenced, it could be a tool, perhaps a smartphone app, that allows them to share and compare mutations, and possibly crowd-source their meaning by swapping details of their ailments and traits.
"People with a certain mutation in common might find out they're all scared of heights," he mused.
He said people are already uploading their genetic codes into Google's internet data-storage cloud. Last summer, the California-based internet giant revealed its research division had launched its own genome project to catalogue the biomarkers of a healthy human. This month, BlackBerry announced its new Passport smartphone will include a cancer-genome browser for doctors to access a patient's genetic data instantly.
All of this Big Data will require some form of deep learning to interpret, Prof. Frey said. Social networking and video gaming have already made the field so hot that his graduate students are "heavily courted" with six-figure salaries and signing bonuses of $3-million.
His students, he said, are thinking of a startup of their own.
As for himself, he said, "I didn't get into this to make money. I got into it to transform medicine."
Behind the Frey
Brendan Frey wasn't always a genome man. In the 1990s, the U of T engineering professor used to apply his machine learning know-how to digital communications. But in 2001, his pregnant wife received the results of a DNA test that said their unborn child carried a number of genetic mutations.
"No one could tell us if these were serious or benign. It was frustrating. DNA is like a digital code, and I thought, 'Why can't we understand?'"
It was after that experience that Prof. Frey aimed his computational expertise at the human genome. Specifically, he opted to target those sprawling sections of code once naively dismissed as "junk DNA" because they contain no genes.
Genes, with the protein recipes they carry, have always been DNA's divas, telling cells what to do, and where to do it (as in, "Be a bone! Be a muscle!"). Yet genes make up only about 1 per cent of DNA. The rest of the genome, scientists have more recently discovered, encodes crucial instructions to direct and regulate how genes operate.
"It's like a recipe book. The genes are the ingredients. But if you just have the ingredients you really don't have anything at all. You have to know the amounts, and what to do with them," Prof. Frey said. "You need the instructions."
Prof. Frey studied what happens when the instructions are muddled with mutations and cannot be properly put together, or "spliced," by the genes they are supposed to regulate.
With a team that included postdocs and graduate students Babak Alipanahi, Leo Lee, Hui Xiong and Hannes Bretschneider, Prof. Frey spent a decade feeding a computer system a wide range of examples so it would know what DNA looks like and how to read and recognize its text, patterns and mutations. And in the same way a child learns to read, he says, the system "learns" with mathematical models and algorithms to predict the biochemical effects of what it sees. He estimated the current system allows researchers to read DNA at a Grade 1 level. But like DNA itself, the system, he hopes, will improve and evolve over time.