Supercomputer Sequencing
The next step in cancer research may be the use of bioinformatics to analyze large amounts of RNA
At a Glance
- To keep up with the amount of genetic information coming in, our analysis techniques must be high-speed and high-volume as well
- Supercomputers offer realistic timeframes for analyzing large amounts of genomics data
- Researchers at Oslo University Hospital now use supercomputing to examine RNA transcription errors that cause fusion genes in prostate and colorectal cancers
- So far, cancer researchers have found numerous fusion proteins present only in prostate tumors – which may one day result in better diagnostics and targeted treatment
Over the past few decades, our understanding of cancer has grown increasingly advanced as we learn more about genetics, genomics and biochemistry. But in order to take advantage of this new knowledge, our methods of analyzing disease must progress rapidly as well. Rolf Skotheim, leader of the Genome Biology Group at Oslo University Hospital (Oslo, Norway), and his colleagues use supercomputing to process the enormous amounts of raw data they gather.
Together with their collaborators at the University of Oslo, Skotheim’s group studies the genetics of cancer. Their focus is on RNA transcription errors, particularly those that are involved in prostate and bowel cancers. “There are two main problems that can occur in transcription – either too much of it, which leads to the production of excessive levels of the given protein, or mistakes in it, which leads to RNA with the wrong composition of base pairs,” says Skotheim. “One such mistake can result in fusion genes, hybrid stretches of nucleic acids where sections of two separate genes are erroneously joined. Fusion genes are commonly found in cancer cells, but can also be present in healthy tissue. In our case, we have been able to identify several fusion genes present only in prostate or colorectal cancers – which we may be able to use as biomarkers to determine the presence and severity of disease, or to offer patients future targeted treatment opportunities. Our aim is to identify and characterize those and other critical genes involved in cancer development.”
Looking at genes in high volumes
The group’s research differs from most types of genetic analysis because they are focused on RNA, and on analyzing it in large amounts. Each set of RNA molecules they analyze consists of about 100 million bases. They sequence millions of short sequences of 100 base pairs each, then run massive statistical analyses in order to localize each one to the correct region of the human genome. “We believe that RNA is the key to the genetic analysis of disease, because it allows us to easily read out the active parts of the genome,” Skotheim says. “Additionally, most genes produce several variants of RNA and protein isoforms, and by taking this into account, we can make great strides in identifying cancer-specific molecules, including those caused by transcription from different promoters, alternative RNA splicing, and fusion with other genes – each of which may generate a totally different protein!” The kinds of readouts RNA offers aren’t available from DNA, and proteins are even more difficult to examine, because they can’t be analyzed as precisely, or in as unbiased a genome-scale manner. That’s why Skotheim’s research is focused on the analysis of cancer samples by high-throughput, paired-end RNA sequencing – though they gain added value by combining that work with DNA sequencing of the same samples for comparison.
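One reason paired-end reads are useful here: when the two mates of a single fragment map to different genes, the fragment may have come from a chimeric transcript. The following is a minimal sketch of that idea only, not the group's actual pipeline. The gene names, the input data, and the `min_support` threshold are all hypothetical, and real tools apply many more filters before calling a fusion.

```python
from collections import Counter

# Hypothetical toy input: for each read pair, the gene that each mate mapped to.
# In a real analysis these assignments would come from an aligner's output.
mapped_pairs = [
    ("GENE_A", "GENE_A"),  # concordant: both mates in the same gene
    ("GENE_A", "GENE_B"),  # discordant: mates in two different genes
    ("GENE_B", "GENE_A"),
    ("GENE_C", "GENE_C"),
    ("GENE_A", "GENE_B"),
    ("GENE_C", "GENE_D"),  # only one supporting pair, likely noise
]


def count_discordant_pairs(pairs, min_support=2):
    """Count read pairs whose mates map to two different genes.

    Returns a dict of {(gene_1, gene_2): count} for gene pairs supported by
    at least `min_support` read pairs. Gene pairs are sorted so that
    (A, B) and (B, A) are counted together.
    """
    counts = Counter(
        tuple(sorted(pair)) for pair in pairs if pair[0] != pair[1]
    )
    return {genes: n for genes, n in counts.items() if n >= min_support}


candidates = count_discordant_pairs(mapped_pairs)
for (gene_1, gene_2), support in candidates.items():
    print(f"{gene_1}-{gene_2}: {support} supporting read pairs")
```

Requiring more than one supporting pair is a simple way to suppress one-off mapping errors. With tens of millions of reads per sample, even this basic counting step is something that benefits from being spread across many processors.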
“I think it’s important to note that this isn’t the kind of work that can be done by a single laboratory,” Skotheim says of his group’s studies. “We’re part of a consortium of researchers that includes surgeons, pathologists, oncologists, geneticists, bioinformaticists and more. It’s clear to us that multidisciplinary involvement, even within the research groups, is crucial for good translational genomics.”
But flesh-and-blood colleagues aren’t the only valued collaborators on the project – Skotheim also works with Abel and Colossus, two supercomputers at the University of Oslo. They are Linux clusters and shared resources for research computing, designed to run many concurrent tasks with large datasets and memory requirements. Abel is a powerful cluster – with 650 computers, it runs on a total of 10,000 central processing units. (And you thought your quad-core laptop was powerful!) At the time of its installation, Abel was the 96th most powerful computer system in the world, offering the Genome Biology Group a huge advantage other cancer researchers may not have. “Needless to say, we wouldn’t get much done without them – you could spend your entire life crunching numbers and still not map a single nucleotide to its correct location in the genome. At 10,000 times the speed of an ordinary computer, we rely on Abel and Colossus to crunch those numbers for us,” Skotheim says.
A new approach to disease research
Supercomputers like the ones at the University of Oslo are revolutionizing cancer research. The approach used by Skotheim’s group allows them to parallelize certain tests – like checking the expression level of a gene, or whether or not it is mutated – by analyzing all genes in a single experiment. That saves the researchers from having to make educated guesses as to which genes should be scrutinized in a particular set of disease samples – and means that, ultimately, doctors won’t have to guess which genes to test in patients, either.
The Genome Biology Group is particularly interested in fusion genes in cancer, as they are usually specific to the cancer cells and thus particularly useful in diagnostics. Some might even encode a fusion protein that can be targeted therapeutically! But this is where the supercomputers really come into play. If scientists were to test the RNA in a cancer sample for fusions between every possible combination of two genes among the 20,000 genes present (roughly 200 million gene pairs), taking into consideration the fact that the genes can be fused anywhere along their sequences, they would have to set up an impractically large number of tests to check for all possible fusions in a cancer sample. And then, on top of that, they would ideally test multiple samples! Of course, sequencing all of the RNA in a cancer sample helps researchers know what to look for when they’re searching for fusion genes – but doing that typically generates about 20 million short RNA sequences. Then comes the job of understanding where in the genome each of those sequences originates and when they reliably match two separate genes. Ambitious studies like these involving large volumes of sequence data require heavy parallelization and combinatorics – things that the supercomputers can facilitate.
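To give a rough sense of scale for the combinatorics described above, the short calculation below counts the possible gene pairings among 20,000 genes. The gene count comes from the article; the number of candidate breakpoints per gene is purely an illustrative assumption, since in reality a fusion can occur almost anywhere along a gene.

```python
from math import comb

N_GENES = 20_000  # approximate number of human genes, as quoted in the article

# Unordered pairs: which two genes are involved, ignoring their order.
unordered_pairs = comb(N_GENES, 2)

# Ordered pairs: which gene contributes the start of the fusion and which the end.
ordered_pairs = N_GENES * (N_GENES - 1)

# Illustrative assumption only: 10 candidate breakpoints per gene.
BREAKPOINTS_PER_GENE = 10
candidate_junctions = ordered_pairs * BREAKPOINTS_PER_GENE**2

print(f"Unordered gene pairs: {unordered_pairs:,}")      # 199,990,000
print(f"Ordered gene pairs:   {ordered_pairs:,}")        # 399,980,000
print(f"Candidate junctions:  {candidate_junctions:,}")  # 39,998,000,000
```

Even with this deliberately small breakpoint assumption, the search space runs to tens of billions of candidates per sample, which is why testing each possibility one at a time at the bench is not realistic.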
“So far,” says Skotheim of his group’s work, “we have identified and published new fusion transcripts from both prostate and colorectal cancers. Combining whole-genome and RNA sequencing of colorectal cancers revealed novel fusion transcripts and splice variants of the WNT effector gene TCF7L2 (1), which acts as a transcription factor in its native form.” He adds that they have also identified three other novel intrachromosomal fusion transcripts, AKAP13-PDE8A, COMMD10-AP3S1, and CTB-35F21.1-PSD2, which are also the most highly recurring chimeric transcripts for colorectal cancers (2). Though they don’t yet know the functional and clinical significance of these chimeric RNA molecules, they hope to elucidate that in the future, because the main goal is to develop some of these new, cancer-specific transcripts into clinically useful biomarkers or drug targets. “It would be great to see a successful clinical implementation of our work, and to know that our research has contributed to improved medical care for cancer patients around the world.”
Supercomputers in the laboratory
But the Genome Biology Group’s studies aren’t the only research that can benefit from supercomputing – advanced technology and bioinformatics have the potential to support all kinds of benchtop work. Wet lab experimentation is often done in only one or a few samples at a time, testing for anomalies in only a few genes or proteins. This kind of selectivity requires deep insight to design the best experiments, and can only be done at low throughput. With the advent of new genome technologies, together with computational techniques, some hypotheses can be generated on a whole-genome scale. They can even be tested on all 20,000 or so human genes at once! Subsequent experimental validation can then be based on data about all of the genes or transcripts in a particular specimen, rather than just a few pieces of information.
“I’m optimistic about the future of computational genomics in cancer research, because I think it’s a key component of modern biology,” Skotheim says. “After all, the amount of data generated in genetics is already overwhelming to many geneticists. Every time you run a high-throughput sequencer, you generate terabytes of data. Current trends indicate that the data flow into genetics will just continue to grow – which means that people who are highly skilled in computer technologies are absolutely essential at this point. Not only do we need people who are able to handle the infrastructure for storing and processing large amounts of data, but ideally they’ll combine this competence with an understanding of the genome biology of cancers. Only by developing these skills in new researchers can we expect future genomics data to be handled in an insightful manner – and that’s what we’ll need if we want to continue to turn that data into reliable, innovative and beneficial research for the cancer patients we hope to support.”
Rolf Skotheim leads the Genome Biology Group in the Department of Molecular Oncology, Institute for Cancer Research, Oslo University Hospital-Radiumhospitalet. He is also an associate professor in the Department of Informatics, University of Oslo, Norway.
1. T Nome, et al., “High frequency of fusion transcripts involving TCF7L2 in colorectal cancer: novel fusion partner and splice variants”, PLoS One, 9, e91264 (2014). PMID: 24608966.
2. T Nome, et al., “Common fusion transcripts identified in colorectal cancer cell lines by high-throughput RNA sequencing”, Transl Oncol, 6, 546–553 (2013). PMID: 24151535.