
Working with the IT Crowd

At a Glance

  • Scientific computing involves high-performance analysis, advanced visualization and modeling, and day-to-day laboratory data management
  • Those who work in scientific computing divide their time between providing research services and designing and building new technology infrastructure
  • Laboratory medicine professionals should never be afraid to ask questions about their data, networking, or infrastructure capabilities – especially before buying expensive equipment
  • Microscopy has the capacity to generate more data than sequencing – and as science and medicine become more data-heavy, the paths of pathologists and computer scientists will cross more frequently

“Scientific computing” is often treated as a catch-all term that means different things to different people. I define it as the application of computers to scientific research, which in a biology laboratory can mean one of three different things:

  1. High-performance computing (for mathematics and large-scale data analysis)
  2. Advanced data visualization (including 3D modeling)
  3. Laboratory data management (sample tracking, data collection and analysis)

All three of these aspects are becoming increasingly commonplace as laboratories move toward digital work. In biomedical research, the volume of sequencing is skyrocketing – and we’re doing a lot more image processing, mathematical modeling and analysis of large amounts of data, too. In pathology, as digital imaging and telepathology slowly encroach on traditional onsite methods, the demands for scientific computing are growing ever greater. But even with these advances, many users aren’t well-acquainted with scientific computing. So who are we – and how can we best provide resources to the people in the laboratories?

A day in the life

In my department – IT and Scientific Computing for the Cancer Research UK Cambridge Institute – our days are spread across three sets of tasks. Firstly, we are responsible for operating all of the equipment. We have storage arrays; we have power-hungry supercomputers; we have standard and high-end workstations; and, of course, we run all the networks – from high-performance right down to the wireless setup everyone uses. We have to keep everything running smoothly, monitor performance, handle security, and deal with any incidents. If you manage your own home network, you can probably imagine that it fills up a lot of time…

Secondly, in response to what the scientists need, we have to build new systems. That means buying new storage systems, building new servers, replacing the high-performance computing systems… We’re always working on large-scale infrastructure and software projects, which is how we not only stay up to date with technological advances but also build in room for growth.

The third thing we have to do – because we’re not always replacing last year’s equipment with this year’s version of the same thing – is keep abreast of the latest developments, particularly in high-performance computing, networking and software. Most of our customers are biomedical scientists, and the techniques they use change very rapidly, so we have to make sure we provide the right systems to support them. That means spending quite a lot of our time looking one step ahead, whether it’s investigating graphics processing unit (GPU) computing for visualization or considering low-energy computing to grow our data center. We study any area undergoing significant change so that we can bring that knowledge back to the Institute.

What do we compute?

Most of our high-performance computing resources support the handling of large DNA sequencing datasets – of which there are many. All sequence data has to be aligned to reference genomes and then interpreted, which takes up the bulk of our CPU hours. Less common are the mathematical tasks we undertake, including Bayesian modeling, systems biology, and data visualization at the single-cell level. The last of these is key because it helps us better understand diseases like cancer: we have new imaging platforms that allow us to unpick the real-time behavior of single cells and to reconstruct three-dimensional tissues down to the subcellular level. It’s both visually impressive and medically fascinating.
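
To give a concrete flavour of that alignment step, a single sample might pass through something like the Python sketch below, which simply wires two standard command-line tools together. It is a minimal illustration only – the aligner, reference build, file names and thread count are all assumptions, not a description of our actual pipeline.

    # A minimal sketch (not our actual pipeline) of the alignment step described
    # above: map paired-end reads to a reference genome with bwa, then produce a
    # coordinate-sorted BAM with samtools.
    import subprocess

    reference = "GRCh38.fa"                                # assumes `bwa index GRCh38.fa` has been run
    reads = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]   # hypothetical paired-end reads

    # bwa mem streams SAM to stdout; samtools sort reads it from stdin and writes a sorted BAM.
    align = subprocess.Popen(["bwa", "mem", "-t", "8", reference, *reads],
                             stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam", "-"],
                   stdin=align.stdout, check=True)
    align.stdout.close()
    align.wait()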

But the real workhorse is still genomics. We generate tens of terabytes of genomic data every week, so the machines must run 24 hours a day to keep the analysis pipeline from falling behind. Over the last decade or so, genomic data has represented around 80 percent of the data produced and processed here at the Institute, and it’s about to ramp up further; the next generation of equipment can generate data at a terrifying rate – one that will genuinely push the boundaries of modern networking capabilities.

Conversely, microscopy has traditionally been very human-driven – a pathologist was always needed to examine and annotate the images manually. But it’s amazing how much microscopy has advanced. We used to be limited by the diffraction barrier, but now? So many techniques bypass it that it’s rarely even mentioned anymore! Even with a light microscope, we can see single cells at incredible resolution – and combined with other techniques, we don’t need to stop at just one cell; we can image hundreds at once. We’re still catching up to our own capabilities in that respect, because our old, familiar techniques – counting, measuring, highlighting – are time-consuming. It’s all well and good generating 100 cells’ worth of data in a single microscopy session, but without a computer to analyze the data at the other end, there’s little point – the limiting step will always be the pathologist. Once you have the algorithms and automated techniques to make it worthwhile, though, microscopy could easily generate more data than sequencing. I think we’ll soon see new imaging platforms that are much more heavily automated than previous generations.
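
To give a flavour of what “a computer at the other end” can mean in practice, here is a minimal, hypothetical sketch of automated cell counting in Python with scikit-image. The image file, threshold choice and size cut-off are assumptions for illustration; real pipelines add far more segmentation refinement and quality control.

    # A hypothetical sketch of automated cell counting: threshold a single-channel
    # microscopy image, label the connected objects, and report their number and size.
    from skimage import io, filters, measure

    image = io.imread("nuclei.tif")               # assumed 2D grayscale image of stained nuclei
    mask = image > filters.threshold_otsu(image)  # global Otsu threshold separates cells from background
    labels = measure.label(mask)                  # label each connected foreground object

    # Keep objects above an (arbitrary) minimum area to discard debris.
    cells = [r for r in measure.regionprops(labels) if r.area > 50]

    print(f"Counted {len(cells)} cells")
    if cells:
        mean_area = sum(r.area for r in cells) / len(cells)
        print(f"Mean area: {mean_area:.1f} pixels")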

Know your needs

The Cambridge Institute has a dedicated scientific computing department in part because we don’t expect biomedical scientists to know everything that can be done with technology. Our role is to understand just enough about the biology and the available technology to help researchers identify their needs and design their ideal system. Right now, for instance, one research group is buying a new light sheet microscope, so we’re discussing storage, analysis, and workstation options – along with anything else they may need to know. My team’s approach is to maintain a close working relationship with scientists and clinicians so that we can let them know what’s possible – and what’s not.


The other thing we’re seeing, especially in genomics, is a shift in the way research groups are structured. Gone are the days when a group was entirely composed of biologists, maybe bringing in a statistician for mathematics-heavy work. Nowadays, successful groups always have at least one bioinformatician for analysis and data management, which also avoids the need to partner with an external analysis team – an arrangement that can breed miscommunication or dilute the author list on an eventual publication. I don’t think we are quite at the stage where every research group needs someone who can design their own computing architecture. Why? Because, no matter what you’re researching, there’s almost always someone at least one step ahead of you – and most of them are perfectly happy to talk about their technical architecture. All you have to do is ask questions, which is something we do on behalf of our researchers.

Practical computing

That said, there is definitely an argument for having all life sciences researchers trained in at least basic bioinformatics. Not very long ago, you could probably get away with being a biologist or a clinician without knowing statistics (though if you had a good grounding, you would go further and faster). PhD students used to have a couple of spreadsheets of data and a few images to analyze; now, it’s a couple of terabytes of data and a few thousand images. You can’t do that manually, so data manipulation is now a standard part of any biomedical researcher’s toolkit.

If you’re looking to move into data-heavy research, don’t assume you need to know everything about computer science. Focus on the skills you need. You can often figure out what those are by seeing what others in your field have learned. Then, start with something you’re going to use. Don’t go off and learn how to program in C++ if what you need to do is manipulate files and analyze data. Languages like Perl, Python and R are well suited to that kind of work and widely used by biologists – they’re a good way to get into scientific computing. Once you’re familiar with them, you can begin to understand how much you can do with the available tools and to what extent your particular needs call for something more advanced. Even then, don’t jump in at the deep end: start with a bioinformatics course rather than going straight into computer science studies. Learn skills that follow on from what you already know.
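
For instance, the kind of everyday file manipulation and data analysis mentioned above might look like the short Python sketch below. The file and column names are made up for illustration, but the pattern – read a table, filter it, summarize it – is the bread and butter of day-to-day analysis.

    # A minimal, hypothetical example of everyday data manipulation: read a table of
    # measurements, keep one experimental condition, and summarize values per sample.
    import pandas as pd

    samples = pd.read_csv("measurements.csv")      # assumed columns: sample_id, condition, value

    treated = samples[samples["condition"] == "treated"]
    summary = treated.groupby("sample_id")["value"].agg(["mean", "std", "count"])

    print(summary)
    summary.to_csv("treated_summary.csv")          # save for the next step of the analysis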

The key benefit of scientific computing is that you can design experiments that generate a lot more data. For example, in the days of Sanger sequencing, you had to establish a very simple hypothesis and be careful about what you chose to sequence. Now, you can look at clonal mixtures in cancer, see the signature of each clone, and work out their relative proportions; you can build complex mathematical models to test causal hypotheses; you can examine effects down to the cellular level. In other words, your experiments can be more powerful, generating large amounts of data in parallel – and that makes your science much more quantitative, gives your hypotheses better support, and may even allow you to discover things you didn’t know to look for.

The downside? As you acquire more data, the likelihood of losing discoveries in the noise and complexity is much higher. Therefore, you need access to far better mathematical modelers and statisticians than in the past to help you pick out the interesting things. The good news is that, with the increase in data volume, there are many more of those interesting things than before!

Working with your computing department

I came from a chemistry and physics background where the line between my discipline and computer science was always blurred. When I first started biomedical research, there was a gulf between the biologists and clinicians and the bioinformaticians and computer scientists. But the gulf is narrowing fast; firstly, because the general population is much more computer-literate and secondly, because quality (and therefore publishable) science nowadays demands complicated analyses of large datasets. Successful research groups know they need mathematical, statistical and programming expertise, so they either learn the skills or partner with someone who already has them. People no longer knock on the scientific computing department’s door with basic computing questions; they come to us with large-scale physical infrastructure needs.

There are two things I’d like people to consider. One: those in scientific computing are there because they’re interested in science – otherwise, they’d be working in a bank. There can be a tendency to bring problems to us only once they’ve become technology issues, rather than earlier, when they’re still experimental design questions. I’d like our research and clinical colleagues to know that it’s never too soon to involve your computing team in the design of your experiments and systems. And two: never, ever buy expensive lab equipment without asking some hard questions about the type of data it’s going to produce – and whether you have the infrastructure to support it. Look at the infrastructure others in the field have put in place to get an idea of the scale of computing you’ll need, and don’t wait until after you’ve bought the device to ask your computing providers whether your network and systems can cope with it. When our first next-generation sequencers arrived, we had to completely change the IT infrastructure of the building; now, as our new microscopes arrive, we’re having to do it again. Those kinds of changes have a big impact on us – so don’t buy a new technology on a salesperson’s word alone. See it in someone else’s working environment first, so that you can anticipate the infrastructure changes it will necessitate. Talk to them, and then talk to us!
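
One way to start those conversations is with a back-of-the-envelope calculation like the sketch below. The figures are entirely hypothetical, but comparing an instrument’s sustained and burst output against your network link speed quickly shows whether the network, rather than the instrument, will be the bottleneck.

    # Back-of-the-envelope check, with hypothetical numbers, of whether an existing
    # network can cope with a new instrument's output.
    terabytes_per_week = 30                 # assumed weekly output of the instrument
    seconds_per_week = 7 * 24 * 3600

    sustained_gbit_s = terabytes_per_week * 8e12 / seconds_per_week / 1e9
    print(f"Sustained rate: {sustained_gbit_s:.2f} Gbit/s")   # ~0.40 Gbit/s

    # Instruments rarely stream evenly: assume each 5 TB run must be moved to
    # central storage within four hours of finishing.
    burst_gbit_s = 5 * 8e12 / (4 * 3600) / 1e9
    print(f"Burst rate:     {burst_gbit_s:.2f} Gbit/s")       # ~2.78 Gbit/s

    # If the burst figure approaches your link speed (say, a shared 10 Gbit/s
    # uplink), the network - not the instrument - becomes the limiting factor.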

What lies ahead

One big challenge for the future is that there are not enough computing people with life sciences skills. The next generation of bioinformaticians and image analysts will increasingly have to take on the role of scientific computing expert. In the physical sciences, researchers have always done their own analytical computing, and I think that will increasingly become true in biology and medicine. Computer scientists understand computing, but we don’t necessarily understand the techniques biomedical researchers use or how best to apply them. As technologies on both sides advance, we’ll have to work more closely together than ever.

Another challenge is the sheer amount of data. We’ll have to store it, move it, back it up, and make decisions about what to keep, what to discard, and when. There are technologies that can cope with it all, but they’re not all necessarily in the same place – so we have to use something most people have heard of, but fewer understand: “the Cloud.” Users have to learn that the Cloud is not a single place; it’s many places, and you have to build up a mental map of where your data is, where you need it to be, and how you can get it from one place to another. And if you’re not sure, ask your computing department! People should never be afraid to ask where their data is and how they can access it. Sometimes, just understanding what’s going on can save you a fortune or a lot of time – or prevent a data loss disaster.

On the whole, the future of scientific computing looks good. We’re seeing a shift from CPU to GPU computing, which means that computations that now take minutes should soon take only seconds – although we’ll have to stay on top of the change (GPUs require very different programming; analyses that work on current computers will need to be ported to these new technologies). Changes to memory hierarchies are also underway, which means that people’s interactions with computers, from simple email all the way to high-level image analysis, will become much faster. Data storage technology is advancing rapidly – driven by Facebook and YouTube, of course, but we in science benefit from it as well! I’m looking forward to seeing how the paths of researchers, clinicians and computer scientists converge over the next few years – and excited about the knowledge we stand to gain as a result.
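
To illustrate the GPU porting mentioned above, the sketch below assumes CuPy as a GPU-backed, NumPy-like library: the computation itself is unchanged, but the data now has to be moved to and from the device, which is exactly the kind of change existing analyses need when they migrate to GPUs. The operation and array size are arbitrary choices for illustration.

    # Hypothetical illustration of porting a CPU (NumPy) computation to a GPU (CuPy).
    import numpy as np
    import cupy as cp   # assumes a CUDA-capable GPU with CuPy installed

    data = np.random.rand(4096, 4096)

    # CPU version: runs anywhere, limited by CPU cores and memory bandwidth.
    result_cpu = np.fft.fft2(data)

    # GPU version: the same operation, but data must be copied to and from the device.
    data_gpu = cp.asarray(data)            # host -> device transfer
    result_gpu = cp.fft.fft2(data_gpu)     # computed on the GPU
    result_back = cp.asnumpy(result_gpu)   # device -> host transfer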


About the Author
Peter Maccallum

Peter Maccallum is head of the IT and Scientific Computing team at the Cancer Research UK Cambridge Institute.
