Translational Proteomics: Solving the Reproducibility Riddle
Data analysis in proteomics is not fit for purpose – here’s how we can get on track
Proteomics, with its unlimited potential for biomedicine, has so far fallen short. I believe the reason is simple: sophisticated big data is being processed by simplistic bioinformatics with underpowered computers. Novices are dazzled by thousands of proteins characterized at the push of a button. But experts find that it is mostly common proteins that are correctly identified, much of the quantitation is suspect, and – critically – it is hard to tell whether an identification is correct. How can we improve the utility of proteomics for identifying important low-abundance proteins? The trick is to borrow data analysis from numerical data mining in physics, not abstract statistics.
Let’s say we run a pneumonia sample on a mass spectrometer to identify pathogens from their proteins. We process a gigabyte file containing 50K raw spectra with a fast PC program that identifies and quantifies peptides and proteins from 20 percent of the spectra at a 1 percent error rate. When analysis is this easy, who needs hypotheses or data understanding? We just need “better” software – defined as faster, cheaper, and reporting more proteins. Of course, this assumes that 1 percent error is good enough, that a self-estimated error is always robust, and that quantity means quality – all of which are obviously incorrect.
As an analogy, imagine a novel space telescope whose revolutionary accuracy eases data analysis. No cosmologist would acquire ad hoc imaging data and then shop, sight unseen, for the prototype software that identifies the most stars for publication. That unscientific approach would find thousands of bright stars but yield irreproducible discoveries of faint ones. Content-rich physical data are heterogeneous, with varying signal-to-noise, and deep data require exponentially more computing to scrub mathematically.
Experts can best interpret tricky data. But it’s impossible to uncover one-in-a-million breakthrough data points from particle colliders, telescopes, and now mass spectrometers without computers. Such data require semi-interactive divide-and-conquer – using servers to run overnight “what if” scripts to isolate interesting data pockets for interactive analysis.
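To make that workflow concrete, here is a minimal sketch in Python of the kind of overnight “what if” filter such a script might run; the Spectrum record, target masses, 10 ppm tolerance, and signal-to-noise floor are illustrative assumptions, not part of any particular pipeline.

```python
# Hypothetical overnight "what if" filter: keep only spectra whose precursor mass
# falls near a peptide of interest and whose signal-to-noise clears a floor.
# All fields and thresholds are illustrative assumptions.

from dataclasses import dataclass
from typing import List

@dataclass
class Spectrum:
    scan_id: int
    precursor_mass: float     # neutral monoisotopic mass, Da
    signal_to_noise: float

def isolate_pocket(spectra: List[Spectrum],
                   target_mass: float,
                   tol_ppm: float = 10.0,
                   min_snr: float = 5.0) -> List[Spectrum]:
    """Return the small 'pocket' of spectra worth interactive inspection."""
    tol_da = target_mass * tol_ppm / 1e6
    return [s for s in spectra
            if abs(s.precursor_mass - target_mass) <= tol_da
            and s.signal_to_noise >= min_snr]

if __name__ == "__main__":
    # Stand-in for tens of thousands of real spectra.
    spectra = [Spectrum(1, 1479.754, 8.2), Spectrum(2, 1479.812, 2.1),
               Spectrum(3, 2211.096, 6.7)]
    for target in (1479.7555, 2211.1040):   # hypothetical peptide masses of interest
        pocket = isolate_pocket(spectra, target)
        print(f"target {target}: kept scans {[s.scan_id for s in pocket]}")
```

Each sweep over a new target mass is one overnight “what if” question; the handful of surviving scans is what the expert inspects interactively the next day.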
For clinical research, 1 percent error is hopelessly imprecise. In infection research, where 99 percent of detected proteins are human, a 1 percent false discovery rate (FDR) could mean no pathogen information at all. For every 10K peptides identified, roughly 100 are incorrectly assigned, corrupting the quantitation of up to 100 proteins.
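The arithmetic is worth spelling out; in the short script below, the 1 percent FDR and the 99 percent human fraction come from the figures above, while the 10K peptide count is illustrative.

```python
# Back-of-the-envelope: what a 1 percent FDR means when pathogen peptides are rare.
# The peptide count is illustrative; the FDR and human fraction are from the text.

identified_peptides = 10_000    # peptide IDs passing the filter
fdr = 0.01                      # self-estimated false discovery rate
human_fraction = 0.99           # share of detected proteins that are human

false_ids = identified_peptides * fdr                      # ~100 wrong peptide IDs
pathogen_ids = identified_peptides * (1 - human_fraction)  # ~100 apparent pathogen hits

print(f"Expected false peptide IDs:    {false_ids:.0f}")
print(f"Apparent pathogen peptide IDs: {pathogen_ids:.0f}")
# The expected error count is the same order as the entire pathogen signal,
# so any individual pathogen hit could plausibly be one of the errors.
```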
Clinical research requires 100 percent accuracy for a few low-abundance proteins, not 99 percent accuracy across thousands of irrelevant abundant ones. It requires a precision paradigm centered on raw data, not probability models.
In conventional proteomics, data interpretation is outsourced to calculations few understand. Researchers subjectively choose a search engine, rely on its subjective probabilities to judge peptide IDs, depend on Bayesian inference to aggregate peptide IDs into protein identifications, and evaluate result quality with a single error estimate.
A precise and rigorous abstraction requires three changes. First, simplify protein inference by representing each protein with its longest identified peptide (ideally long enough to be protein-unique). Second, filter peptide IDs using only physical mass data, not model-based parameters such as search scores. Finally, demote the search engine from its central role to that of an “educated guesser” generating peptide ID hypotheses to be mass-filtered.
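Below is a minimal sketch, in Python, of what this abstraction could look like; the candidate-ID record, field names, and 5 ppm tolerance are illustrative assumptions rather than any specific product’s format.

```python
# Sketch of the proposed abstraction, under assumed data structures:
#  - a search engine supplies candidate peptide IDs (hypotheses only),
#  - candidates are filtered purely on physical mass agreement, never on scores,
#  - each protein is represented by its longest surviving (ideally unique) peptide.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CandidateID:                 # an "educated guess" from whichever search engine
    protein: str
    peptide: str
    theoretical_mass: float        # computed from the peptide sequence, Da
    measured_mass: float           # taken from the raw spectrum, Da
    is_protein_unique: bool

def mass_filter(cands: List[CandidateID], tol_ppm: float = 5.0) -> List[CandidateID]:
    """Keep candidates whose measured and theoretical masses agree within tolerance."""
    return [c for c in cands
            if abs(c.measured_mass - c.theoretical_mass)
               <= c.theoretical_mass * tol_ppm / 1e6]

def representative_peptides(cands: List[CandidateID]) -> Dict[str, CandidateID]:
    """Represent each protein by its longest surviving peptide, preferring unique ones."""
    best: Dict[str, CandidateID] = {}
    # Sort ascending by (uniqueness, length) so the last candidate seen per protein wins.
    for c in sorted(cands, key=lambda c: (c.is_protein_unique, len(c.peptide))):
        best[c.protein] = c
    return best
```

Because the filter depends only on measured and theoretical masses, each surviving identification remains auditable against the raw data.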
For example, in infection research, we develop a hypothesis, acquire data, and only then interpret the data. The experimental goal is to identify and characterize at least one critical peptide from its noisy spectrum. Importantly, this analysis can be manually validated by an expert.
We may hypothesize a certain pathogen, design a data-independent acquisition (DIA) experiment to maximize the odds of finding its protein-identifying peptides, and then do perhaps a dozen runs to try to capture the literally one-in-a-million spectra relevant to our hypothesis. Deep research is inherently a numbers game; new technologies simply help improve the odds.
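To put rough numbers on that numbers game (the per-run probabilities below are made-up assumptions for illustration, not measured figures): if a single run captures a usable diagnostic spectrum with probability p, the chance of at least one capture across n independent runs is 1 - (1 - p)^n.

```python
# Illustrative odds for the "numbers game": chance of capturing at least one usable
# diagnostic spectrum across repeated runs. Per-run probabilities are made up.

def p_at_least_one(p_per_run: float, n_runs: int) -> float:
    return 1.0 - (1.0 - p_per_run) ** n_runs

for p in (0.05, 0.10, 0.20):
    print(f"p per run = {p:.2f}: "
          f"1 run -> {p_at_least_one(p, 1):.0%}, "
          f"12 runs -> {p_at_least_one(p, 12):.0%}")
```

Even a modest per-run probability compounds quickly over a dozen runs, which is why repeated acquisition, not a single push-button analysis, is what raises the odds of a real discovery.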
In my view, the narrative that omics means hypothesis-free science is fundamentally flawed. The role of computers and artificial intelligence is to assist – not to replace – scientists who formulate hypotheses and interpret data.
David Chiang is Chairman of Sage-N Research, Inc., Milpitas, USA.