When the two proteome maps appeared in Nature, the numbers certainly raised some eyebrows. My colleagues and I are part of the GENCODE consortium, which is annotating the human genome, so we are very interested in large-scale proteomics information. We were also in the process of publishing our own analysis (1), and we were surprised by what these papers were reporting. How had they managed to find more protein products from genes than any previous experiment of this kind, finding several thousand more genes than the entire combined efforts of the worldwide human genome project, all without any kind of technological breakthrough?
When we looked at our data, we noticed we had not identified any peptides for olfactory receptors (ORs). Further, other databases, such as PeptideAtlas and PRIDE-Q (2), which I consider to contain high quality data, also identified very few ORs. We therefore reasoned that a study which identifies multiple ORs (Pandey’s group found 108, Kuster’s 200) is likely to be unreliable. We decided to investigate.
We carried out a quality test on the ORs the groups had found, and this produced some concerning results. For example, the Pandey data shows that ORs are most highly expressed in the liver (3). For us, this confirmed what we had suspected – the data was not reliable.
We carried out a reanalysis of the peptides detected in both experiments, and found reliable evidence for between 7,500 and 8,000 of the genes they identified. The Pandey group’s data was entirely their own, published previously in the Journal of Proteome Research. The Kuster group carried out comparable experiments on a similar number of tissues (using CellZome technology), but their paper also included results from a reanalysis of the spectra from previously published large-scale experiments. However, they did not provide the results of that reanalysis, so we could only analyze the CellZome data, which makes up 25 percent (roughly 4,500 peptides) of the Kuster data. Even so, the CellZome data alone identifies genes for 36 ORs.
The Pandey group reported 17,296 genes and the Kuster group over 18,000. I personally believe that the Mann group (4, 5) identified as many protein-coding genes as, if not more than, the Pandey group and Kuster’s CellZome experiments. We carried out a comparable analysis of the Mann group experiments at the same time as the proteome map data, and after filtering the peptides we found that the various studies had identified 8,050 genes (Nagaraj et al.), 8,929 (Geiger et al.), 7,972 (Kuster CellZome) and 7,458 (Pandey). This led us to conclude that the two proteome maps contained questionable data.
Our analysis identified many factors which I think contributed to this data inflation: the inclusion of poor quality spectra, using a single peptide to identify multiple genes, confusion between leucine and isoleucine, the use of two search engines to increase the peptide coverage rather than to increase the reliability of the peptide spectrum matches, and the combination of multiple experiments (which ratchets up false positive rates). Some of the problems we identified only affected one of the two data sets and some affected both (3).
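To make two of these factors concrete (leucine/isoleucine confusion and a single peptide being used to identify multiple genes), here is a minimal sketch in Python. It is an illustration only, not the pipeline used in the analysis cited above: the gene names, sequences, and function names are all hypothetical, and a real workflow would also have to deal with spectrum quality, enzyme cleavage rules, and false discovery rates. Leucine and isoleucine have the same mass, so the sketch treats them as one residue before asking which genes a peptide can map to.

```python
from collections import defaultdict


def normalize_il(sequence: str) -> str:
    """Collapse isoleucine (I) into leucine (L); the two residues are isobaric."""
    return sequence.upper().replace("I", "L")


def uniquely_supported_genes(proteins, peptides, collapse_il=True):
    """Return genes backed by at least one peptide that maps to exactly one gene.

    proteins: dict of hypothetical gene name -> protein sequence
    peptides: iterable of observed peptide sequences
    collapse_il: if True, I and L are treated as indistinguishable
    """
    prepare = normalize_il if collapse_il else str.upper
    prepared = {gene: prepare(seq) for gene, seq in proteins.items()}

    support = defaultdict(set)
    for peptide in peptides:
        key = prepare(peptide)
        # Every gene whose product contains this peptide is a candidate.
        hits = {gene for gene, seq in prepared.items() if key in seq}
        # A peptide shared by several genes is ambiguous, so it is not
        # counted as evidence for any of them.
        if len(hits) == 1:
            support[next(iter(hits))].add(key)
    return set(support)


# Toy data (hypothetical): GENE_A and GENE_B differ only by I/L swaps.
proteins = {
    "GENE_A": "MKTLLVAGIR",
    "GENE_B": "MKTLIVAGLR",
    "GENE_C": "MSPEQWDK",
}
peptides = ["TLLVAGIR", "SPEQWDK"]

naive = uniquely_supported_genes(proteins, peptides, collapse_il=False)
strict = uniquely_supported_genes(proteins, peptides, collapse_il=True)
print(sorted(naive))   # GENE_A and GENE_C
print(sorted(strict))  # GENE_C only
```

In this toy case the naive match credits GENE_A with a peptide that, once I and L are treated as equivalent, fits GENE_B equally well, so the stricter count drops from two genes to one.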
These two studies stand out because they analyzed a wide range of human tissues, rather than cell lines. It’s possible that research groups carrying out tissue-specific experiments will use this data as a gold standard, and even now will be writing proposals based on it. This concerns me because I think, at best, this data will not aid good scientific research. At worst, I suspect using this data could be a poor use of time and resources. In situations like these, the onus is on the authors to provide information that is as accurate as possible.
Large-scale evidence for cross-tissue peptide expression would be a real step forward for proteomics. However, the information provided by these draft proteome maps cannot be used without first filtering out large amounts of possibly unreliable data.
1. I. Ezkurdia et al., “Multiple Evidence Strands Suggest That There May Be As Few As 19,000 Human Protein-Coding Genes”, Hum. Mol. Genet. (2014) [Epub ahead of print].
2. J.A. Vizcaino et al., “The Proteomics Identifications (PRIDE) Database and Associated Tools: Status in 2013”, Nucleic Acids Res., 41 (D1), D1063-D1069 (2013).
3. I. Ezkurdia et al., “Analyzing the First Drafts of the Human Proteome”, J. Prot. Res., 13, 3854-55 (2014).
4. T. Geiger et al., “Comparative Proteomic Analysis of Eleven Common Cell Lines Reveals Ubiquitous but Varying Expression of most Proteins”, Mol. Cell. Prot., 11(3), 1-11 (2012) doi: 10.1074/mcp.M111.014050.
5. N. Nagaraj et al., “Deep Proteome and Transcriptome Mapping of a Human Cancer Cell Line”, Mol. Syst. Biol., 7, 548 (2011).
Michael Tress is a staff scientist at the Spanish Cancer Research Centre (CNIO), Madrid, Spain.