Behind the Curve
Education and new tools critical to solving the slow adoption of updated reference assembly data
The Human Genome Project (HGP), in producing the human reference genome assembly, generated a pivotal resource that promised to transform our understanding of human biology and change the future of medicine. The availability of the reference assembly and the ensuing research have transformed the way we see ourselves, the way we diagnose disease, and in some cases, the way we treat patients. The success and reliability of each of these depend on the quality and completeness of the genomic data.
Errors, mis-assemblies and incomplete or missing data in the human reference assembly all have the potential to undermine downstream analyses – including diagnostic testing resources. The Genome Reference Consortium (GRC*) was established after the conclusion of the HGP to improve the assembly, ensuring it continues to represent our latest understanding of human genomic biology and serve as the best possible substrate for efforts to advance human health. Though it seems obvious that outdated or incorrect data can negatively affect analyses, researchers and clinical testing labs are generally slow to transition to new assembly versions. In turn, assembly improvements are slow to reach the public.
The failure to transition to a new version largely results from two factors: ignorance of existing assembly problems or new assembly improvements, and limited resources – for data migration, interpretation of migrated data, or use of new assembly model features. Clinical labs are further hampered by the lack of robust validation sets for confirming tests on new assembly data. As long as the cost of transitioning data is perceived to be greater than the cost of using outdated and incorrect data, there will be resistance to the adoption of updated assemblies.
To change this perception, we need not only education and resource development, but also the combined efforts of the GRC and tool developers working in basic and clinical settings. To that end, GRC workshops at the 2015 meetings of the American Society of Human Genetics and the Association for Molecular Pathology explored the human reference assembly, highlighted recent changes, and discussed their clinical and diagnostic implications.
The latest version of the human genome reference assembly, GRCh38, reflects nearly four years of curatorial efforts by the GRC. These include the resolution of more than 40 issues reported by clinical labs and known to affect development or interpretation of genetic tests. For example, an inversion error in prior assembly versions precluded a reference representation for the PTPRQ gene (which, when mutated, is associated with deafness); that error has now been corrected. In other cases, the addition of new chromosomal sequences to GRCh38 has added entire gene representations that were missing from previous assembly versions, such as KCNJ18, in which some variants result in thyrotoxic periodic paralysis. Improved reference representation of these and other disease-associated genes in GRCh38 should not only permit the development of more complete genetic testing panels but also improve interpretation of test results.
Tools like the NCBI Genome Remapping Service, UCSC LiftOver, and Ensembl remapping API facilitate adoption of new assemblies by transforming data based on one assembly version to another. However, not all data will map from one assembly to another and some will map to multiple locations, complicating analyses. Additionally, in certain situations data may map differently depending on the remapping resource used. Having a greater appreciation for the reference assembly and the changes it has undergone makes it easier to understand remapping discrepancies and their possible implications for data interpretation.
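The three remapping outcomes described above (no mapping, a single mapping, multiple mappings) and disagreement between tools can be tallied with a few lines of code. The following is a minimal sketch, not the output format or API of any of the named services: it assumes the results from each tool have already been parsed into a dictionary from a feature ID to a list of (sequence, start, end) locations, and every identifier and coordinate in it is made up for illustration.

```python
from collections import Counter


def classify_remap(hits):
    """Label one feature by how many target locations it remapped to."""
    if not hits:
        return "unmapped"
    if len(hits) == 1:
        return "unique"
    return "multiple"


def summarize(remapped):
    """Count features per remapping outcome."""
    return Counter(classify_remap(hits) for hits in remapped.values())


def discordant(tool_a, tool_b):
    """Return feature IDs whose remapped locations differ between two tools."""
    ids = set(tool_a) | set(tool_b)
    return sorted(
        i for i in ids if set(tool_a.get(i, [])) != set(tool_b.get(i, []))
    )


# Hypothetical, hand-written results standing in for parsed tool output.
tool_a = {
    "var1": [("chr1", 1000, 1001)],
    "var2": [],
    "var3": [("chr6", 500, 501), ("chr6_EXAMPLEv1_alt", 40, 41)],
}
tool_b = {
    "var1": [("chr1", 1000, 1001)],
    "var2": [("chr2", 77, 78)],
    "var3": [("chr6", 500, 501)],
}

print(summarize(tool_a))
print(discordant(tool_a, tool_b))
```

Features flagged as unmapped, multiply mapped, or discordant are the ones that most need manual review against the documented assembly changes.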
Assembly improvements also include additional allelic sequences that exist outside the chromosome coordinate space. Alternate loci scaffolds provide representation for genomic regions at which a single chromosomal sequence representation is insufficient to capture population genomic diversity. GRCh38 provides more than 30 different representations of A and B haplotypes in the KIR region, where genetic variation is associated with susceptibility to autoimmune disorders and possibly HIV infection. Likewise, the new assembly contains a growing representation of genomic variants at the CYP2D6 locus, a gene that plays a critical role in the metabolism of many drugs.
Patches represent another class of extrachromosomal sequences and provide fast access to assembly corrections and new variants between the infrequent major coordinate-changing releases. Additional CYP2D6 variants have been added since the initial GRCh38 release as novel patches, while fix patches have further improved gene representation in the assembly. However, some labs and organizations do not yet recognize or include the more than 150 off-chromosome gene representations, precluding their use entirely in the development of new resources. Furthermore, many of the file formats and tools used for variation analysis have not yet evolved to use these additional sequences, which makes the barrier to their adoption even more difficult to overcome.
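One way for a lab to gauge how much of its data touches these off-chromosome sequences is to tally records by sequence class. The sketch below assumes a naming scheme in which alternate loci and novel patches end in `_alt` and fix patches end in `_fix`, as in UCSC-style sequence names; that convention is an assumption to verify against your own reference files, and the sequence names used here are hypothetical.

```python
from collections import Counter

# Assumed suffix convention (UCSC-style names); verify against your reference.
SUFFIX_CLASSES = {"_alt": "alternate_or_novel", "_fix": "fix_patch"}


def sequence_class(name):
    """Classify a sequence name by its suffix."""
    for suffix, label in SUFFIX_CLASSES.items():
        if name.endswith(suffix):
            return label
    return "chromosome_or_other"


def tally_by_class(seq_names):
    """Count records per sequence class."""
    return Counter(sequence_class(n) for n in seq_names)


# Hypothetical sequence names attached to five variant calls.
calls = [
    "chr22",
    "chr22_EXAMPLE1v1_alt",
    "chr19",
    "chr19_EXAMPLE2v1_fix",
    "chr22",
]
counts = tally_by_class(calls)

# Records that a chromosome-only workflow would silently discard.
off_chromosome = sum(
    n for label, n in counts.items() if label != "chromosome_or_other"
)
print(counts, off_chromosome)
```

A nonzero off-chromosome count indicates data that would be lost by a pipeline restricted to the chromosome coordinate space.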
Timely adoption of assembly updates requires effective education about the reference assembly, tools to facilitate the transition of data between assemblies, and workflows that fully support all sequences in the reference genome assembly. More information on each of these is available from GRC workshops, the GRC website (http://www.genomereference.org), and publications (1,2). The GRC welcomes feedback, especially from the clinical community, about experiences with prior and current reference assembly versions.
*The GRC is a collaboration between The Wellcome Trust Sanger Institute and the European Bioinformatics Institute (Hinxton, UK), The McDonnell Genome Institute at Washington University (St. Louis, USA), and The National Center for Biotechnology Information (Bethesda, USA).
Valerie Schneider leads the National Center for Biotechnology Information (NCBI) team, based in Bethesda, Maryland, USA, for the Genome Reference Consortium (GRC), the group responsible for updating the reference genome assemblies for several organisms, including human and zebrafish.
1. DM Church et al., “Modernizing reference genome assemblies”, PLoS Biol, 9, e1001091 (2011). PMID: 21750661.
2. DM Church et al., “Extending reference assembly models”, Genome Biol, 16, 13 (2015). PMID: 25651527.