The Scattered State of the Reference Genome
Lately, I’ve been intrigued by the variety of reference genomes used across the genomics community. The picture of just how many different versions of the genome exist began to become clear as I researched what reference genome other major projects were using for this RNA-Seq pipeline RFC. I did a moderately comprehensive analysis of these in this Jupyter notebook, but this post serves as a short commentary on what I found, why I think it matters, and what the takeaways are moving forward.
The concept of the reference genome is an important one in bioinformatics: before we can understand variation in an individual’s genome, we must first have a reference that we compare against (akin to the picture on the front of a puzzle box). At the time of writing, the most up to date version of the reference genome is GRCh38.p13. The first version of
GRCh38 debuted in 2013, but it has been receiving semi-regular patches since that time (thus, the
.p13 for “patch 13”).
Though this reference included many improvements, the community was relatively slow in migrating from
GRCh37 (introduced in 2009). This was likely for variety of reasons, but maybe most importantly because analysis pipelines are built around the particular version of the reference genome used: any update to the genome requires tens/hundreds/thousands of other dependent reference files to be updated as well, end-to-end concordance checks to ensure the results aren’t affected negatively by the change (often time consuming and complicated to interpret), and then potentially reprocessing your entire cohort (time consuming and computationally expensive).
GRCh38 defined the most complete reference genome to date, including sequence-based representations of centromeres and additional alternate loci representative of a more diverse set of the population. Unfortunately, using the genome as published complicates many bioinformatics analyses, so most projects use the modified
GRCh38_no_alt analysis set. In this build of the genome, (1) the alternate haplotypes are not included, (2) the pseudoautosomal regions on chromosome Y + many centromeric regions have been omitted by hard masking the genome, and (3) the EBV genome was included. This build limits some of the benefits that GRCh38 promised (particularly the alt haplotypes), but as Heng Li pointed out in 2017, there is no perfect reference genome and the no alt analysis set appears to be the best option for now. I’ll also note here that the no alt analysis set is not updated with the new patches of the reference genome, so the years of information we’ve learned about the genome since 2013 is not being leveraged in projects that use it.
Fast forward two years later: all of the major projects I surveyed use the base GRCh38 no alt analysis set as their reference genome, but the number of other sequences or genomes added to that base set varies widely. You can view the sequence dictionaries I considered for yourself, but just eyeballing the data shows:
- The ENCODE consortium includes ERCC and phiX spike-ins sequences.
- The GDC adds many, many sequence decoys and virus sequences.
- A whole range of changes are made to the genome for TOPMed’s RNA-Seq pipeline.
Many tools in bioinformatics expect the sequence dictionary to be exactly the same across samples that are analyzed. This means that any analysis wishing to combine samples from multiple sources will have to rerun alignment from scratch to get things working. With the world generating more next-generation sequencing data than ever before (both in terms of quantity and data per sample such as deep whole-genome sequencing), this is increasingly problematic. For instance, we currently have 11,589 whole-genome BAM files in St. Jude Cloud. At a cost of 30-100 dollars per sample to process in the cloud (alignment + variant calling), you’re looking at a 500,000+ dollar bill to reprocess them. Suffice it to say, that’s not something most research labs can afford to do. Our goal on that project is to serve the community of bioinformatics wishing to use our data, so we think about this problem a lot.
- Migrating analysis pipelines from one version of the genome to another is a huge time sink for the community, so we shouldn’t take changing it lightly.
- The additional alternate haplotypes included in
GRCh38were a step forward, but these loci continue to complicate bioinformatics analyses, so they aren’t leveraged in the projects I surveyed.
- Though we learn more about the genome each year, the no alt analysis set isn’t updated with these findings. This leads to a lot of really interesting incongrueties.
- Here’s an example of one we considered for our RFC.
- The variety of non-standard sequences that are added to reference genomes across complicate cross cohort analysis.
- To my eye, this is most the most interesting point. Including all of these small, non-human sequences may be beneficial or even required depending on your use-case, but when weighing that against the cost introduced to those wishing to combine samples across projects, we have to ask “is it worth it”?
For takeaways, the only thought I have is whether it’s possible to unify the set of non-human sequences used across some major projects.
GRCh39 is indefinitely paused according to the notice currently on the Genome Reference Consortium’s page, but it was recently announced that a new multi-million dollar grant was awarded to expand the genetic diversity in the human genome. Hopefully these thoughts will be considered on our next pass.