Sunday, 8 March 2009

The good and bad of genome viewers

Back before the human genome was fully sequenced and NCBI, UCSC and Ensembl started working on visualization, it made a lot of sense to go for linear representations and use tracks for annotation. After all: chromosomes are linear. Using different tracks to show different types of annotation is the next logical step.

But there is not just one human genome on earth; according to Wikipedia there are about 6.76 billion copies as of March 2009. So instead of talking about "the human genome" in those browsers, we talk about "the reference genome". Each person on earth is different, and so is each human genome. (That's putting the reasoning on its head, but never mind.)

Differences between humans such as SNPs and microsatellites can still be shown in the track-based browsers.

Things get more difficult when you're looking at structural variation. Structural variation messes up the backbone of the linear genome browser: you can't show differences between individuals in one straight line. Suppose you want to investigate a copy-number variation (CNV) and consult UCSC. You'd find tracks such as this:

Although this does give you a fair amount of information on the CNV in question, it's not an adequate representation of what the different alleles actually look like. It also highlights another issue: the concept of "the reference genome". As more and more genomes are sequenced, is the one that was picked first really the best choice for visualization and, indeed, as the reference? To be able to handle the different MHC haplotypes in Ensembl, for example, the database contains a table called "assembly_exception" that records the alternative assemblies for each haplotype.

I believe that further down the line (although it might be quite a while) we might need to forget the whole notion of a reference genome. Two options come to mind. First of all, we could create an artificial reference that contains all sequence and let each real sequence we want to look at, well, reference that artificial assembly. That would mean that the different MHC haplotypes, for example, would all be in the same sequence. Similarly, a copy-number variant with, say, 3 to 8 copies would be represented by all 8 copies in the mock assembly. Unfortunately this still cannot cover structural variation like inter-chromosomal translocations. We can't build a single artificial assembly that would incorporate those.

So here's the alternative: de Bruijn graphs. Instead of creating a single linear representation of a reference, let's just not. We could use building blocks to build up each individual. Take a look at this picture:

Suppose that each block is a part of a chromosome and the red and blue lines represent the path to follow to build up the chromosome for a particular individual. In this picture the red individual lacks a part of that chromosome that is present in the blue individual, and another part is inverted. Notice that we don't make any (arbitrary) decision on which is the reference sequence. By dragging the blocks we can either place all red connections on one line or all blue ones, making them look like a reference.
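The block-and-path idea above can be written down as a small data structure. What follows is only a sketch under stated assumptions: the block names `A` to `D`, the two example paths and the helper functions are all made up for illustration, and each block is assumed to be visited at most once per individual (so copy-number variation is not modelled here).

```python
from typing import List, Set, Tuple

# One step of a path: (block_id, orientation); '+' is forward, '-' is inverted.
Step = Tuple[str, str]
Path = List[Step]

# Hypothetical example data, loosely following the picture described above:
# red lacks block B and traverses block C in the opposite orientation.
red: Path = [("A", "+"), ("C", "-"), ("D", "+")]
blue: Path = [("A", "+"), ("B", "+"), ("C", "+"), ("D", "+")]


def blocks(path: Path) -> Set[str]:
    """Return the set of block ids visited by a path."""
    return {block for block, _ in path}


def missing_blocks(path: Path, other: Path) -> Set[str]:
    """Blocks that `other` visits but `path` does not."""
    return blocks(other) - blocks(path)


def inverted_blocks(path: Path, other: Path) -> Set[str]:
    """Blocks visited by both paths, but in opposite orientations.

    Assumes each block occurs at most once per path.
    """
    orientation_in_other = dict(other)
    return {
        block
        for block, orientation in path
        if block in orientation_in_other
        and orientation_in_other[block] != orientation
    }
```

With this data, `missing_blocks(red, blue)` gives `{"B"}` and `inverted_blocks(red, blue)` gives `{"C"}`. Neither path is treated as the reference: swapping the arguments simply describes the same differences from the other individual's point of view.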

If we then added annotations such as genes to this picture, we'd be able to display fusion genes. Suppose that the densely-striped block is on chromosome 7 in the red individual but on chromosome 12 in the blue one. If genes happen to span the breakpoints involved, we end up with a fusion gene.
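One way to spot such candidate fusions in a block graph is to check, at every junction along an individual's path, whether a gene is cut off on both sides of the junction. This is a minimal sketch, not a tested method: the annotation format (`gene_at_end`), the block names and the gene names are hypothetical.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]       # (block_id, orientation), '+' or '-'
BlockEnd = Tuple[str, str]   # (block_id, end), end is 'L' or 'R' in forward orientation

# Hypothetical annotation: which gene (if any) is cut off at a given block end.
gene_at_end: Dict[BlockEnd, str] = {
    ("A", "R"): "GENE1",
    ("B", "L"): "GENE1",   # GENE1 continues from block A into block B
    ("S", "L"): "GENE2",   # the "striped" block S starts inside GENE2
}


def exit_end(step: Step) -> str:
    """End of the block through which the path leaves it."""
    return "R" if step[1] == "+" else "L"


def entry_end(step: Step) -> str:
    """End of the block through which the path enters it."""
    return "L" if step[1] == "+" else "R"


def fusion_candidates(
    path: List[Step], genes: Dict[BlockEnd, str]
) -> List[Tuple[str, str]]:
    """Return (gene, gene) pairs joined across a junction between two blocks."""
    found = []
    for prev, nxt in zip(path, path[1:]):
        left = genes.get((prev[0], exit_end(prev)))
        right = genes.get((nxt[0], entry_end(nxt)))
        # Two different truncated genes meeting at a junction: candidate fusion.
        if left and right and left != right:
            found.append((left, right))
    return found
```

For a path `A+ B+` the junction joins GENE1 to itself, so nothing is reported; for `A+ S+` the function returns `[("GENE1", "GENE2")]`. Strand and reading frame are ignored here, so a reported pair is only a candidate.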

Time permitting, I'm going to investigate how useful this will be in practice, for example for the CNVs in the 1000 Genomes Project.


  1. You may be interested in a post on my blog from about a year ago, When stars align, where I discuss similar topics, e.g., IUPAC ambiguity codes, using graphs to represent a composite human reference, and cDNA sequence mapping.

  2. @dd: Nice post! I've seen presentations about Velvet here at work. It caught my interest because the pictures they showed (I didn't know about de Bruijn graphs yet) were exactly how I thought of structural variation. The only difference is that I propose them as a means of visualization rather than for assembly/alignment.

  3. Sure, this makes sense for simple rearrangements, like the ones you display, but we know that many genomes have "hotspots" of rearrangement. What do these pictures look like when you're viewing a 1 Mb region that has 20 fragments, with copy numbers of up to 10, fused with segments of 3 other chromosomes? (You get my point.)

    I'm not trying to dissuade you from working on this, as it's interesting. I'm just wondering how well this will scale.

  4. Chris, I have no idea yet how this will scale. After all, the structural variations I'm looking at right now are only the simple ones. Hopefully I'll look at the complex ones later in the year.
    I wonder how the complex structural variations relate to haplotypes. We might be able to compress a bunch of loci that are tightly linked in a haplotype into a single box of the de Bruijn graph. Don't know yet.

    Also: a de Bruijn graph can be laid out in such a way that whatever you arbitrarily select as the 'reference' sits nicely on one line, so that it resembles any linear browser. On that line, at least, we could superimpose genes and other features (which would highlight gene fusions, for example). Just by flipping which haplotype is the 'reference' we can have our genes and annotations on a linear track.
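
The layout idea in the last comment, putting whichever path is chosen as the 'reference' on one straight line, can be sketched as follows. This is a toy layout under stated assumptions rather than a real graph-drawing algorithm: the function name `layout`, the example paths and the coordinate scheme are hypothetical, and each block is assumed to occur only once in the chosen reference path.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, str]  # (block_id, orientation), '+' or '-'


def layout(
    paths: Dict[str, List[Step]], reference: str
) -> Dict[str, Tuple[float, int]]:
    """Assign an (x, y) position to every block.

    Blocks on the chosen reference path go on the line y = 0, in path order.
    Blocks that only occur in other paths go on y = 1, slightly to the right
    of the block that precedes them in that path.
    """
    pos: Dict[str, Tuple[float, int]] = {
        block: (float(i), 0) for i, (block, _) in enumerate(paths[reference])
    }
    for name, path in paths.items():
        if name == reference:
            continue
        last_x = -1.0
        for block, _ in path:
            if block in pos:
                last_x = pos[block][0]
            else:
                last_x += 0.5
                pos[block] = (last_x, 1)
    return pos


# Hypothetical example: red lacks block B and has block C inverted.
paths = {
    "red": [("A", "+"), ("C", "-"), ("D", "+")],
    "blue": [("A", "+"), ("B", "+"), ("C", "+"), ("D", "+")],
}
```

Calling `layout(paths, "blue")` puts all four blocks on the line `y = 0`, while `layout(paths, "red")` keeps `A`, `C` and `D` on that line and pushes `B` off it. Flipping the `reference` argument is all it takes to change which individual looks linear.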