To find structural variation, look at read pairs: introducing pARP
Next-generation sequencing is having a huge impact on how research is done in genomics. One way to discover structural variants in a genome, for example, is to create a clone library for an individual, sequence the ends of those clones, and then map those ends to the reference genome. Suppose that the clones in the library are all 150kb long; we would then expect the two ends of each clone to map about 150kb apart on the reference genome, in a forward/reverse orientation. Any read pair that does not follow this pattern might indicate a structural variation. There are of course numerous spurious mapping results, so we need to ignore those.
Suppose that the resulting data look like this:
1 1016287 1 1025027 FF 10
1 54809626 1 54814724 RR 20
1 65970649 1 67123551 DIST 32
1 143840263 1 143841351 RR 34
1 241524162 16 298176281 DIST 36
The first two columns are the chromosome and position of the first read of the pair; the third and fourth columns give the same for the second read. The fifth column is FF, RR or DIST: forward-forward, reverse-reverse or distance (i.e. >> 150kb). The last column is some arbitrary quality score assigned to the mapping of this read pair. Notice that the last of these lines shows a read pair where one end is mapped to chr1 and the other to chr16.
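As a rough illustration of the six-column format described above, here is a minimal Python sketch that parses such lines and picks out the pairs that span two chromosomes. It is not taken from pARP (which is written in Ruby); the names `ReadPair`, `parse_read_pair` and `is_interchromosomal` are made up for this example, and it assumes the columns are separated by whitespace.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    kind: str      # "FF", "RR" or "DIST"
    quality: int   # arbitrary mapping quality score


def parse_read_pair(line: str) -> ReadPair:
    """Parse one line, assuming six whitespace-separated columns."""
    chrom1, pos1, chrom2, pos2, kind, quality = line.split()
    return ReadPair(chrom1, int(pos1), chrom2, int(pos2), kind, int(quality))


def is_interchromosomal(pair: ReadPair) -> bool:
    """True when the two ends map to different chromosomes."""
    return pair.chrom1 != pair.chrom2


# The five example lines from this post.
SAMPLE = """\
1 1016287 1 1025027 FF 10
1 54809626 1 54814724 RR 20
1 65970649 1 67123551 DIST 32
1 143840263 1 143841351 RR 34
1 241524162 16 298176281 DIST 36
"""

pairs = [parse_read_pair(line) for line in SAMPLE.splitlines() if line.strip()]
between = [p for p in pairs if is_interchromosomal(p)]
print(len(pairs), "pairs,", len(between), "between chromosomes")
```

Chromosome names are kept as strings here so that names such as X or Y would parse as well.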
We can do two things: analyze and then create a picture, or create a picture and then interpret (see also one of my previous posts). In the first approach, you'd run a statistical analysis to see if certain regions have a higher prevalence of abnormally mapped read pairs. In the second, you plot the raw data and try to identify abnormalities by eye. Of course ideally you switch between both approaches.
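To make the first approach a bit more concrete, here is a deliberately crude Python sketch: it bins both ends of every abnormal read pair into fixed-size windows and reports the windows that collect at least a given number of ends. This is only a count threshold, not a proper statistical test, and the 1Mb window size, the function names and the cut-off are all assumptions chosen for illustration.

```python
from collections import Counter

WINDOW = 1_000_000  # assumed bin size in bp; not a value taken from pARP


def count_by_window(pairs, window=WINDOW):
    """Count abnormal read-pair ends per (chromosome, window index)."""
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        counts[(chrom1, pos1 // window)] += 1
        counts[(chrom2, pos2 // window)] += 1
    return counts


def busy_windows(counts, min_count):
    """Windows holding at least `min_count` abnormal read-pair ends."""
    return sorted(key for key, n in counts.items() if n >= min_count)


# Positions of the five example read pairs from this post.
pairs = [
    ("1", 1016287, "1", 1025027),
    ("1", 54809626, "1", 54814724),
    ("1", 65970649, "1", 67123551),
    ("1", 143840263, "1", 143841351),
    ("1", 241524162, "16", 298176281),
]

counts = count_by_window(pairs)
print(busy_windows(counts, min_count=2))
```

With real data you would compare these counts against what is expected by chance rather than use a fixed cut-off.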
To visualize raw read pair information I've written a tool called pARP (Processing Abnormal ReadPairs), which is available from github. Its display is very similar to the one used in [edited] this paper by Hampton et al to show structural variation using Circos (see picture, taken from the circos website). But instead of just creating a static picture, pARP is meant to be an interactive tool to browse the data.

Below is a screenshot of pARP running on some test data. It doesn't look as nice as the above image, but remember that this is interactive and thus doesn't have minutes to calculate everything.

Some of the features:
- pARP can display abnormal readpairs (forward/forward, reverse/reverse or wrong distance), read depth and other features (e.g. segmental duplications).
- Circular display gives overview of between-chromosome mapped readpairs.
- Chromosomes can be dragged from the circular display to the upper or lower linear display to show (a) more detail and (b) within-chromosome aberrant readpairs (note: none in the image above).
- Visible readpairs can be filtered by quality score.
- Readpairs that are close to the mouse position are highlighted.
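The last two features in the list above boil down to simple selections over the read pairs. The Python sketch below shows one way to express them; it is not pARP's actual code, the function names are invented, and it assumes the mouse position has already been translated into a chromosome and a base-pair coordinate.

```python
def visible_pairs(pairs, min_quality):
    """Keep only read pairs whose quality score reaches the threshold."""
    return [p for p in pairs if p[5] >= min_quality]


def near_position(pairs, chrom, pos, radius):
    """Read pairs with at least one end within `radius` bp of (chrom, pos)."""
    hits = []
    for p in pairs:
        chrom1, pos1, chrom2, pos2 = p[:4]
        end1_close = chrom1 == chrom and abs(pos1 - pos) <= radius
        end2_close = chrom2 == chrom and abs(pos2 - pos) <= radius
        if end1_close or end2_close:
            hits.append(p)
    return hits


# The five example read pairs from this post:
# (chrom1, pos1, chrom2, pos2, type, quality)
pairs = [
    ("1", 1016287, "1", 1025027, "FF", 10),
    ("1", 54809626, "1", 54814724, "RR", 20),
    ("1", 65970649, "1", 67123551, "DIST", 32),
    ("1", 143840263, "1", 143841351, "RR", 34),
    ("1", 241524162, "16", 298176281, "DIST", 36),
]

shown = visible_pairs(pairs, min_quality=30)
highlighted = near_position(shown, chrom="1", pos=66_000_000, radius=100_000)
print(len(shown), "shown,", len(highlighted), "highlighted")
```

A linear scan like this is fine for a handful of pairs; with a large dataset an interactive tool would probably want an index sorted by position.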
Prefiltering of the data should be minimal, and aimed only at getting the amount of data down. For example, the readpair data file could contain all normal readpair mappings, but removing those makes the display much clearer and reduces the amount of data to be loaded by several orders of magnitude (obviously...).
The version just released (tagged v0.8) is workable, but not ready for prime time yet. At the moment the user has to run the tool using jruby instead of just loading it as an applet. Also, the names of the files to be loaded have to be changed in the parp.rb code itself. I hope to add functionality so that you can upload your own data into an applet, or use a URI to link to it, but I can't promise anything because other work is waiting. So here's also a call for help: if you're interested in contributing, please do! There's a "features-yet-to-be-implemented" list below.
Features not yet implemented:
- pARP should be available as an applet/application.
- User should be able to point to files or URIs representing files instead of changing filenames in the code itself.
- Saving an image to disk (also from the applet).
- Further performance improvements.
- Fixing of not-yet-identified-but-definitely-present bugs.
And now for some technical stuff. To keep redrawing times low so that the interaction wouldn't suffer too much from the huge amount of data, I had to use a few tricks. First of all, pARP makes heavy use of buffers. Different parts of the image are stored in different buffers. When the user interacts with the display, only the relevant buffers are updated while the others are left untouched. For more info, see the github wiki page on the subject. Secondly, I've found out how to use ruby threads to load some data asynchronously. The read depth data in particular can be a huge performance hog; there are >6 million datapoints for a genome window size of 500bp. So what happens is that (a) read depth data for a chromosome is only loaded when that chromosome is displayed in the linear part of the image, and (b) the read depth data is drawn onto a separate buffer that is only displayed when the thread is finished.
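The second trick, loading lazily in a background thread and only showing the result once it is complete, can be sketched as follows. This is a Python illustration of the general idea, not pARP's Ruby code: the `ReadDepthLayer` class and the `loader` callable are hypothetical, and a plain list stands in for the off-screen image buffer.

```python
import threading


class ReadDepthLayer:
    """Loads read depth for one chromosome lazily, in the background."""

    def __init__(self, chrom, loader):
        self.chrom = chrom
        self._loader = loader   # hypothetical callable: chrom -> list of depths
        self._buffer = None     # stays None until loading has finished
        self._thread = None
        self._lock = threading.Lock()

    def request(self):
        """Start loading the first time the chromosome is shown linearly."""
        with self._lock:
            if self._thread is None:
                self._thread = threading.Thread(target=self._load, daemon=True)
                self._thread.start()

    def _load(self):
        data = self._loader(self.chrom)   # slow part, runs off the draw loop
        with self._lock:
            self._buffer = data           # publish the finished buffer at once

    def ready_buffer(self):
        """Return the finished buffer, or None while still loading."""
        with self._lock:
            return self._buffer

    def wait(self, timeout=None):
        """Block until loading is done (only used here for the demo)."""
        if self._thread is not None:
            self._thread.join(timeout)


def fake_loader(chrom):
    # Stand-in for reading a real read depth file.
    return [0] * 10


layer = ReadDepthLayer("1", fake_loader)
layer.request()   # e.g. when chromosome 1 is dragged to a linear display
layer.wait()
print(len(layer.ready_buffer()), "datapoints loaded")
```

In a draw loop you would call `ready_buffer()` on every frame and simply skip the read depth track while it returns `None`, so the rest of the display stays responsive.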
Many thanks to:
- Ben Fry and Casey Reas for Processing
- Jeremy Ashkenas for the ruby API to Processing
Update: reference changed for Circos picture



3 comments:
That looks great Jan.
Wow - really nice!
My lab is responsible for the paper that the first visualization comes from. (http://genome.cshlp.org/content/early/2008/12/09/gr.080259.108.abstract) People might be interested to know that the first image is created using a package called Circos. It's open source and available here:
http://mkweb.bcgsc.ca/circos/
Most of our stuff is in Ruby, and we're heavily invested in doing paired end stuff, so I'm pretty excited to try out pARP. I'm sure I (and others in my lab) will have some feedback for you soon.
Thanks for the correction, Chris. I was wondering where that picture from FlowingData came from...
Any feedback most welcome! You can always contact me directly with that if you want.