Tuesday, 20 May 2008

Keeping track of things: using a labbook for bioinformatics

It's been a while since my last post. I left my last job, was unemployed for a month (while still chairing a session at a conference), and have just started my new position here at the Sanger Institute.

CS4 project - a page from my lab book

In every job you pick up some new things that can help you out later. One of the good ones from Roslin was how to keep a lab journal for bioinformatics. In the position before Roslin (at Wageningen University in the Netherlands), I remember having trouble recalling what I had done to my data. So I was really happy to see that at Roslin they had actually thought about those things...

So what's the problem?
There are strong parallels between bench-based lab work and computer-based data mangling. In both, you take some input (e.g. an eppendorf filled to the brim with DNA, or a data file downloaded from the internet) and perform some actions on it (e.g. PCR at such-and-such temperatures, or a grep followed by a sort and uniq) to get some output (e.g. an agarose-gel picture, or a number). In the wet-lab world, it's almost always mandatory to keep a lab journal in which you write down where you got the DNA from, which concentrations of which chemicals you used, and what voltage you used for running the gel. However, for people doing a little bit of scripting to get some data out of a big set of files, for example, there often is no such obligation. "I just played around with the data", you'll hear. But they will need a mighty good memory if they are to recall what they did after a couple of weeks. Bioinformaticians (i.e. those who manipulate data) have the same obligation as any other researcher: your work should be described in enough detail that other researchers can repeat the steps and get the same result.

Enter the SOP for bioinformatics written by my former PI (little wave to Andy). It has some really good suggestions for people involved in data handling, mangling and mining. In this post, I will try to highlight some of them. Note that this is not about application or API development, but about data. (I hope to write a later post about using svn and/or git for that.)

The central tool used for recording bioinformatics work at my previous job was RT (Request Tracker), a web-based ticketing system often used for helpdesks. However, I found it too big and too feature-rich for my own purposes, and decided to write a little application myself that does just what I need: the Simple Project Logger (sprolog; I can plug this on my own blog, right?). Mind that although I use it at the moment, it's still in alpha and full of bugs.

The main requirements of the recording workflow are:
  • If you get data from somewhere or someone other than yourself, record where or from whom you got it. Of course, there might be updates of the files you downloaded from that FTP server a couple of months ago, even though those files have the same name. To be able to tell afterwards, md5sums should be made of any downloaded files and of files that were sent to you by email.
  • Any mangling of the data should be recorded: stuff like "my_script.rb < input.txt > output.txt" and "grep 'abc' input.txt > output.txt".
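
The md5sum bookkeeping can be automated with a few lines of Ruby. This is just a sketch (the method name is mine, not sprolog's); it prints one line per file in the same "MD5 (name) = hash" format as the step records below:

```ruby
require 'digest/md5'

# Produce an "MD5 (filename) = hash" line for one file, matching the
# format used in the step records in this post.
def md5_line(path)
  "MD5 (#{File.basename(path)}) = #{Digest::MD5.file(path).hexdigest}"
end

# e.g. Dir.glob('his_data_chr*.txt').sort.each { |f| puts md5_line(f) }
```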

So how does that work in practice?
The sprolog application I wrote has the concepts of project, task and step. A project is a, well, a project. For example: "build my house" or "sprolog". A task is some distinct thing you have to do within the project, e.g. "place the roof" or "add authentication". Each task is then completed by a number of steps ("phoned contractor", "installed acts_as_authenticated").
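
The project/task/step hierarchy can be sketched as a tiny Ruby data model. This is purely an illustration of the concepts, not sprolog's actual code:

```ruby
# One project holds many tasks; each task is completed by a number of
# timestamped steps.
Step    = Struct.new(:recorded_at, :text)
Task    = Struct.new(:name, :steps)
Project = Struct.new(:name, :tasks)

project = Project.new('sprolog', [])
task    = Task.new('add authentication', [])
task.steps << Step.new(Time.now, 'installed acts_as_authenticated')
project.tasks << task
```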
When starting a new project, I give that project its own subdirectory under ~/Documents/Projects/. In turn, each task gets its own subdirectory within that project, named using the following convention: date + sprolog ticket number + short description (e.g. "20080513_T4-5_GenerateOligosForNewArray"). All work for that task is performed within that directory.
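
The naming convention above (date + sprolog ticket number + short description) is easy to capture in a small helper. A sketch; the helper name and signature are my own, not part of sprolog:

```ruby
require 'date'

# Build a task directory name following the convention
# date + ticket number + short description, inside a project directory.
def task_dir(project_dir, date, ticket, description)
  File.join(project_dir, "#{date.strftime('%Y%m%d')}_#{ticket}_#{description}")
end

# e.g. Dir.mkdir(task_dir('~/Documents/Projects/arrays',
#                         Date.new(2008, 5, 13), 'T4-5',
#                         'GenerateOligosForNewArray'))
```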
While performing the work, I copy/paste all necessary steps in sprolog. Typical steps look like this:

Step -> Recorded at Tue May 13 15:42:37 +0100 2008:

Saved attachment of John Doe as #{project_dir}/his_data.xls

Extracted tab-delimited version for each chromosome, changed newlines and added # before header.

MD5 (his_data_chr1.txt) = e5c38a91d8e5a666488863099fc5ef1c
MD5 (his_data_chr10.txt) = 9a702fb1f31bec42ec87089fc77efcc5
MD5 (his_data_chr11.txt) = e1a9a63e5c016cf93cb08ea6a5e425e5
MD5 (his_data_chr12.txt) = 8ab2bf7032f56df93b8b10c78bc2e1d4
MD5 (his_data_chr13.txt) = c1b2d609956edcf80657ed5f90b9469c
MD5 (his_data_chr14.txt) = fa6bfda1cd4e76f797ed8bd88d508448
MD5 (his_data_chr15.txt) = 46dbd8de0916dd69e81c519ac05671fe
MD5 (his_data_chr16.txt) = 302d920c6bef199a4bf40cfa2171348f
MD5 (his_data_chr17.txt) = 9aee0113c96f919c0603da3ccb9fca44

MD5 (his_data_chr8.txt) = e0d38c6804e39cae883dedfc648a2cda
MD5 (his_data_chr9.txt) = 94dfc7abc08d2b143f4eb13f29cadbdb
MD5 (his_data_chrUn.txt) = 5275595d7dfd4d4eb664e6bc9b08398c
MD5 (his_data_chrX.txt) = d90ba7f40b1019e1bbf981d894268dbc
MD5 (his_data_chrY.txt) = b991ff6f92869dcdc7b39da71d4d4b16

Step -> Recorded at Tue May 13 15:58:43 +0100 2008:

Venter: email from boss gives the conditions on how to select deletions in the reference genome.

Just to make sure I've understood correctly: if I want to identify
features for which there is >1kb of non-N sequence for which the
reference sequence has the allele, then I identify all sequences in the Excel file and filter on those that have >1000 non-N bases.

Wrote script filter_records.rb to run this filter.

Step -> Recorded at Wed May 14 10:41:53 +0100 2008:

Renamed filter_records.rb to filter_records_on_non_n_bases.rb

ruby ./filter_records_on_non_n_bases.rb > filtered_records.csv

Number of lines in output file: 4411

Next step: repeatmasking

Problem: we don’t have the HuRef sequences, so those have to be downloaded first.

Downloaded assembled HuRef chromosomes from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/ into ~/Documents/DataRepository/HuRef/

Step -> Recorded at Wed May 14 11:05:26 +0100 2008:

MD5 (hs_alt_HuRef_chr1.fa.gz) = 684e628536fa87b96343f1fea6219328
MD5 (hs_alt_HuRef_chr10.fa.gz) = 02ec433e2b00811db98c77a2fff3d161
MD5 (hs_alt_HuRef_chr11.fa.gz) = cca2a7098ed4d706dc8af7c58a2b9807
MD5 (hs_alt_HuRef_chr12.fa.gz) = 244bfd0f3f26cc109132c5518b2a1fb3
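
As an aside: filter_records.rb itself isn't shown in these steps, but assuming the rule quoted from the email (keep only sequences with more than 1000 non-N bases), its core check might look something like this:

```ruby
# Guess at the central test in the filter script: a sequence passes if
# it has more than `threshold` bases that are not N.
def enough_non_n_bases?(sequence, threshold = 1000)
  (sequence.length - sequence.upcase.count('N')) > threshold
end
```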

Lab journal
Once a considerable number of steps has been performed or a task is completed, I print them out and glue them into a paper lab journal. That might look like a waste of paper and completely unnecessary, because you've got everything in electronic format anyway. I might change that habit later, but for the moment I just like browsing through physical pages when I need to know what I did, rather than having to look at a screen. It's also easy to add annotations on those paper pages.

Note: if anyone is interested in helping develop sprolog, please let me know.

UPDATE: sprolog is now hosted on github at http://github.com/jandot/sprolog. Development on rubyforge will stop. Get your own copy by cloning it:
git clone git://github.com/jandot/sprolog.git

Note: picture taken from http://www.flickr.com/photos/cdnphoto/301083106/


  1. Interesting stuff, particularly the md5 idea. You might also want to check out the "Getting Things Done" productivity framework by David Allen which parallels your project/task/step breakdown.

    Personally, I find using literate programming techniques such as Sweave to be very useful in log creation. If R isn't your poison, perl/python etc solutions exist.

  2. Good article.
    I use git or subversion to keep track of changes, even on data and results.
    At the moment I am using Makefiles to describe which programs and options I have run to produce my results, and I am looking for some other similar tool with a simpler syntax.

  3. @gioby: As an alternative to Makefiles, I use rake, which does exactly the same but with Ruby syntax (so simpler :-). dgtized and I have actually extended rake a bit so that it can handle timestamps for tasks that are not file-related (so-called "events"). See http://github.com/jandot/biorake

  4. Hi Jan,
    Even if you're not an emacs user, have a look at Org-mode with Org-babel. Org is a system for working with plain-text, hierarchically structured documents for project/time/task management, HTML/LaTeX authoring, etc. Org-babel is an extension of Org for working with source code embedded in the Org document; the code can be executed. One design aim was a sort of live lab notebook for computational research, and I use it as my working environment for my bioinformatics/computational biology work.