Received an email this week from the Sanger helpdesk that they installed a test hadoop system on the farm with 2 nodes. Thanks guys! First thing to do, obviously, was to repeat the streaming mapreduce exercise I did on my own machine (see my previous post). The only difference from my local setup is that this time I had to handle HDFS.

As a recap from my previous post: I'll be running an equivalent of the following:

cat snps.txt | ruby snp_mapper.rb | sort | ruby snp_reducer.rb
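
The actual snp_mapper.rb and snp_reducer.rb are in the previous post and are not reproduced here. As a rough illustration of what the pipeline above does, here is a minimal sketch of the same map, sort, reduce flow in plain Ruby. Everything in it is an assumption made for the example: the input layout (tab-separated SNP id, chromosome, position), the choice to count SNPs per chromosome, and the method names map_lines, reduce_lines and run_pipeline.

```ruby
# Hypothetical stand-ins for snp_mapper.rb and snp_reducer.rb.
# Assumed input: one SNP per line, "snp_id<TAB>chromosome<TAB>position".

# Map step: emit one "key<TAB>1" line per input record,
# using the chromosome as the key (an assumption for this sketch).
def map_lines(lines)
  lines.map do |line|
    _id, chromosome, _position = line.chomp.split("\t")
    "#{chromosome}\t1"
  end
end

# Reduce step: sum the values of consecutive lines that share a key.
# This only works if the input is sorted by key, which is what
# `sort` does in the shell pipeline.
def reduce_lines(sorted_lines)
  totals = []
  sorted_lines.each do |line|
    key, value = line.chomp.split("\t")
    if totals.last && totals.last[0] == key
      totals.last[1] += value.to_i
    else
      totals << [key, value.to_i]
    end
  end
  totals.map { |key, total| "#{key}\t#{total}" }
end

# The whole pipeline: mapper, then sort, then reducer.
def run_pipeline(lines)
  reduce_lines(map_lines(lines).sort)
end

# Made-up records, only to show the shape of the data.
example = ["rs1\t1\t100", "rs2\t2\t200", "rs3\t1\t300"]
puts run_pipeline(example)
```

The reducer never holds more than one key's running total at a time, which is why the sort step in between matters: in a streaming setup the reducer simply sees lines grouped by key.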

Setting up

First thing the Sanger wiki told me was to format my HDFS space:

hadoop namenode -format

This apparently only affects my own space... After that I could start playing with hadoop.

Photo by niv available from Flickr

I have long been interested in trying out mapreduce in my data pipelines. The Wellcome Trust Sanger Institute has several huge compute farms that I normally use, but they don't support mapreduce jobs. Quite understandable from IT's and the institute's point of view, because it's a mammoth task to keep those things running. But it also means that I can't set foot on the mapreduce path.

There are options to run mapreduce on your own, however.

Worked a couple of days on pARP, the circular genome browser, and I think it's ready to be tested out by others. Consider this an alpha release: expect a lot of issues. It's easy to create regions with a negative length, for example. Also, I haven't focused yet on user-friendliness or general input files. Ways of interaction are not made clear to new users yet, and the input files still need to have fixed names and be stored in a particular folder.

"Contigs should not know where they are." That's a phrase uttered by James Bonfield when presenting his work on gap5, the successor to gap4, a much-used assembly software suite. So you think: "Wait a second: you're talking about assembly, and the contigs should not store their position?"

This statement addresses a problem that we encounter often when working with genomic data: how to handle features.

Back before the human genome was fully sequenced and NCBI, UCSC and Ensembl started working on visualization, it made a lot of sense to go for linear representations and use tracks for annotation. After all: chromosomes are linear. Using different tracks to show different types of annotation is the next logical step.

But there is not just one human genome on earth; according to Wikipedia there are about 6.76 billion copies as of March 2009.

Image by Danny McL via Flickr

There’s been quite a lot of discussion going on lately about author identification: Raf Aerts’ correspondence piece in Nature (doi:10.1038/453979b), discussions on FriendFeed, ... The issue is that it can be hard to identify who the actual author of a paper is if their name is very common. If your name is Gudmundur Thorisson (“hi, mummi”) you’re in luck. But if you are a Li Y, Zhang L or even an Aerts J it’s a bit harder.

Nextgen sequencing is making a huge impact on how research is done in the genomics field. One of the ways to discover structural variants in a genome, for example, is to create a clone library for an individual, sequence the ends of those clones and then map those ends to the reference genome. Suppose that the clones in the library are all 150kb long; we would then expect the ends of each clone to be mapped about 150kb from each other on that reference genome, in a forward/reverse direction.
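
To make that expectation concrete, here is a minimal Ruby sketch that checks whether a mapped pair of clone ends looks the way a 150kb clone should. The 150kb figure comes from the text; the 15kb tolerance, the CloneEnds struct, the classify method and the result labels are all made up for this illustration and are not taken from any real pipeline.

```ruby
EXPECTED_CLONE_SIZE = 150_000 # from the text: clones of about 150kb
TOLERANCE = 15_000            # hypothetical allowed deviation

# Hypothetical record for one clone: where each end mapped on the reference.
CloneEnds = Struct.new(:name, :chr1, :pos1, :strand1, :chr2, :pos2, :strand2)

# Compare a mapped clone against the expected distance and orientation.
def classify(clone, expected: EXPECTED_CLONE_SIZE, tolerance: TOLERANCE)
  return :different_chromosome if clone.chr1 != clone.chr2

  # Order the two ends by position so "left" is the lower coordinate.
  left, right = [[clone.pos1, clone.strand1],
                 [clone.pos2, clone.strand2]].sort_by(&:first)

  # Expected orientation: left end forward, right end reverse.
  return :unexpected_orientation unless left[1] == '+' && right[1] == '-'

  distance = right[0] - left[0]
  return :farther_than_expected if distance > expected + tolerance
  return :closer_than_expected if distance < expected - tolerance

  :concordant
end

# Made-up clone, only to show usage.
clone = CloneEnds.new('clone_a', '1', 1_000, '+', '1', 151_000, '-')
puts classify(clone)
```

Clones that do not come back as :concordant are the interesting ones: their ends map in a way that disagrees with the reference, which is the signal this kind of analysis looks for.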
Welcome
Hi there, and welcome to SaaienTist, a blog by me, for me and you. It started out long ago as a personal notebook to help me remember how to do things, but evolved to cover more opinionated posts as well. After a hiatus of 3 to 4 years (basically since I started my current position in Belgium), I'm resurrecting it to help me organize my thoughts. It might or might not be useful to you.

Why "Saaien tist"? Because it's pronounced as 'scientist', and means 'boring bloke' in Flemish.