Photo by niv, available from Flickr

I have long been interested in trying out MapReduce in my data pipelines. The Wellcome Trust Sanger Institute has several huge compute farms that I normally use, but they don't support MapReduce jobs. That's quite understandable from IT's and the institute's point of view, because it's a mammoth task to keep those things running. But it also means that I can't set foot on the MapReduce path.

There are options to run MapReduce on your own, however. I received an email this week from the Sanger helpdesk saying that they had installed a test Hadoop system on the farm, with 2 nodes. Thanks, guys! The first thing to do, obviously, was to repeat the streaming MapReduce exercise I did on my own machine (see my previous post). The only difference from my local setup is that this time I had to handle HDFS.

As a recap from my previous post: I'll be running the equivalent of the following:

cat snps.txt | ruby snp_mapper.rb | sort | ruby snp_reducer.rb

Setting up

The first thing the Sanger wiki told me to do was to format my HDFS space:

hadoop namenode -format

This apparently only affects my own space... After that I could start playing with Hadoop.
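The mapper and reducer scripts themselves were covered in my previous post; as a reminder of the idea, here is a minimal sketch of such a pair. The file layout (tab-separated with the chromosome in the first column), the counting task, and the method names are my illustration here, not necessarily the original scripts:

```ruby
# Hypothetical sketch of a streaming mapper/reducer pair that counts
# SNPs per chromosome.

# Mapper step: emit one "chromosome<TAB>1" pair per input line
def map_snp(line)
  chromosome = line.chomp.split("\t").first
  "#{chromosome}\t1"
end

# Reducer step: sum the counts per key; the `sort` between mapper and
# reducer in the pipeline guarantees identical keys arrive together
def reduce_snps(sorted_pairs)
  counts = Hash.new(0)
  sorted_pairs.each do |pair|
    key, value = pair.split("\t")
    counts[key] += value.to_i
  end
  counts.map { |chrom, n| "#{chrom}\t#{n}" }
end

# Tiny inline stand-in for: cat snps.txt | mapper | sort | reducer
snps = ["1\t12345\tA/G", "2\t999\tC/T", "1\t777\tG/T"]
puts reduce_snps(snps.map { |l| map_snp(l) }.sort)
```

In streaming, the same scripts read STDIN and write STDOUT, so they work unchanged both in the local pipe and as a Hadoop job.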
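With HDFS available, the local pipe translates into a streaming job along these lines. The HDFS paths and the location of the streaming jar are assumptions that depend on the local install, so treat this as a sketch rather than the exact commands:

```shell
# Copy the input file from the local filesystem into HDFS
hadoop fs -put snps.txt snps.txt

# Run the streaming job; -file ships the scripts to the worker nodes.
# The jar path below is a guess typical of 0.20-era installs.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -input snps.txt \
  -output snp_counts \
  -mapper 'ruby snp_mapper.rb' \
  -reducer 'ruby snp_reducer.rb' \
  -file snp_mapper.rb \
  -file snp_reducer.rb

# Inspect the result, which lives in HDFS rather than locally
hadoop fs -cat snp_counts/part-00000
```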