Ryo Sakai reminded me a couple of weeks ago about Simon Sinek's excellent TED talk "Start With Why - How Great Leaders Inspire Action", which inspired this post... Why do I do what I do?

The way data can be analysed has been automated more and more in the last few decades.

"I'll do Angelina Jolie". Never thought I'd say that phrase while talking to well-known Belgian cartoonists, and actually be taken seriously.

Backtrack about one year.

We could still use more applicants for this position, so bumping the open position...

SymBioSys is a consortium of computational scientists and molecular biologists at the University of Leuven, Belgium focusing on how individual genomic variation leads to disease through cascading effects across biological networks (in specific types of constitutional disorders and cancers).

Since the publication of the human genome sequence about a decade ago, the popular press has reported on many occasions about genes allegedly found for things ranging from breast size, intelligence, popularity and homosexuality to fidgeting.

Bit of a technical post for my own reference, about visualization and scripting in clojure.

Clojure and visualization

Being interested in clojure, a tweet by Francesco Strozzi (@fstrozzi) caught my attention last week: "A D3 like #dataviz project for #clojure. Codename C2 and looks promising.

Finally time to write something about the biovis/visweek conference I attended about a week ago in Providence (RI)... And I must say: they'll see me again next year. (Hopefully @infosthetics will be able to join me then).

I was invited last week to give a talk at this year's meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day with great talks, by some bright PhD students and postdocs.

Last Friday I received my long-anticipated copy of "Visualize This" by Nathan Yau. On its website it is described as a "practical guide on visualization and how to approach real-world data".

UPDATE: I encountered a blog post by Martin Theus describing a very similar approach for looking at this same data (see here).

Disclaimer 1: This is a (very!) quick hack. No effort was put in it whatsoever regarding aesthetics, interactivity, scaling (e.g. in the barcharts), ...

A couple of days ago I bumped into this tweet by Benjamin Wiederkehr (@datavis): "Article: TenderNoise http://datavis.ch/q9pIxq" It describes a visualization by Stamen Design and others displaying noise levels at different intersections in San Francisco.

Preamble: It's been very quiet on this blog since I left the Wellcome Trust Sanger Institute in the UK and took my position here at Leuven University in Belgium last October.

Has been a while (again) since my last post. It seems that the requirements on my time are just a little bit different from during my previous position... But I'd like to share a little bit about the VizBi conference that I attended 2 weeks ago.

As a colleague of mine said a couple of weeks ago: "if you don't publish it, it didn't happen". Scientific publications are the currency to advance a researcher's career.

A lot of the work I do involves extracting data from VCF files ("Variant Call Format"; see http://bit.ly/apUbi8). It's tab-delimited but not quite: some of the columns contain structured data rather than just a value, and the format of these columns might even be different for every single line.
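To make the "tab-delimited but not quite" point concrete, here is a minimal ruby sketch of unpacking one VCF data line. The method name parse_vcf_line and the example line are illustrative assumptions, not code from the original post; the sketch only handles the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and ignores header lines.

```ruby
# Minimal sketch: split one VCF data line and unpack its structured columns.
# parse_vcf_line is a hypothetical helper name used for illustration only.
def parse_vcf_line(line)
  chrom, pos, id, ref, alt, qual, filter, info, format, *samples =
    line.chomp.split("\t")

  # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
  info_hash = info.split(";").each_with_object({}) do |entry, hash|
    key, value = entry.split("=", 2)
    hash[key] = value.nil? ? true : value
  end

  # FORMAT names the colon-separated fields of each sample column,
  # and it can differ from one line to the next.
  keys = format.to_s.split(":")
  genotypes = samples.map { |sample| Hash[keys.zip(sample.split(":"))] }

  { :chrom => chrom, :pos => pos.to_i, :id => id, :ref => ref,
    :alt => alt.split(","), :qual => qual, :filter => filter,
    :info => info_hash, :genotypes => genotypes }
end

# Made-up example line in the standard column layout.
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\tGT:GQ:DP\t0|0:48:1\t1|0:48:8"
record = parse_vcf_line(example)
```

Because the FORMAT keys are read per line, the sample columns are decoded correctly even when two consecutive lines use different layouts.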

Just a short note...

Even though my position in Leuven only starts in October, I've already been involved in writing and defending a major grant.

I have been a bit frustrated lately by the fact that for many of my analyses I have to write a ruby script to mangle my data first, then resort to R to add a statistic to each of the datapoints, go back to ruby to mangle the result, repeat, rinse, and finally make plots in R.

I should create an online labbook with code examples of how I do things. Keep going back to an example script I have to copy/paste the code for handling different threads in ruby.
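The kind of snippet that would go in such a labbook might look like the following: a minimal sketch of a worker pool that spreads items over a few ruby threads. The method name process_in_threads and the default of four threads are assumptions made for illustration; this is not the example script the post refers to.

```ruby
require "thread" # provides Queue on older rubies; harmless on newer ones

# Minimal sketch: run the given block on every item, using a small pool
# of threads. process_in_threads is a hypothetical name.
def process_in_threads(items, thread_count = 4, &work)
  queue = Queue.new
  items.each { |item| queue << item }

  results = []
  lock = Mutex.new

  workers = Array.new(thread_count) do
    Thread.new do
      while true
        begin
          item = queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        value = work.call(item)
        lock.synchronize { results << value } # guard the shared array
      end
    end
  end

  workers.each(&:join) # wait for all workers to finish
  results
end

squares = process_in_threads([1, 2, 3, 4, 5], 2) { |n| n * n }
```

The results come back in whatever order the threads finish, so sort them (or carry the input along with each result) if order matters.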

Read the excellent post by Neil Saunders on using ruby and mongodb to archive his posts on FriendFeed, prompting me to finally write down my own experiences with mongodb. So here goes...

Let's have a look at the pilot SNP data from the 1000genomes project.

Received an email this week from Sanger helpdesk that they installed a test hadoop system on the farm with 2 nodes. Thanks guys! First thing to do, obviously, was to repeat the streaming mapreduce exercise I did on my own machine (see my previous post).

Photo by niv available from Flickr

I have long been interested in trying out mapreduce in my data pipelines. The Wellcome Trust Sanger Institute has several huge compute farms that I normally use, but they don't support mapreduce jobs.

Worked a couple of days on pARP, the circular genome browser, and I think it's ready to be tested out by others. Consider this an alpha release: expect a lot of issues. It's easy to create regions with a negative length, for example.

"Contigs should not know where they are." That's a phrase uttered by James Bonfield when presenting his work on gap5, the successor to gap4, a much-used assembly software suite.

Back before the human genome was fully sequenced and NCBI, UCSC and Ensembl started working on visualization, it made a lot of sense to go for linear representations and use tracks for annotation. After all: chromosomes are linear.

Image by Danny McL via Flickr

There’s been quite a lot of discussion going on lately about author identification: Raf Aerts’ correspondence piece in Nature (doi:10.1038/453979b), discussions on FriendFeed, ...

Nextgen sequencing is making a huge impact on how research is done in the genomics field. One of the ways to discover structural variants in a genome for example is to create a clone library for an individual, sequence the ends of those clones and then map those ends to the reference genome.

Image by Kaeru via Flickr

I've recently started using raw visualizations to get an idea of what data looks like rather than writing scripts to summarize. And what I found is that presenting data visually in a raw format might be more useful than condensing everything down into just a few numbers.

Today is "Data management, mining, curation and visualization" day at the Genome Informatics conference in Hinxton. It might be one of the more interesting ones for me, because that's what I do: manage, mine, curate and attempt to visualize. And I must say the last bit is the most difficult.

After investigating git for the bioruby project, I started using it on basically every project I run. And what do I use it for? Two things: keeping track of changes (duh) and syncing between server and laptop.

Disclaimer: This blog post is the result of several iterations of writing/discussion/rewriting from Anthony Underwood, Michael Barton, Matt Wood and myself, with additional help from Paul Thornthwaite.

Just a quick plug to see if I can find people interested in helping me out in some of my projects.

In the last 2 years, I started four open source projects (well: the last one was today...), each of which scratches my own itch and does what it needs to do for me.

It's been a while since my last post. Left my last job, was unemployed for a month (while still chairing a session at a conference), and just started my new position here at the Sanger Institute.

On every job you're able to pick up some new things that can help you out later.

Did you ever have data lying around that you couldn't figure out where you got it from?

You downloaded and imported data from an FTP site into your database ages ago and you actually want to use it now.

Seasoned programmers know this: testing should be an integral part of developing any script/program/software suite. Part and parcel is the unit test, where you test every little aspect of your program little by little.

One of the issues in a library like Bio::Graphics, is the plethora of glyph types that users will want. Here's a little showcase of what's provided by the library:

Features on a DNA sequence can be represented as filled boxes, open boxes, boxes with arrows, lines, triangles, ...

Saw this webcast a couple of weeks ago where Marcel Molina explains the notion of beautiful code. And I really recommend anyone writing code to have a look at it (totally irrespective of the fact he uses a ruby example...).

One of the main disadvantages of using ruby that I bump into is the absence of named arguments (or keyword parameters). That's no problem for methods taking just two or three arguments, but it does get confusing when you have to be able to pass more than that.
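A common workaround at the time was to pass an options hash as the last argument. The sketch below shows that idiom; the method draw_feature and its option names are made up for illustration and are not taken from any library mentioned on this blog. (Ruby 2.0 and later do have real keyword arguments, which make this idiom less necessary.)

```ruby
# Minimal sketch of the options-hash idiom for "named" arguments.
# draw_feature and its options are hypothetical names.
def draw_feature(name, options = {})
  defaults = { :start => 1, :stop => 100, :glyph => :box, :colour => "black" }

  # Reject misspelled option names instead of silently ignoring them.
  unknown = options.keys - defaults.keys
  unless unknown.empty?
    raise ArgumentError, "unknown option(s): #{unknown.join(', ')}"
  end

  opts = defaults.merge(options) # caller-supplied values override defaults
  "#{name}: #{opts[:glyph]} from #{opts[:start]} to #{opts[:stop]} in #{opts[:colour]}"
end

# Only the options that differ from the defaults need to be passed,
# in any order.
label = draw_feature("gene1", :stop => 250, :glyph => :arrow)
```

The call site now reads almost like named arguments, at the price of doing the validation by hand.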

A number of people asked me recently about the usability of ruby/bioruby and if it would be worthwhile for them to take the plunge and investigate bioruby more. So I thought writing up here would be a good idea...

Do you have some of those projects where you have to be sure that you jump through the same hoops every time you edit some code? Take a look at the bio-graphics code.

As a follow up to my post on Bio::Graphics, I tried integrating this library in a rails application. After all, you'd get your data either from a file (like GFF) or a database. And let me tell you: it took me just 30 minutes or so to get a proof-of-concept running.

Having known and used the Generic Genome Browser (aka gbrowse, see here) for years now, it occurred to me a while ago that it should be oh so simple to create the same functionality with a much easier setup if we could use ruby instead of perl.

Gbrowse depends on bioperl's Bio::Graphics module.

Modeling genetics or genomics data presents its own challenges. One of the issues is that the actual definitions of things change over time. A database system can only be based on the scientific knowledge at the time of conception.

"Joy to the world, lalaa la laaaa." I can finally announce that I've released the ruby API to the Ensembl core database under the bioruby-annex umbrella. Go here for the release.

Working on a ruby API for the Ensembl databases, I bumped into the issue of having to connect to a database without knowing its name.

The ensembl database server hosts databases for each species. Every two months or so, there's a new release which means a new database for every single species.
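One way to approach this, sketched below under stated assumptions: fetch the list of database names from the server (for example with SHOW DATABASES) and pick the one matching the species and release. Ensembl core databases follow the pattern species_core_release_assembly; the method name find_core_database and the example names are illustrative, and this is not necessarily how the API described in this post solves it.

```ruby
# Minimal sketch: pick the core database for a species and release from a
# list of database names. find_core_database is a hypothetical helper.
def find_core_database(database_names, species, release)
  prefix = species.downcase.tr(" ", "_") # "Homo sapiens" -> "homo_sapiens"
  pattern = /\A#{Regexp.escape(prefix)}_core_#{release}_\w+\z/
  database_names.find { |name| name =~ pattern }
end

# Illustrative names, as a SHOW DATABASES query might return them.
names = [
  "homo_sapiens_core_49_36k",
  "homo_sapiens_core_50_36l",
  "homo_sapiens_variation_50_36l",
  "mus_musculus_core_50_37c"
]

db_name = find_core_database(names, "Homo sapiens", 50)
```

The caller only has to know the species and the release number; the assembly suffix, which changes unpredictably, is matched by the pattern.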

A quick glance at the side of my desk reveals two stacks of manuscripts to read; each stack about 20cm high. Sounds familiar? There seems to be a major task in front of me to process all that.

First thing to do is to identify what caused those piles in the first place.

Welcome
Hi there, and welcome to SaaienTist, a blog by me, for me and you. It started out long ago as a personal notebook to help me remember how to do things, but evolved to cover more opinionated posts as well. After a hiatus of 3 to 4 years (basically since I started my current position in Belgium), I'm resurrecting it to help me organize my thoughts. It might or might not be useful to you.

Why "Saaien tist"? Because it's pronounced as 'scientist', and means 'boring bloke' in Flemish.