<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-4867372421772813569</id><updated>2011-12-13T20:45:00.235+01:00</updated><category term='ruby'/><category term='incanter'/><category term='pARP'/><category term='data integration'/><category term='openresearchcomputation'/><category term='locustree'/><category term='visualization'/><category term='structural variation'/><category term='data management'/><category term='technical'/><category term='clojure'/><category term='organization'/><category term='deBruijn'/><category term='ActiveRecord'/><category term='graphics'/><category term='ucsc'/><category term='api'/><category term='mongodb'/><category term='bioinformatics'/><category term='mapreduce'/><category term='hadoop'/><category term='literature'/><category term='ensembl'/><category term='GTD'/><category term='annotation'/><category term='git'/><category term='opinion'/><category term='rails'/><category term='productivity'/><category term='testing'/><category term='genvizlab'/><category term='aws'/><category term='bioruby'/><category term='science'/><category term='database'/><title type='text'>Saaien Tist</title><subtitle type='html'>On data visualization, bioinformatics and personal productivity</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Jan 
Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>47</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4261116909706615258</id><published>2011-11-10T16:09:00.000+01:00</published><updated>2011-11-10T20:05:21.893+01:00</updated><title type='text'>Biovis/Visweek recap</title><content type='html'>Finally time to write something about the biovis/visweek conference I attended about a week ago in Providence (RI)... And I must say: they'll see me again next year. (Hopefully @infosthetics will be able to join me then).&amp;nbsp;Meanwhile, several blog posts are popping up discussing it (see &lt;a href="http://infosthetics.com/archives/2011/11/most_interesting_papers_at_infovis_visweek_2011.html"&gt;here&lt;/a&gt; and &lt;a href="http://blogger.ghostweather.com/2011/10/personal-take-on-infovis-2011.html"&gt;here&lt;/a&gt;, for example).&lt;br /&gt;&lt;br /&gt;This was the first time that &lt;b&gt;&lt;a href="http://www.biovis.net/"&gt;biovis&lt;/a&gt;&lt;/b&gt;&amp;nbsp;(aka the IEEE Symposium on Biological Data Visualization) was organized. It's similar to the 2-year old &lt;a href="http://www.vizbi.org/"&gt;vizbi&lt;/a&gt;, but has an agenda that is more focused on research in visualization rather than application of visualization in the biological sciences. 
Really interesting talks, posters and people.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The biovis contest&lt;/b&gt;&lt;br /&gt;This first installment of biovis included a data visualization contest, focusing on "specific biological problem domains, and based on realistic domain data and domain questions". The topic this year was eQTL (expression quantitative trait loci) data, and I'm really happy that &lt;b&gt;Ryo Sakai&lt;/b&gt; -&amp;nbsp;&amp;nbsp;now a PhD student in my lab - won the &lt;b&gt;"biologists' favourite" award&lt;/b&gt;!! The biologist jury was impressed with the ease with which his visualizations of the eQTL data highlighted and confirmed prior knowledge, and with how they suggested directions for further experiments. It was interesting to see that there was a huge variation in the submissions, going from just showing the raw data in an interesting way (which Sakai-san did) to advanced statistical and algorithmic munging of the input data and visualizing the end result (which the winner of the "dataviz professionals' favourite" award did). See how this relates to my previous post on &lt;a href="http://saaientist.blogspot.com/2011/10/humanizing-bioinformatics.html"&gt;humanizing bioinformatics&lt;/a&gt;?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Interesting talks - amazing (good &amp;amp; bad) talks&lt;/b&gt;&lt;br /&gt;As this was the first time that I attended visweek, I was really looking forward to the high-quality presentations/papers and posters. Overall, I got what I wanted. But there were some examples of papers and posters that I have major doubts about (taking into account that I have to be humble here in talking about people who have been working in the field for far longer than I have).&lt;br /&gt;One example that seemed pretty counterintuitive was a presentation by Basak Alper from Microsoft about a new set visualization technique that they baptized&amp;nbsp;&lt;b&gt;LineSets&lt;/b&gt;. 
The main issue that they want to solve is the visualization of intersections of &amp;gt;3 sets (up to 3 you'd just use Venn diagrams). Their approach is to connect the different elements from a set by a line; hence: linesets. However, I (and many others with me) felt that this approach has some very serious drawbacks. Most of all, it suggests that there is an implicit ordering of the elements, which there isn't. In the image below, for example, line sets were used to connect Italian restaurants (in orange) and Mexican restaurants (in purple). That's the only thing this visualization wants to do: tell me which of the restaurants are Italian and which are Mexican. But give this picture to 10 people, and every single one of them will think that the lines are actually paths or routes between these restaurants. Which they're not... The example below shows data that has specific positions on a map, but they demonstrate this approach&amp;nbsp;on social networks as well.&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-rdLntmO8klk/TrvettG6igI/AAAAAAAADf4/J_gDGkjQTdA/s1600/Screen+Shot+2011-11-10+at+15.02.06.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="320" src="http://3.bp.blogspot.com/-rdLntmO8klk/TrvettG6igI/AAAAAAAADf4/J_gDGkjQTdA/s320/Screen+Shot+2011-11-10+at+15.02.06.png" width="319" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;LineSets&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;Another example comes from the biovis conference: TIALA or &lt;b&gt;Time Series Alignment Analysis&lt;/b&gt;. Suppose you have the time-dependent gene expression values for a single gene, which you'd plot using a line plot. 
Now what would you do if you had that type of data for 100 genes? Would you put those plots into 3D? I know I wouldn't... And better still: would you then connect these plots so you end up with some sort of 3D-landscape? That's like connecting the tops of a barchart displaying categorical data with a line...&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-EvIu2-Jr7Hw/Trvf2G-Z5yI/AAAAAAAADgA/mmoypL8lU_Q/s1600/Screen+Shot+2011-11-10+at+15.29.18.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="218" src="http://3.bp.blogspot.com/-EvIu2-Jr7Hw/Trvf2G-Z5yI/AAAAAAAADgA/mmoypL8lU_Q/s320/Screen+Shot+2011-11-10+at+15.29.18.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;TIALA - Time Series Alignment Analysis&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;But of course there were &lt;b&gt;plenty of really good talks&lt;/b&gt; as well. 
Some of the talks I really enjoyed are those about &lt;b&gt;HiTSEE&lt;/b&gt; (by Bertini et al) on the analysis of high-throughput screening experiments, &lt;b&gt;EVEVis&lt;/b&gt; (Miller et al) on multi-scale visualization for dense evolutionary data, &lt;b&gt;arc length-based aspect ratio selection&lt;/b&gt; (Talbot et al) which is an alternative to banking to 45 degrees, &lt;b&gt;drawing road networks with focus regions&lt;/b&gt;&amp;nbsp;(Haunert et al), and especially &lt;b&gt;DICON&lt;/b&gt;&amp;nbsp;which showed an amazing application of visual analysis of multidimensional clusters using healthcare data.&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-nNjPqP0tbLI/TrvjyFhoFLI/AAAAAAAADgI/a4VlnMmMZk0/s1600/Screen+Shot+2011-11-10+at+15.45.35.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="200" src="http://1.bp.blogspot.com/-nNjPqP0tbLI/TrvjyFhoFLI/AAAAAAAADgI/a4VlnMmMZk0/s200/Screen+Shot+2011-11-10+at+15.45.35.png" width="190" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;HiTSEE&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-GiILyyg4e0k/TrvkS-WmS5I/AAAAAAAADgQ/0QvYYr4BLcA/s1600/Screen+Shot+2011-11-10+at+15.47.30.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="196" src="http://4.bp.blogspot.com/-GiILyyg4e0k/TrvkS-WmS5I/AAAAAAAADgQ/0QvYYr4BLcA/s200/Screen+Shot+2011-11-10+at+15.47.30.png" width="200" 
/&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;EVEVis&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-kuIVVUg_Z8Y/Trvlt_a5nfI/AAAAAAAADgY/4KeXIcOreN8/s1600/Screen+Shot+2011-11-10+at+15.54.18.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="165" src="http://3.bp.blogspot.com/-kuIVVUg_Z8Y/Trvlt_a5nfI/AAAAAAAADgY/4KeXIcOreN8/s320/Screen+Shot+2011-11-10+at+15.54.18.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Road networks with focus regions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-FyLFHrlmK7Y/TrvmCUqpKCI/AAAAAAAADgg/E3DQOLt9Xqw/s1600/Screen+Shot+2011-11-10+at+15.55.42.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="178" src="http://3.bp.blogspot.com/-FyLFHrlmK7Y/TrvmCUqpKCI/AAAAAAAADgg/E3DQOLt9Xqw/s200/Screen+Shot+2011-11-10+at+15.55.42.png" width="200" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;DICON - interactive visual analysis of multidimensional clusters&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;b&gt;Meeting interesting people&lt;/b&gt;&lt;br /&gt;But of course this was very much about meeting interesting people as well. 
It was really nice to exchange ideas again with the biovis crowd (Nils Gehlenborg, Cydney, Tamara, Will Ray, ...), and I finally had the chance to have a chat with @filwd Enrico. All those discussions with Thorri from Icelandic DataMarket were both useful and fun (as was our day hanging out in town, chatting to the Occupy Providence woman (forgot her name, I'm afraid) and trying to find a good hat).&lt;br /&gt;At the airport on my way back, as I was trying to find out how to get to Brussels (as our flights were cancelled due to the weather), a chap comes to me and introduces himself as someone from Belgium. From Leuven. From our very own faculty. So together with @infosthetics Andrew that now makes three of us :-)&lt;br /&gt;&lt;br /&gt;Anyway: I'll definitely be back next year (have to play some more official role anyway) and already looking forward to it.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4261116909706615258?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4261116909706615258'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4261116909706615258'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/11/biovisvisweek-recap.html' title='Biovis/Visweek recap'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://3.bp.blogspot.com/-rdLntmO8klk/TrvettG6igI/AAAAAAAADf4/J_gDGkjQTdA/s72-c/Screen+Shot+2011-11-10+at+15.02.06.png' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4891737196826652192</id><published>2011-10-19T17:12:00.000+02:00</published><updated>2011-10-19T17:12:54.069+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>Humanizing Bioinformatics</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-yJh-bWN2E04/Tpx63f2TK5I/AAAAAAAADdY/yt7JWjd6e30/s1600/remake_meander_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="197" src="http://4.bp.blogspot.com/-yJh-bWN2E04/Tpx63f2TK5I/AAAAAAAADdY/yt7JWjd6e30/s200/remake_meander_2.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;I was invited last week to give a talk at this year's meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day with great talks, by some bright PhD students and postdocs. There were 2 keynotes (one by Prof Bert Poolman from Groningen (NL) and one by myself), and a panel discussion on what the future holds for people nearing the end of their PhDs.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;My talk was titled "Humanizing Bioinformatics" and received quite well (at least some people still laughed at my jokes (if you can call them that); even at the end). 
I put the slides up on &lt;a href="http://www.slideshare.net/jandot/keynote-sfmbbm-2011"&gt;slideshare&lt;/a&gt;, but I thought I'd explain things here as well, because those slides will probably not convey the complete story.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Let's ruin the plot by mentioning it here: we need data visualization to counteract the &lt;b&gt;alienation&lt;/b&gt; that's happening between bioinformaticians and bright data miners on the one hand, and the user/clinician/biologist on the other. We need to &lt;b&gt;make bioinformatics human again&lt;/b&gt;.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Jim Gray from Microsoft wrote a very interesting book "&lt;b&gt;The Fourth Paradigm - Data-intensive Scientific Discovery&lt;/b&gt;". Get it. Read it. He describes how the practice of research has changed over the centuries. In the First Paradigm, science was very much about &lt;b&gt;describing things&lt;/b&gt;; the Second Paradigm (last couple of centuries) saw a more &lt;b&gt;theoretical&lt;/b&gt; approach, with people like Kepler and Newton defining "laws" that described the universe around them. The last few decades saw the advent of computation in the research field, which allowed us to take a closer look at reality by &lt;b&gt;simulating&lt;/b&gt; it (the Third Paradigm). But just recently - so Jim Gray says - we're moving into yet another fundamental way of doing science. We have moved into an age where there is just so much data generated that we don't know what to do with it. This Fourth Paradigm is that of &lt;b&gt;data exploration&lt;/b&gt;. 
As I see it (but that's just one way of looking at it, and it isn't meant to say anything about what's "better" than what), this might be a definition for the difference between computational biology and bioinformatics: &lt;b&gt;computational biology fits within the Third Paradigm&lt;/b&gt;, while &lt;b&gt;bioinformatics fits in the Fourth&lt;/b&gt;.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Being able to automatically generate these huge amounts of data (&lt;i&gt;e.g.&lt;/i&gt;&amp;nbsp;in genome sequencing) does mean that biologists have to work with &lt;b&gt;ever bigger datasets&lt;/b&gt;, using &lt;b&gt;ever more advanced algorithms&lt;/b&gt; that use &lt;b&gt;ever more complicated data structures&lt;/b&gt;. This is not about just some summary statistics anymore; it's support vector machine recursive feature elimination, manifold learning and adaptive cascade sharing trees and stuff. Result: the biologist is at a loss. Remember Dr McCoy in Star Trek saying "Dammit Jim, I'm a doctor, not an electrician/cook/nuclear physicist" whenever the captain made him do stuff that is - well - not doctorly? (Great analogy found by Christophe Lambert). It's exactly the same for a clinician nowadays. In order to do &lt;i&gt;a&lt;/i&gt; (his job: &lt;i&gt;e.g.&lt;/i&gt;&amp;nbsp;decide on a treatment plan for a cancer patient), he has to first do &lt;i&gt;b&lt;/i&gt;&amp;nbsp;(set up hardware that can handle the 100s of GB of data) and &lt;i&gt;c&lt;/i&gt;&amp;nbsp;(devise some nifty data mining trickery to get his results). Neither of which he has the time or training for. "&lt;b&gt;Dammit Jim, I'm a doctor, not a bioinformatician&lt;/b&gt;". Result: we're alienating the user. Data mining has become so complicated and advanced that the clinician is at a complete loss. Heck, I'm working at a bioinformatics department and don't understand half of what they're talking about. So what can the clinician do? 
His only option is to trust some bioinformatician to come up with some results. But this is a &lt;b&gt;blind trust&lt;/b&gt;: he has no way of assessing the results he gets back. This trust is even more blind than the one you give the guy who repairs your car.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;As I see it, there are (at least) four issues.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What's the question?&lt;/b&gt;&lt;br /&gt;Data generation used to be really geared towards proving or disproving a specific hypothesis. The researcher would have a question, formulate some hypothesis around it, and then generate data. Although that same data could already be used to answer other unanticipated questions as well, this really became an issue with easy, automated data generation; DNA sequencing being a prime example. You might ask yourself "does this or that gene have a mutation that leads to this disease?", but the data you generate (&lt;i&gt;i.c.&lt;/i&gt;&amp;nbsp;exome sequences)&amp;nbsp;to answer this question can be used to answer hundreds of other questions as well. You just don't know what questions yet...&lt;br /&gt;Statistical analysis and data mining are indispensable for (dis)proving hypotheses, but what if we don't know the hypothesis? 
Like many others in the field, I believe that data visualization can give us some clues about what to investigate further.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-b3fdQM1Mkjc/Tp05yssYYlI/AAAAAAAADdo/k0bgUeTZU70/s1600/network.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="281" src="http://1.bp.blogspot.com/-b3fdQM1Mkjc/Tp05yssYYlI/AAAAAAAADdo/k0bgUeTZU70/s400/network.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Let's look, for example, at this &lt;a href="http://mkweb.bcgsc.ca/linnet/"&gt;hive plot&lt;/a&gt; by Martin Krzywinski (for what B means: see the explanation at the hive plot website). Suppose you're given a list of genes in &lt;i&gt;E. coli&lt;/i&gt;&amp;nbsp;(or a list of functions in the linux operating system) and the network between those genes (or functions). Using clever visualization, we can define some interesting questions that we can look into using statistics or data mining. For example: why do we see so many workhorse genes in &lt;i&gt;E. coli&lt;/i&gt;? Does this reflect reality, and what would that mean? Or does it mean that our input network is biased? What is so special about that very small number of workhorse functions in linux that have that high connectivity? These are questions that need to be presented to us.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What parameters should I use?&lt;/b&gt;&lt;br /&gt;Second issue: the outcome of most data mining/filtering algorithms depends tremendously on the right parameters. But it can be very difficult to actually find out what those parameters should be. Does there actually &lt;i&gt;exist&lt;/i&gt;&amp;nbsp;a "right" set of parameters for this or that algorithm? 
Also, tweaking some arguments just a little bit can have vast effects on the results, while you can change other parameters as much as you want, but it won't affect the outcome whatsoever.&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-OLlSilr13Pw/Tp0-Ca4TEiI/AAAAAAAADdw/Kiqg6CA6G6M/s1600/drawing.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" height="196" src="http://1.bp.blogspot.com/-OLlSilr13Pw/Tp0-Ca4TEiI/AAAAAAAADdw/Kiqg6CA6G6M/s320/drawing.png" width="320" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;Turnbull &lt;i&gt;et al&lt;/i&gt;. Nature Genetics 2010&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;b&gt;Can I trust this output?&lt;/b&gt;&lt;br /&gt;Issue number 3: if I am a clinician/biologist and a bioinformatician hands me some results, how do I know if I can trust those results? Heck, being a bioinformatician myself and writing some program to filter putative SNPs, how do I know that my results are correct? 
Suppose there are 3 filters that I can apply consecutively, with different combinations of settings.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-bbzIlhqOWBc/Tp26PXKHq-I/AAAAAAAADd4/TbQ0gsQLPa4/s1600/filter.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="293" src="http://4.bp.blogspot.com/-bbzIlhqOWBc/Tp26PXKHq-I/AAAAAAAADd4/TbQ0gsQLPa4/s400/filter.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Looking at exome data, the main information that we can use for assessing the results of SNP filtering is that you should end up with 20k-25k SNPs and a transition/transversion ratio of 2.1 (if I remember correctly). But there are many different combinations of filters that can give these summary statistics. The state of the art (believe it or not) is to just run many different algorithms and filters independently, and then take the intersection of the results...&lt;br /&gt;&lt;br /&gt;&lt;b&gt;I can't wrap my head around this...&lt;/b&gt;&lt;br /&gt;And finally, there's the issue of too much information. Not just the sheer amount, but also the number of different data sources. It's actually not really too much information per se, but too much to keep in one head. Someone trying to decide on a treatment plan for a cancer patient, for example, will have to combine data from heterogeneous datasets, multiple abstraction levels and multiple sources. He'll have to look into patient and clinical data, family/population data, MR/CT/Xray scans, tissue samples, gene expression data and pathways. That's just too much. His &lt;b&gt;cognitive capacities are fully engaged in trying to integrate all that information&lt;/b&gt;, rather than in answering the initial question.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Visualization... 
part of the solution&lt;/b&gt;&lt;br /&gt;I'm not saying anything new here when I suggest that data visualization might be part of the solution to these problems. &lt;b&gt;As current technologies and analysis methods have alienated the end-user from his own results, visualization can reach over and cross this gap&lt;/b&gt;. The rest of the presentation is basically about some basic principles in data visualization, which I'll not go further into here.&lt;br /&gt;&lt;br /&gt;All in all, I think the presentation went quite well:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-uoFh4peNouA/Tp3Ak31RSoI/AAAAAAAADeA/o1VkFpn02IE/s1600/people-paying-attention.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-uoFh4peNouA/Tp3Ak31RSoI/AAAAAAAADeA/o1VkFpn02IE/s320/people-paying-attention.png" width="276" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4891737196826652192?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/4891737196826652192/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2011/10/humanizing-bioinformatics.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4891737196826652192'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4891737196826652192'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/10/humanizing-bioinformatics.html' 
title='Humanizing Bioinformatics'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-yJh-bWN2E04/Tpx63f2TK5I/AAAAAAAADdY/yt7JWjd6e30/s72-c/remake_meander_2.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-1401927340725551520</id><published>2011-09-01T10:27:00.003+02:00</published><updated>2011-09-01T10:27:46.625+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>Visualize This (by Nathan Yau) arrived...</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-c_Jq7L-Wrl4/Tl83YoXHVxI/AAAAAAAADdE/hOYKW-PgxtI/s1600/visualize-this-drop.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-c_Jq7L-Wrl4/Tl83YoXHVxI/AAAAAAAADdE/hOYKW-PgxtI/s320/visualize-this-drop.png" width="253" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Last Friday I received my long-anticipated copy of &lt;a href="http://book.flowingdata.com/"&gt;"&lt;b&gt;Visualize This&lt;/b&gt;" by Nathan Yau&lt;/a&gt;. On its website it is described as a "&lt;b&gt;practical guide on visualization and how to approach real-world data&lt;/b&gt;".&amp;nbsp;You can guess what my weekend looked like :-)&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Overall, I believe this book is a very &lt;b&gt;good choice&lt;/b&gt; for people interested in getting started in data visualization. 
Not only does it provide the context in which to create visualizations (chapters 1, 2 and 9), it also handles different &lt;b&gt;tools&lt;/b&gt; for creating them: R, protovis, flash.... Apart from chapter 3, which is dedicated entirely to that topic, different examples in the book were created using different tools, which gives people a good feel for what's possible in each and how "hard" or "easy" the coding itself is for the different options. Different chapters discuss &lt;b&gt;different types of data&lt;/b&gt; that you could encounter: patterns over time, proportions, relationships, ...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There were some minor points in the book that I'd mention if they asked me to review it (but that's just my opinion, and I don't want to pretend to be an expert). First of all, it would have been nice if Nathan had gone a little bit deeper into &lt;b&gt;theories behind what is seen as good visualization&lt;/b&gt;. In the first chapter ("Telling Stories with Data") he does mention Cleveland &amp;amp; McGill in a side-note, but I think that information (along with Gestalt laws, etc) definitely deserves one or two full paragraphs, if not half a chapter. I also don't completely agree with the use of a &lt;b&gt;stacked barchart&lt;/b&gt; (about page 109). In my experience, they're worth less than the time it takes to create them. After all, it's impossible to compare any series other than the one that is at the bottom (and therefore has a common "zero"-line). For example: look at the first picture below. This shows the number of "stupid things done" by women and men, stratified over 5 different groups (A-F). Although it is&amp;nbsp;easy to compare &lt;i&gt;total&lt;/i&gt;&amp;nbsp;stupidity per group (group C is doing particularly badly), as well as that for men, we can't see which of the groups A, D or F scores the worst for women. And that's because the bars for women don't have a common origin. 
We could of course put the women next to the men, but then we'd lose the total numbers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-kF6rbxUjeAg/Tl8-QBNgoiI/AAAAAAAADdI/QNQfKNef8J4/s1600/stacked_bar_bad.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="282" src="http://3.bp.blogspot.com/-kF6rbxUjeAg/Tl8-QBNgoiI/AAAAAAAADdI/QNQfKNef8J4/s400/stacked_bar_bad.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-F2BGR8rf_bw/Tl8-SET_UaI/AAAAAAAADdM/U5ojaecro9Q/s1600/stacked_bar_reformatted.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="282" src="http://4.bp.blogspot.com/-F2BGR8rf_bw/Tl8-SET_UaI/AAAAAAAADdM/U5ojaecro9Q/s400/stacked_bar_reformatted.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;In the second plot, however, it is &lt;b&gt;possible to compare women, men and totals&lt;/b&gt;. The bars for women are put next to those for men, but I've added a &lt;b&gt;shaded larger bar at the back&lt;/b&gt; that shows the sum of the two. This plot was originally created in R using &lt;b&gt;ggplot2&lt;/b&gt;, but I'm afraid I can no longer find the reference that explained how to do this... Let me know if you can find it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The content of the book is of course not world-shattering. But that's not the point of the book. For people new to the field it's a &lt;b&gt;great addition to their library&lt;/b&gt; (and I learned a thing or two myself as well). 
If you're interested in data visualization, go out and get it.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-1401927340725551520?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/1401927340725551520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2011/09/visualize-this-by-nathan-yau-arrived.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1401927340725551520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1401927340725551520'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/09/visualize-this-by-nathan-yau-arrived.html' title='Visualize This (by Nathan Yau) arrived...'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-c_Jq7L-Wrl4/Tl83YoXHVxI/AAAAAAAADdE/hOYKW-PgxtI/s72-c/visualize-this-drop.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4855308386442598863</id><published>2011-07-26T14:48:00.000+02:00</published><updated>2011-10-11T14:20:37.077+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>Visualizing the Tour de France</title><content type='html'>&lt;b&gt;UPDATE: I encountered a blog post by Martin Theus 
describing a very similar approach for looking at this same data (see&amp;nbsp;&lt;a href="http://www.theusrus.de/blog/tour-de-france-2011/"&gt;here&lt;/a&gt;).&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Disclaimer 1: This is a (very!) quick hack. No effort whatsoever was put into aesthetics, interactivity, scaling (e.g. in the barcharts), ... I just wanted to get a very broad view of what happened during the Tour de France (= biggest cycling event each year).&lt;br /&gt;Disclaimer 2: I don't know &lt;i&gt;anything&lt;/i&gt;&amp;nbsp;about cycling. It was actually my wife who had to point out to me which riders could be interesting to highlight in the visualization. But that also meant that this could become a way for me to learn something about the Tour.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-W-2Dcm3pUrk/Ti6vg8X1yZI/AAAAAAAADbw/vrzMBPc3whI/s1600/Screen+shot+2011-07-25+at+09.51.16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="289" src="http://2.bp.blogspot.com/-W-2Dcm3pUrk/Ti6vg8X1yZI/AAAAAAAADbw/vrzMBPc3whI/s640/Screen+shot+2011-07-25+at+09.51.16.png" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Data was copied from the Tour de France website (e.g. for the &lt;a href="http://www.letour.fr/2011/TDF/LIVE/us/100/classement/index.html"&gt;1st stage&lt;/a&gt;). The visualization was created in Processing.&lt;br /&gt;&lt;br /&gt;The &lt;b&gt;parallel coordinate plot&lt;/b&gt; shows the standings of all riders over all 21 stages. No data was available for stage 2, because that was a team time-trial (so discard that one). At the top is the rider who came first, at the bottom the one who came last. 
Below the coordinate plot are little &lt;b&gt;barcharts&lt;/b&gt; displaying the distribution of arrival times (in "number of seconds later than the winner") for all riders in that stage.&lt;br /&gt;&lt;br /&gt;The &lt;b&gt;highlighted riders&lt;/b&gt; are: Cavendish (red), Evans (orange), Gilbert (yellow), Andy Schleck (light blue) and Frank Schleck (dark blue).&lt;br /&gt;&lt;br /&gt;So what was I able to learn from this?&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Based on the barcharts you can guess &lt;b&gt;which stages were in the mountains&lt;/b&gt;, and which weren't. You'd expect the riders to become much more separated in the mountains than on the flat. In the very last stage in Paris, for example, everyone seems to have arrived in one big group, whereas for stages 12-14 the riders were much more spread out. So my guess (and that's confirmed by checking this on the TourDeFrance website :-) is that those were mountain stages.&lt;/li&gt;&lt;li&gt;You can see clear &lt;b&gt;groups of riders who behave the same&lt;/b&gt;. There is for example a clear group of riders who performed quite badly in stage 19 but much better in stage 20 (and badly in 21 again).&lt;/li&gt;&lt;li&gt;As the parallel coordinate plots were scaled according to the initial number of riders, we can clearly see how &lt;b&gt;people left the Tour&lt;/b&gt;&amp;nbsp;because the "bottom" of the later stages is empty.&lt;/li&gt;&lt;li&gt;We see that Cavendish (red) has a very erratic performance. And it seems to coincide with stages where the arrival times are spread out (= mountain stages?). This could mean that &lt;b&gt;Cavendish is good on the flats, but bad in the mountains&lt;/b&gt;. 
Question to those who know something about cycling: is that true?&lt;/li&gt;&lt;li&gt;Philippe Gilbert started well (both on the flats and in the mountains), but became more erratic halfway through the Tour.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4855308386442598863?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4855308386442598863'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4855308386442598863'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/07/visualizing-tour-de-france.html' title='Visualizing the Tour de France'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-W-2Dcm3pUrk/Ti6vg8X1yZI/AAAAAAAADbw/vrzMBPc3whI/s72-c/Screen+shot+2011-07-25+at+09.51.16.png' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4355336695332525985</id><published>2011-07-13T20:11:00.001+02:00</published><updated>2011-07-13T20:17:50.016+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>TenderNoise - visualizing noise levels</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-PTwPYb0BSFE/Th3fuKXS7dI/AAAAAAAADaM/2BrTGPV_3hk/s1600/Screen+shot+2011-07-13+at+20.10.21.png" imageanchor="1" style="margin-left: 
1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/-PTwPYb0BSFE/Th3fuKXS7dI/AAAAAAAADaM/2BrTGPV_3hk/s320/Screen+shot+2011-07-13+at+20.10.21.png" width="296" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;A couple of days ago I bumped into this tweet by Benjamin Wiederkehr (@datavis): "Article: &lt;b&gt;TenderNoise&lt;/b&gt; &lt;a href="http://datavis.ch/q9pIxq"&gt;http://datavis.ch/q9pIxq&lt;/a&gt;" It describes a visualization by Stamen Design and others displaying &lt;b&gt;noise levels at different intersections in San Francisco&lt;/b&gt;. They recorded these levels over a period of a few days in order to get an idea of auditory pollution. More information is &lt;a href="http://ybuffet.posterous.com/tendernoise-the-visualization-of-noise"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Although this particular visualization might be very useful for the people involved, I would like to explain some of the issues that I have with it, coming from a data-visualization-for-pattern-finding viewpoint.&lt;br /&gt;&lt;br /&gt;I think many things could be gleaned from this data that the current visualization does not reveal:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Is there a &lt;b&gt;relationship between the noise patterns at different intersections&lt;/b&gt;? Based on the graphic at the bottom, we can conclude that &lt;i&gt;on average&lt;/i&gt; the noise level goes down during the night and up during daytime, but it would be nice if the visualization gave an indication of any aberrant patterns as well. &lt;b&gt;Are there intersections that behave differently from others&lt;/b&gt;?&lt;/li&gt;&lt;li&gt;I don't see a real use for &lt;b&gt;changing the graphic over time&lt;/b&gt;. I suspect that &lt;b&gt;small multiples of area charts&lt;/b&gt; would work better to demonstrate the change over time (as e.g. the visual used &lt;a href="http://bit.ly/maFhwf"&gt;here&lt;/a&gt;). 
Using the current approach it is very difficult to see how particular intersections change over time because (a) the display changes and you lose temporal context, and (b) the resolution is so high that the blobs just flicker.&lt;/li&gt;&lt;li&gt;Concerning that flicker, it might be an option to &lt;b&gt;bin the data in larger time blocks&lt;/b&gt;. For calculating the value in each block different approaches should be investigated, like the average value, the maximum, the minimum, or the most extreme value (be it maximum or minimum, based on comparison with the average).&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;It'd be interesting to get hold of these data and work on some alternatives (given the time...)&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4355336695332525985?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4355336695332525985'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4355336695332525985'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/07/tendernoise-visualizing-noise-levels.html' title='TenderNoise - visualizing noise levels'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-PTwPYb0BSFE/Th3fuKXS7dI/AAAAAAAADaM/2BrTGPV_3hk/s72-c/Screen+shot+2011-07-13+at+20.10.21.png' height='72' 
width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-791666226597471655</id><published>2011-06-30T16:28:00.000+02:00</published><updated>2011-06-30T16:28:46.147+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>Why did I move into data visualization?</title><content type='html'>Preamble: It's been very quiet on this blog since I left the Wellcome Trust Sanger Institute in the UK and took up my position here at Leuven University in Belgium last October. Truth is that the type of work changed so profoundly that it takes a while to give it all a place in your head; let alone on a blog. Until I remembered this morning why I started this blog in the first place: to help me order my thoughts. So it might have sped things up instead, actually...&lt;br /&gt;&lt;br /&gt;Anyway... In this post I'd like to explain why I'm moving into the &lt;i&gt;data visualization&lt;/i&gt; field. And it's not just because it's always nice to look at pretty pictures.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Statistics are great, but...&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Ben Shneiderman, Professor at the University of Maryland, very eloquently stated that "&lt;b&gt;The great fun of information visualization is that it gives you answers to questions you didn't know you had&lt;/b&gt;". I'd rephrase that a bit to "&lt;b&gt;The great use of data visualization is that it gives you clues to questions you didn't know you had&lt;/b&gt;". To me, it's as much about finding the questions to ask as about finding the answers to those questions. Data visualization should not be used to "prove" things; that's what statistics is for. But the visualization can give you ideas on what statistical models to test. As do many others, I see a strong connection between statistics and data visualization. 
Taking a bit of a shortcut here, you could say that &lt;b&gt;statistics is about proving what you expect, while visualization is about discovering what you didn't expect&lt;/b&gt;&amp;nbsp;and refining those expectations.&lt;br /&gt;&lt;br /&gt;From my own experience, I've seen that many (but not all!) statisticians look down upon data visualization with the argument that it can't prove anything. That's true. But their reaction is then often to throw the baby out with the bathwater, instead of trying to see how both fields can benefit from each other. It's not always equally simple to convince people of the effectiveness of visualizations, but we're getting there...&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-vQNL4ATygg0/TgyHU1gYkJI/AAAAAAAADYw/mtNTxZcVjPA/s1600/statistics_vs_visualization.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="139" src="http://4.bp.blogspot.com/-vQNL4ATygg0/TgyHU1gYkJI/AAAAAAAADYw/mtNTxZcVjPA/s320/statistics_vs_visualization.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Explain and explore&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In the data visualization field, there is often a tension between &lt;b&gt;explanation&lt;/b&gt; and &lt;b&gt;exploration&lt;/b&gt;. The work I'll be doing here in Leuven will cover the whole spectrum. In the &lt;b&gt;explanation&lt;/b&gt; corner, the aim is to make sense of complex data: for example, helping cancer genetics researchers understand how tumours evolve (&lt;i&gt;e.g.&lt;/i&gt;&amp;nbsp;the phylogeny of cancer cells) or what the rearranged genome in those tumours looks like. This type of visualization sits downstream from the data analysis, after the data is churned. 
On the other hand, there are the &lt;b&gt;exploration&lt;/b&gt; projects, where we focus on showing the raw(ish) data to help us decide on what type of analysis to perform, for example for investigating parameter-space for an algorithm. Of course many projects will fit somewhere in the middle...&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;The visualization model&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Jarke van Wijk's paper "The Value of Visualization" (doi: 10.1109/VISUAL.2005.1532781) is a masterpiece in that it describes a comprehensive model of what visualization is and how we can quantify its effectiveness (cost). I'll just leave the picture here for you to contemplate:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-Jk8TSRbcSKU/TgyHqtq3hdI/AAAAAAAADY0/K_oMxbPQNd4/s1600/jarkevanwijk.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="168" src="http://1.bp.blogspot.com/-Jk8TSRbcSKU/TgyHqtq3hdI/AAAAAAAADY0/K_oMxbPQNd4/s320/jarkevanwijk.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;My inspiration&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are several people whose work I keep in mind when discussing what I want to do in my group; my sources of inspiration, so to speak. 
They have a very important thing in common: they don't take shortcuts in their work and are not afraid to really think about what their visualizations are intended to do.&lt;br /&gt;&lt;br /&gt;These include:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cydney Nielsen: &lt;a href="http://bit.ly/jXsi52"&gt;ABySS-Explorer&lt;/a&gt; - a sequence assembly visualization tool&lt;/li&gt;&lt;li&gt;Miriah Meyer: &lt;a href="http://bit.ly/maFhwf"&gt;Pathline&lt;/a&gt; - a tool for comparative functional genomics&lt;/li&gt;&lt;li&gt;Martin Krzywinski: &lt;a href="http://bit.ly/jh7zIo"&gt;Hive plots&lt;/a&gt; - rational network visualization&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;It's really exciting to work in this field; I'm looking forward to what the next few years will bring :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-791666226597471655?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/791666226597471655'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/791666226597471655'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/06/why-did-i-move-into-data-visualization.html' title='Why did I move into data visualization?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-vQNL4ATygg0/TgyHU1gYkJI/AAAAAAAADYw/mtNTxZcVjPA/s72-c/statistics_vs_visualization.png' height='72' 
width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-5041878402389528854</id><published>2011-03-28T12:23:00.003+02:00</published><updated>2011-10-18T20:33:24.735+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>VizBi 2011 - looking back</title><content type='html'>&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-kCMB08h84ug/Tp3Gb98G5aI/AAAAAAAADes/QWT4kFHwGwo/s1600/poster.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-kCMB08h84ug/Tp3Gb98G5aI/AAAAAAAADes/QWT4kFHwGwo/s1600/poster.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;It has been a while (again) since my last post. It seems that the demands on my time are just a little bit different from those in my previous position... But I'd like to share a little bit about the VizBi conference that I attended 2 weeks ago.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;This second installment of the &lt;a href="http://vizbi.org/"&gt;VizBi&lt;/a&gt; conference was held at the Broad Institute in Cambridge, MA. This workshop focuses on &lt;b&gt;visualizing biological data&lt;/b&gt; and harbours a new community of people with backgrounds ranging from bioinformatics and genomics to pure data visualization. 
Biological visualization is a very broad field, going all the way from &lt;b&gt;abstract data visualization&lt;/b&gt; to enable finding patterns in data, to &lt;b&gt;outreach&lt;/b&gt; for education, creating movies to explain things like DNA replication to the general public using Hollywood-studio tools (for example &lt;a href="http://www.molecularmovies.com/"&gt;molecularmovies&lt;/a&gt; and the work by &lt;a href="http://www.wehi.edu.au/education/wehitv/"&gt;Drew Berry&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Probably the best description of the relevance of this conference was given in the opening keynote by &lt;b&gt;Eric Lander&lt;/b&gt;, professor at MIT, one of the people who led the Human Genome Project and - among other things - co-chair of the President's Council of Advisors on Science and Technology in the Obama administration. Some of his quotes:&lt;br /&gt;- "&lt;b&gt;If all data fitted nicely within one clear paradigm, we wouldn't need VizBi.&lt;/b&gt;"&lt;br /&gt;- "Things that are beautiful have huge communicative value"&lt;br /&gt;- "Nowhere is visualization as important as in biology, because it's the bleeding edge right now, and has the messiest data and the messiest problems."&lt;br /&gt;&lt;br /&gt;And my favourite: "We need to work on exchange rates; if one picture is worth only a thousand words, we're screwed."&lt;br /&gt;&lt;br /&gt;&lt;div&gt;I can highly recommend having a look at &lt;b&gt;Tamara Munzner&lt;/b&gt;'s keynote on visualization principles, available &lt;a href="http://www.cs.ubc.ca/~tmm/talks.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;Slides and videos of the talks will be available on the vizbi.org website.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-5041878402389528854?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5041878402389528854'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5041878402389528854'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2011/03/vizbi-2011-looking-back.html' title='VizBi 2011 - looking back'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-kCMB08h84ug/Tp3Gb98G5aI/AAAAAAAADes/QWT4kFHwGwo/s72-c/poster.jpg' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-3371886398801521162</id><published>2010-12-13T10:28:00.003+01:00</published><updated>2010-12-13T10:30:02.001+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='openresearchcomputation'/><title type='text'>Open Research Computation - a new journal from BioMedCentral</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.openresearchcomputation.com/sites/10206/images/logo.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 266px; height: 70px;" src="http://www.openresearchcomputation.com/sites/10206/images/logo.gif" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a colleague of mine said a couple of weeks ago: "&lt;b&gt;if you don't publish it, it didn't happen&lt;/b&gt;". Scientific publications are the currency to advance a researcher's career. Looking for a new job? 
You'd better make sure your publication list is littered with first or second author papers in good (read: high impact factor) journals. Hoping to have your tenure track lead to tenure? Same story. Publish or perish.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;Meanwhile, many &lt;b&gt;bioinformaticians&lt;/b&gt; spend huge amounts of time developing software to make genetic or genomic research possible; research that just wouldn't happen if it were not for their custom-written tools, scripts and pipelines. Unfortunately, you often need the &lt;i&gt;find&lt;/i&gt; function of your web browser or PDF reader to be able to pinpoint the lone bioinformatician in the author list. He's not the first or second author; he works &lt;i&gt;in service of&lt;/i&gt; work by someone else.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;Just like many others (see &lt;i&gt;e.g.&lt;/i&gt; the &lt;a href="http://altmetrics.org/manifesto/"&gt;&lt;b&gt;alt-metrics&lt;/b&gt; manifesto&lt;/a&gt;), I feel quite strongly about the limitations of impact factors for judging a researcher's contribution to science. What about those papers that are published in "lower" journals but that are actually read more? What about blogs? What about your contributions on &lt;a href="http://friendfeed.com/the-life-scientists"&gt;FriendFeed&lt;/a&gt;, &lt;a href="http://seqanswers.com/"&gt;seqanswers&lt;/a&gt;, &lt;a href="http://www.quora.com/"&gt;quora&lt;/a&gt;, ...? How about all that software you wrote and that everyone can access on &lt;a href="http://github.com/jandot"&gt;github&lt;/a&gt;? Don't you think these also have a teeny-weeny bit of value in the scientific discourse?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;It will take time to change this skewed way of appreciating a researcher's work. 
We'll have to fight this from different angles, and subtly. We'll have to rethink what the term "publication" means, and how we can reference each other's work, including data that we made available in repositories and contributions to discussion forums. To start this process and give credit for the work of the computational scientist, BioMed Central launches a &lt;b&gt;new journal&lt;/b&gt; today: &lt;a href="http://www.openresearchcomputation.com/"&gt;&lt;b&gt;Open Research Computation&lt;/b&gt;&lt;/a&gt; with &lt;a href="http://cameronneylon.net/"&gt;Cameron Neylon&lt;/a&gt; at the helm as the editor-in-chief (and I'm grateful for having been asked to act as an editor). The aim of this journal is to provide a venue for programmers in research, who - as Cameron puts it in his accompanying blog post - "either see themselves as software engineers or as researchers who code". The focus of the journal will be on making the tools available to the wider community of research programmers, with strict guidelines on code quality, code reusability and documentation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;I can only suggest you read &lt;a href="http://bit.ly/openrescomp"&gt;Cameron's post&lt;/a&gt; and have a look at the www.openresearchcomputation.com website. The rest is up to you. 
Let's give the lone bioinformatician a venue to bring his work out into the open.&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-3371886398801521162?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/3371886398801521162'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/3371886398801521162'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2010/12/open-research-computation-new-journal.html' title='Open Research Computation - a new journal from BioMedCentral'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-364853970176312080</id><published>2010-08-12T18:08:00.005+02:00</published><updated>2010-08-12T20:46:09.207+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='incanter'/><title type='text'>VCF, tab-delimited files and bioclojure</title><content type='html'>&lt;p&gt;A lot of the work I do involves extracting data from VCF files ("Variant Call Format"; see &lt;a href="http://bit.ly/apUbi8"&gt;http://bit.ly/apUbi8&lt;/a&gt;).  
It's tab-delimited but not quite: some of the columns contain structured data rather than just a value, and the format of these columns might even be different for every single line.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;An example line (with the header):&lt;br /&gt;&lt;pre&gt;#CHROM POS   ID REF ALT QUAL   FILTER INFO                                            FORMAT   SAMPLE1&lt;br /&gt;1      12345 .  A   G   249.00 0      MQ=23.66;DB;DP=89;MQ0=26;LowMQ=0.2921,0.2921,89 GT:DP:GQ 1/1:89:99.00&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The INFO field is actually a list of tag/value pairs (except when it's just a tag), and the meaning of the data in the SAMPLE1 column is explained in the FORMAT column. Not only can different INFO tags be present on different lines, but the FORMAT can change line-by-line. Needless to say, this is quite a pain to parse. What if I want to know the distribution of the depth at which each SNP is covered (&lt;i&gt;i.e.&lt;/i&gt; the DP tag in the INFO field)?&lt;br /&gt;&lt;br /&gt;&lt;p&gt;So the first thing anyone does before working with such files is to &lt;b&gt;convert&lt;/b&gt; them to "real" tab-delimited: the INFO field is spread over multiple columns and the same goes for the data in SAMPLE1. There is a &lt;i&gt;&lt;b&gt;vcf2tsv&lt;/b&gt;&lt;/i&gt; script available in vcftools, but it only extracts part of the data in the INFO field; not all tags in the INFO field will be represented in the tab-delimited output file.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I've tried to write a &lt;i&gt;&lt;b&gt;vcf2tsv&lt;/b&gt;&lt;/i&gt; method in &lt;b&gt;clojure&lt;/b&gt; (see &lt;a href="http://clojure.com"&gt;clojure.com&lt;/a&gt; and &lt;a href="http://clojure.org"&gt;clojure.org&lt;/a&gt;). 
I'm now able to read/write VCF files without the tsv-intermediate using &lt;b&gt;&lt;a href="http://incanter.org"&gt;incanter&lt;/a&gt;&lt;/b&gt; (an R-like statistics environment using clojure), and also have a &lt;b&gt;&lt;i&gt;vcf2tsv&lt;/i&gt;&lt;/b&gt; command-line script that converts a VCF file to its tab-delimited counterpart, but this time including &lt;i&gt;all&lt;/i&gt; the data that is in the file instead of a fixed selection of tags.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The vcf2tsv script converts the above sample data into:&lt;br /&gt;&lt;pre&gt;CHROM POS   ID REF ALT QUAL   FILTER INFO-MQ INFO-DB INFO-MQ0 INFO-LowMQ       SAMPLE1-GT SAMPLE1-DP SAMPLE1-GQ&lt;br /&gt;1     12345 .  A   G   249.00 0      23.66   1       26       0.2921,0.2921,89 1/1        89         99.00&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;So to generate a histogram of the quality scores:&lt;br /&gt;&lt;pre name="code" class="lisp"&gt;(use 'bioclojure)&lt;br /&gt;(ns bioclojure)&lt;br /&gt;(def snps (load-vcf "data.vcf"))&lt;br /&gt;(with-data snps&lt;br /&gt;  (view (histogram ($ :QUAL) :nbins 50)))&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The result:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/TGQgtMTi5fI/AAAAAAAADS8/JUWJ4cU857s/s1600/incanter.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 256px;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/TGQgtMTi5fI/AAAAAAAADS8/JUWJ4cU857s/s320/incanter.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5504560605322995186" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;You can download the &lt;b&gt;bioclojure library&lt;/b&gt; (which for now only has this functionality :-) from &lt;a href="http://github.com/jandot/bioclojure"&gt;github&lt;/a&gt;. That page also shows how to use the library. 
Feel free to fork and add new functionality; I only have time for very small incremental additions :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-364853970176312080?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/364853970176312080'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/364853970176312080'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2010/08/vcf-tab-delimited-files-and-bioclojure.html' title='VCF, tab-delimited files and bioclojure'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_t6Ob1J7aZ0A/TGQgtMTi5fI/AAAAAAAADS8/JUWJ4cU857s/s72-c/incanter.png' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-5907857406183389216</id><published>2010-07-16T16:28:00.002+02:00</published><updated>2010-07-16T17:23:41.360+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='genvizlab'/><title type='text'>Postdoc position - Genomic variation discovery and visualization</title><content type='html'>&lt;div&gt;Just a short note...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Even though my position in Leuven only starts in October, I've already been involved in writing and defending a major grant. 
We've set up a consortium in Leuven (&lt;b&gt;SymBioSys 2&lt;/b&gt;) consisting of 6 PIs "focusing on how individual genomic variation leads to disease through cascading effects across biological networks". This should be a good stepping stone to get my own lab running.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The grant comprises six workpackages, ranging from the analysis of the raw next-gen sequencing data, through the application of network algorithms for gene prioritization, to the ultimate application in a couple of disease fields. I'll be in charge of the workpackage &lt;b&gt;"Genomic variation discovery and visualization"&lt;/b&gt;. Within this WP, we will apply and improve existing methods for the discovery, annotation and prioritization of &lt;b&gt;SNPs and structural variation&lt;/b&gt; based on data available from the Leuven University hospital (HiSeq and 454). We'll develop a pipeline that can be used within and outside of the university. In addition - and I envision this to be the bigger part of the work - we'll work on &lt;b&gt;visualization&lt;/b&gt; of these data. Several issues remain in this field: data can be &lt;b&gt;too big&lt;/b&gt; to be readily visualized (&lt;i&gt;e.g.&lt;/i&gt; locations of aberrantly mapped read pairs), or &lt;b&gt;too complex&lt;/b&gt; (&lt;i&gt;e.g.&lt;/i&gt; structural variation between individuals or family structures).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So I have a &lt;a href="http://bit.ly/9XOmyU"&gt;&lt;b&gt;postdoc vacancy&lt;/b&gt;&lt;/a&gt; for someone with a good understanding of &lt;b&gt;genetics and next-generation sequencing&lt;/b&gt; to work on this project. He/she should have good &lt;b&gt;programming and statistics skills&lt;/b&gt;. Ideally, he/she will have &lt;b&gt;visualization experience&lt;/b&gt; as well or be very interested to sink his/her teeth into the subject. 
By visualization experience, I mean visualization for statistical purposes (&lt;i&gt;e.g.&lt;/i&gt; using R), but also more general visualization (&lt;i&gt;e.g.&lt;/i&gt; using &lt;b&gt;Processing&lt;/b&gt;). Expect to be talking a lot about indexing methods, mapreduce, visual encoding, ...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The SymBioSys2 consortium consists of young scientists in the fields of bioinformatics, human genetics and cancer. As I'll only be starting my own group this fall, the person doing the postdoc will have the chance to have a real impact on where my group will be going :-)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-5907857406183389216?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5907857406183389216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5907857406183389216'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2010/07/postdoc-position-genomic-variation.html' title='Postdoc position - Genomic variation discovery and visualization'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-5092010099640602013</id><published>2010-06-25T14:51:00.013+02:00</published><updated>2010-06-25T17:05:22.909+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='clojure'/><category scheme='http://www.blogger.com/atom/ns#' term='incanter'/><title type='text'>Encounter with incanter - about 
clojure, incanter and bioinformatics</title><content type='html'>&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://incanter.org/incanter-i-logo-holo.png"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 173px; height: 175px;" src="http://incanter.org/incanter-i-logo-holo.png" border="0" alt="" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;I have been a bit frustrated lately by the fact that for many of my analyses I have to write a ruby script to mangle my data first, then resort to R to add a statistic to each of the datapoints, go back to ruby to mangle the result, repeat, rinse, and finally make plots in R. Of course as a bioinformatician you're used to that and if necessary you write wrapper/pipeline scripts to handle this all for you if you know that this won't be the only time you have to do the analysis. But often you don't know. And breaking the analyses up into chunks just based on the tool you use will make you lose the overview.&lt;br /&gt;&lt;br /&gt;Being relatively new to R (less than 2 years), some of its idiosyncrasies keep biting me. Just last week I had a dataframe to which I wanted to add a column which was the mean of two other columns. So from&lt;br /&gt;&lt;pre&gt;1    9&lt;br /&gt;8   16&lt;br /&gt;2    4&lt;/pre&gt;&lt;br /&gt;I wanted to go to:&lt;br /&gt;&lt;pre&gt;1    9    5&lt;br /&gt;8   16   12&lt;br /&gt;2    4    3&lt;/pre&gt;&lt;br /&gt;So you'd think (at least I would):&lt;br /&gt;&lt;pre name="code"&gt;df$new_column &lt;- mean(df$first_column, df$second_column)&lt;/pre&gt;and you'd be wrong. Just try it out... But it works for adding a column that is the sum of the other two, right? Don't get me wrong: many people are &lt;i&gt;very&lt;/i&gt; good at R. Including the colleagues and ex-colleagues whom I constantly ask for help. 
If you're reading this: you know who you are, and thank you.&lt;br /&gt;&lt;br /&gt;There has been some talk on FriendFeed lately about the &lt;b&gt;&lt;a href="http://clojure.org/"&gt;clojure&lt;/a&gt;&lt;/b&gt; programming language, which is a Lisp-inspired functional language that runs on the JVM. There is also &lt;a href="http://incanter.org/"&gt;&lt;b&gt;incanter&lt;/b&gt;&lt;/a&gt;, a "clojure-based, R-like platform for statistical computing and graphics". Apart from the fact that functional languages are said to be good at working with huge data and concurrency, I was interested because, if incanter can do what I need from R, an incanter script is basically just a clojure script. So no switching between ruby/R/ruby/R/ruby/R; just do everything in the same place.&lt;br /&gt;&lt;br /&gt;So I borrowed someone's &lt;b&gt;"Programming Clojure"&lt;/b&gt; book, got onto some websites, downloaded incanter and gave it a spin this week. And my verdict: I'm impressed. It's a bit of a warp in your brain at first - being a functional language and all - but once you get the hang of it, it is easy to write. Of course I haven't done anything really difficult yet.&lt;p&gt;&lt;/p&gt;&lt;div&gt;You'll get used to the &lt;b&gt;prefix notation&lt;/b&gt; ("(+ 1 2)") instead of the infix notation ("1 + 2"). At least I did. Maybe it helped that the calculator that I have been using for more than a decade - my trusted HP 42S - also doesn't use infix: "1 2 +".&lt;br /&gt;&lt;br /&gt;So let's go for &lt;b&gt;an example&lt;/b&gt;. Let's say I have a file with SNPs. For each SNP I have the position, a quality score and the reference and alternative alleles. Suppose I want to know &lt;i&gt;what the ratio of transitions to transversions is for a given set of quality score cutoffs&lt;/i&gt;. 
So the file looks like this:&lt;br /&gt;&lt;pre&gt;1   123    29   A   G&lt;br /&gt;1   245    93   C   G&lt;br /&gt;1   832    51   C   T&lt;br /&gt;1   1234   63   T   G&lt;br /&gt;1   2345   75   A   C&lt;br /&gt;1   8315    9   C   A&lt;br /&gt;1   9213   59   T   G&lt;br /&gt;...&lt;/pre&gt;For a list of quality score cutoffs (let's say [10,20,40,60,80,100]), I wanted to get the ratio of transitions to transversions. Output would look like this:&lt;br /&gt;&lt;pre&gt;10   1.8&lt;br /&gt;20   1.9&lt;br /&gt;40   2.3&lt;br /&gt;60   2.3&lt;br /&gt;80   2.4&lt;br /&gt;100  2.4&lt;/pre&gt;&lt;br /&gt;I'd be interested to know how people would do this in R. Please feel free to put solutions in the comments. I can only learn from them...&lt;br /&gt;&lt;br /&gt;But how about doing this in clojure/incanter? Being new to the language, I first wrote a script that would add "ti" or "tv" as an extra column, and another one that would calculate ti/tv for those cutoffs.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 1: Assigning "ti" or "tv" to SNPs&lt;/b&gt;&lt;br /&gt;&lt;pre name="code" class="lisp"&gt;(use '(incanter core io charts))&lt;br /&gt;(defn ti?&lt;br /&gt;"Checks if two alleles constitute a transition"&lt;br /&gt;[allele-1 allele-2]&lt;br /&gt;(if (contains? #{"AG" "CT"} (apply str (sort [allele-1 allele-2]))) true false)&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;(defn tv?&lt;br /&gt;"Checks if two alleles constitute a transversion"&lt;br /&gt;[allele-1 allele-2]&lt;br /&gt;(if (contains? #{"AC" "CG" "GT" "AT"} (apply str (sort [allele-1 allele-2]))) true false)&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;(defn ti-or-tv&lt;br /&gt;"Returns 'ti' for a transition, 'tv' for a transversion, and 'other' otherwise"&lt;br /&gt;[allele-1 allele-2]&lt;br /&gt;(if (ti? allele-1 allele-2) "ti"&lt;br /&gt; (if (tv? 
allele-1 allele-2) "tv" "other"))&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;(def data&lt;br /&gt;(read-dataset "data_file.tsv" :header true :delim \tab)&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;(def titv-result (map #(ti-or-tv (:ref %) (:alt %)) (:rows data)))&lt;br /&gt;&lt;br /&gt;(def data&lt;br /&gt;(col-names&lt;br /&gt; (conj-cols data titv-result)&lt;br /&gt; [:chr :pos :qual :ref :alt :titv]))&lt;br /&gt;&lt;br /&gt;(save data "data_file_with_titv.tsv")&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This creates a file "data_file_with_titv.tsv" that has an additional column that either says "ti" or "tv".&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 2: Calculate ti/tv using different quality score cutoffs&lt;/b&gt;&lt;br /&gt;&lt;pre name="code" class="lisp"&gt;(use '(incanter core io charts))&lt;br /&gt;&lt;br /&gt;(defn filter-qual&lt;br /&gt;"Returns only rows of a dataset where quality score is better than cutoff"&lt;br /&gt;[ds cutoff]&lt;br /&gt;(sel ds :filter #(&gt; (nth % 2) cutoff)))&lt;br /&gt;&lt;br /&gt;(defn count-ti&lt;br /&gt;"Count the number of rows that have 'ti' in the 6th column"&lt;br /&gt;[ds]&lt;br /&gt;(nrow (sel ds :filter #(= (nth % 5) "ti"))))&lt;br /&gt;&lt;br /&gt;(defn count-tv&lt;br /&gt;"Count the number of rows that have 'tv' in the 6th column"&lt;br /&gt;[ds]&lt;br /&gt;(nrow (sel ds :filter #(= (nth % 5) "tv"))))&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(defn titv&lt;br /&gt;"Calculate the ti/tv ratio in a given dataset"&lt;br /&gt;[ds]&lt;br /&gt;(float (/ (count-ti ds) (count-tv ds))))&lt;br /&gt;&lt;br /&gt;(def filename&lt;br /&gt;"data_file_with_titv.tsv")&lt;br /&gt;&lt;br /&gt;(def data&lt;br /&gt;(read-dataset filename :header true :delim \tab)&lt;br /&gt;)&lt;br /&gt;&lt;br /&gt;(apply max (sel data :cols :qual)) ; =&gt; max qual = 527.58&lt;br /&gt;&lt;br /&gt;(def cutoffs (range 10 500 10))&lt;br /&gt;&lt;br /&gt;(def titv-values (map titv (map #(filter-qual data %) cutoffs)))&lt;br /&gt;&lt;br /&gt;(view (xy-plot cutoffs titv-values))&lt;br 
/&gt;&lt;/pre&gt;&lt;br /&gt;...and this nicely creates the graph that I want.&lt;br /&gt;&lt;br /&gt;The code may seem strange, but once you get used to the prefix notation thing you'll be alright.&lt;br /&gt;&lt;br /&gt;Of course there are &lt;b&gt;downsides&lt;/b&gt; to using clojure/incanter as well. As clojure is a young language, there is no application that can come close to &lt;b&gt;R.app&lt;/b&gt; yet. There are plugins for eclipse, netbeans, vi and emacs, but I had trouble installing any of these. There is a one-click incanter Mac application, but it does not provide the same user-friendliness as R.app. My current (well: these few days) setup: I start an interactive clojure/incanter session &lt;pre&gt;java -Xmx800m -cp /Library/Clojure/lib jline.ConsoleRunner clojure.main&lt;/pre&gt; and have a TextMate window open. I just write my script in TextMate and copy/paste to the command line. Not ideal, but workable.&lt;br /&gt;&lt;br /&gt;I also don't like having to refer to &lt;b&gt;column numbers&lt;/b&gt; on some occasions instead of always being able to refer to column names (see the "&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;(nth % 2)&lt;/span&gt;" bits in the code). But this might be me: you can probably do that but I just don't know how yet.&lt;br /&gt;&lt;br /&gt;Overall, I'd say that my &lt;b&gt;first encounter with incanter has been a great success&lt;/b&gt;. The clojure language seems to be very &lt;b&gt;consistent&lt;/b&gt;, which greatly helps in picking it up. You have to make a &lt;i&gt;click&lt;/i&gt; in your head, but I'm sure any bioinformatician is easily capable of doing that. I would strongly recommend giving clojure and incanter a try. Get the book. Browse the web. Rewrite one of your small R scripts in incanter. 
Response on the incanter Google group is also very fast, including from the developer of incanter David Liebke (&lt;a href="http://twitter.com/liebke"&gt;@liebke&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;Further information can be found in the usual places: &lt;a href="http://clojure.org/"&gt;clojure.org&lt;/a&gt;, &lt;a href="http://incanter.org/"&gt;incanter.org&lt;/a&gt; and google. Also check out &lt;a href="http://data-sorcery.org/"&gt;data-sorcery.org&lt;/a&gt;, the incanter blog.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Update&lt;/b&gt;: Tim Yates has created an R version of the incanter solution above. I'm impressed. There's no way I would have been able to write that... He's now in my "R guru" box. See &lt;a href="https://gist.github.com/31ccce346f1a588acedd"&gt;here&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-5092010099640602013?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5092010099640602013'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5092010099640602013'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2010/06/encounter-with-incanter-about-clojure.html' title='Encounter with incanter - about clojure, incanter and bioinformatics'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-2801975997035702429</id><published>2010-05-19T14:00:00.000+02:00</published><updated>2010-05-19T14:51:33.860+02:00</updated><title type='text'>Threads in ruby: probably not how to use them</title><content type='html'>I should create an online labbook with code examples of how I do things. I keep going back to an example script I have to copy/paste the code for handling different &lt;b&gt;threads in ruby&lt;/b&gt;. But I'll put it here for the moment :-)&lt;br /&gt;&lt;br /&gt;Suppose I have a file with several millions of lines containing information on SNPs. And suppose I have a database that already contains data for those SNPs. And suppose I want to update the entries in the database with the data from the input file.&lt;br /&gt;&lt;br /&gt;Please note: this is a &lt;b&gt;&lt;i&gt;quick hack&lt;/i&gt;&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;require 'rubygems'&lt;br /&gt;require 'progressbar'&lt;br /&gt;&lt;br /&gt;MAX_NR_OF_THREADS = 5&lt;br /&gt;# to_i skips any leading whitespace in the wc output and stops at the first non-digit&lt;br /&gt;nr_of_lines = `wc -l input_file.tsv`.to_i&lt;br /&gt;pbar = ProgressBar.new('processing', nr_of_lines.to_f/MAX_NR_OF_THREADS)&lt;br /&gt;File.open('input_file.tsv').each_slice(MAX_NR_OF_THREADS) do |slice|&lt;br /&gt;  pbar.inc&lt;br /&gt;  threads = Hash.new&lt;br /&gt;  slice.each do |line|&lt;br /&gt;    threads[line] = Thread.new do&lt;br /&gt;      # do the actual line parsing, DB lookup and DB updates&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;  threads.values.each do |thread|&lt;br /&gt;    thread.join&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;pbar.finish&lt;/pre&gt;&lt;br /&gt;I know this is far from perfect:&lt;div&gt;&lt;ul&gt;&lt;li&gt;I shouldn't need to create that hash.&lt;/li&gt;&lt;li&gt;This way all concurrent threads wait for each other before the next slice is taken from the 
input file. If one of the 5 threads takes a really long time, the other ones will wait but could instead start parsing the next lines in the input file.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Don't think less of me for this code...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-2801975997035702429?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/2801975997035702429/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2009/01/threads-in-ruby-probably-not-how-to-use.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2801975997035702429'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2801975997035702429'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/01/threads-in-ruby-probably-not-how-to-use.html' title='Threads in ruby: probably not how to use them'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-9096713927330130616</id><published>2010-01-07T11:31:00.004+01:00</published><updated>2010-01-07T12:50:54.285+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='mongodb'/><title type='text'>Tipping my toes in 
mongodb with ruby</title><content type='html'>Read the &lt;a href="http://nsaunders.wordpress.com/2010/01/07/how-to-archive-data-via-an-api-using-ruby-and-mongodb/"&gt;excellent post&lt;/a&gt; by Neil Saunders on using ruby and mongodb to archive his posts on FriendFeed, prompting me to finally write down my own experiences with mongodb. So here goes...&lt;br /&gt;&lt;br /&gt;Let's have a look at the pilot SNP data from the &lt;a href="http://www.1000genomes.org/"&gt;1000genomes&lt;/a&gt; project. The data released in April 2009 contain lists of SNPs from a low-coverage sequencing effort in the CEU (European descent), YRI (African) and JPTCHB (Asian) populations. SNPs can be downloaded from &lt;a href="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2009_04/"&gt;here&lt;/a&gt;; get the files called something.sites.2009_04.gz. The exercise that we'll be performing here is to get an idea of &lt;span style="font-weight: bold;"&gt;how many SNPs are in common between those populations&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The input data&lt;/span&gt;&lt;br /&gt;The input data contains chromosome, position, reference allele (based on reference sequence), alternative allele and allele frequency in that population. Some sample lines from CEU.sites.2009_04.gz:&lt;br /&gt;&lt;pre&gt;1 223336 C G 0.017544&lt;br /&gt;1 224176 C T 0.052632&lt;br /&gt;1 224344 T A 0.824561&lt;br /&gt;1 224419 C T 0.008772&lt;br /&gt;1 224438 C T 0.412281&lt;br /&gt;1 224472 T C 0.122807&lt;/pre&gt;&lt;span style="font-weight: bold;"&gt;What will it look like in a database?&lt;/span&gt;&lt;br /&gt;If we loaded this into a &lt;span style="font-weight: bold;"&gt;relational database&lt;/span&gt;, we could create a table with the following columns: chromosome, position, ceu_ref, ceu_alt, ceu_maf, yri_ref, yri_alt, yri_maf, jptchb_ref, jptchb_alt, jptchb_maf. 
However, we would end up with a &lt;span style="font-style: italic;"&gt;lot&lt;/span&gt; of NULLs in that table.&lt;br /&gt;&lt;br /&gt;For example: there is a SNP at position 522,311 on chromosome one in the CEU and YRI populations, but not in the JPTCHB population.&lt;br /&gt;&lt;span style=";font-family:courier new;font-size:78%;"  &gt;&lt;pre&gt;chr  pos      ceu_maj   ceu_min   ceu_maf   yri_maj   yri_min   yri_maf   jptchb_maj  jptchb_min  jptchb_maf&lt;br /&gt;1    223336   C         G         0.01754   G         C         0.47321   C           G           0.22034&lt;br /&gt;1    522311   C         A         0.05263   C         A         0.33036   NULL        NULL        NULL&lt;/pre&gt;&lt;/span&gt;(In this example, I have already recoded ref/alt allele to major/minor allele. In some cases such as the SNP at 223,336 the major allele is not the same in each population.)&lt;br /&gt;&lt;br /&gt;Actually, of the 21,742,359 SNPs, only 4,836,814 (about 22%) are present in all three populations. Moreover, there are 13,957,866 (64%!) SNPs that are present in only one of the populations. Ergo: loads of NULLs.&lt;br /&gt;&lt;br /&gt;This is where you can start thinking of using a &lt;span style="font-weight: bold;"&gt;document-oriented database&lt;/span&gt; for storing these SNP data: each document will be tailored to a specific SNP and will e.g. not refer to the JPTCHB population if the SNP is not present in that population. Enter &lt;a href="http://www.mongodb.org/"&gt;&lt;span style="font-weight: bold;"&gt;mongodb&lt;/span&gt;&lt;/a&gt;. Apparently the "mongo" comes from "hu&lt;span style="font-style: italic;"&gt;mongo&lt;/span&gt;us". 
I hope it will live up to that name...&lt;br /&gt;&lt;br /&gt;The above two SNPs could be represented in two json documents like this:&lt;br /&gt;&lt;pre&gt;{&lt;br /&gt;  "_id" : "1_223336",&lt;br /&gt;  "chr" : "1",&lt;br /&gt;  "pos" : 223336,&lt;br /&gt;  "kg" : {&lt;br /&gt;    "ceu" : {&lt;br /&gt;      "major" : "C",&lt;br /&gt;      "minor" : "G",&lt;br /&gt;      "maf" : 0.017544&lt;br /&gt;    },&lt;br /&gt;    "yri" : {&lt;br /&gt;      "maf" : 0.473214,&lt;br /&gt;      "major" : "G",&lt;br /&gt;      "minor" : "C"&lt;br /&gt;    },&lt;br /&gt;    "jptchb" : {&lt;br /&gt;      "maf" : 0.220339,&lt;br /&gt;      "major" : "C",&lt;br /&gt;      "minor" : "G"&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;{&lt;br /&gt;  "_id" : "1_522311",&lt;br /&gt;  "chr" : "1",&lt;br /&gt;  "pos" : 522311,&lt;br /&gt;  "kg" : {&lt;br /&gt;    "ceu" : {&lt;br /&gt;      "major" : "C",&lt;br /&gt;      "minor" : "A",&lt;br /&gt;      "maf" : 0.05263&lt;br /&gt;    },&lt;br /&gt;    "yri" : {&lt;br /&gt;      "maf" : 0.33036,&lt;br /&gt;      "major" : "C",&lt;br /&gt;      "minor" : "A"&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;}&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Getting the data in the database&lt;/span&gt;&lt;br /&gt;So how do we load this data into a mongo database? You'll obviously need a mongo daemon running, but I'll leave that as an exercise to the reader as it's very easy and well-explained on the mongodb website. My approach was to first import all CEU SNPs straight from a json-formatted file, and then parse the YRI and JPTCHB files.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;1. Loading the CEU SNPs&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The mongoimport tool takes a json-formatted file and - guess what - imports it into a mongo database. 
Just reformat the lines in the original CEU.sites.2009_04 to json (see below) and run &lt;span style="font-family:courier new;"&gt;mongoimport --db test --collection snps --file ceu.json&lt;/span&gt;. Ceu.json looks like this:&lt;br /&gt;&lt;pre&gt;{_id: "1_211", chr: "1", pos: 211, kg: {ceu: {major: "A", minor: "G", maf: 0.201754}}}&lt;br /&gt;{_id: "1_216", chr: "1", pos: 216, kg: {ceu: {major: "T", minor: "A", maf: 0.035088}}}&lt;br /&gt;{_id: "1_229", chr: "1", pos: 229, kg: {ceu: {major: "A", minor: "C", maf: 0.017544}}}&lt;/pre&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-weight: bold;"&gt;2. Loading the YRI and JPTCHB SNPs&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Loading the YRI and JPTCHB SNPs is a bit trickier because we need to check if the SNP is already defined. If it's not we'll just add a new document, but if it is we have to &lt;span style="font-weight: bold;"&gt;update&lt;/span&gt; the existing document and add the population-specific details. For this I use a ruby script with the &lt;span style="font-weight: bold;"&gt;mongo gem&lt;/span&gt; (I already added an additional column to the input files containing a unique ID per SNP consisting of chromosome and position):&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'mongo'&lt;br /&gt;require 'progressbar'&lt;br /&gt;&lt;br /&gt;db = Mongo::Connection.new.db("test")&lt;br /&gt;coll = db.collection('snps')&lt;br /&gt;&lt;br /&gt;pbar = ProgressBar.new('processing', 13759844)&lt;br /&gt;File.open('YRI.sites.2009_04.with_uid').each do |line|&lt;br /&gt;  pbar.inc&lt;br /&gt;  uid, chr, pos, allele1, allele2, allele2_freq = line.chomp.split(/\t/)&lt;br /&gt;  allele2_freq = allele2_freq.to_f&lt;br /&gt;  major, minor, maf = nil, nil, nil&lt;br /&gt;  if allele2_freq.to_f &lt;= 0.5&lt;br /&gt;    major, minor, maf = allele1, allele2, allele2_freq&lt;br /&gt;  else&lt;br /&gt;    major, minor, maf = allele2, allele1, 1 - allele2_freq&lt;br /&gt;  end&lt;br 
/&gt;  &lt;br /&gt;  snp = coll.find_one(:_id =&gt; uid)&lt;br /&gt;  if snp.nil?&lt;br /&gt;    snp = {:_id =&gt; uid, :chr =&gt; chr, :pos =&gt; pos.to_i, :kg =&gt; {:yri =&gt; {:major =&gt; major, :minor =&gt; minor, :maf =&gt; maf}}}&lt;br /&gt;  else&lt;br /&gt;    kg_data = snp['kg']&lt;br /&gt;    kg_data['yri'] = {:major =&gt; major, :minor =&gt; minor, :maf =&gt; maf}&lt;br /&gt;  end&lt;br /&gt;  &lt;br /&gt;  coll.save(snp)&lt;br /&gt;end&lt;br /&gt;pbar.finish&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;For each SNP, we first get the major and minor allele as well as the minor allele frequency based on the alternative allele frequency. Next, we search the entire SNP collection for the SNP ID (which consists of chromosome_position). If it doesn't exist, we create a completely new document, otherwise we fetch the "kg" data ("kg" stands for "1000genomes") and add a 'yri' key-value pair.&lt;br /&gt;&lt;br /&gt;Do the same for JPTCHB and you should get a complete collection.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Summarizing the data&lt;/span&gt;&lt;br /&gt;Going back to the question we're asking: how many SNPs are in common among these populations? 
Again, a ruby script:&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'mongo'&lt;br /&gt;&lt;br /&gt;map = "function() { " +&lt;br /&gt; "var keys = [];" +&lt;br /&gt; "for ( item in this['kg'] ) { keys.push(item) }" +&lt;br /&gt; "emit(keys.sort().join(';'), {count: 1})" +&lt;br /&gt;"}"&lt;br /&gt;reduce = "function(key, values) { " +&lt;br /&gt; "var sum = 0; " +&lt;br /&gt; "values.forEach(function(doc) { " +&lt;br /&gt; " sum += doc.count; " +&lt;br /&gt; "}); " +&lt;br /&gt; "return {count: sum}; " +&lt;br /&gt;"};"&lt;br /&gt;&lt;br /&gt;db = Mongo::Connection.new.db("test")&lt;br /&gt;coll = db.collection("snps")&lt;br /&gt;&lt;br /&gt;result = coll.map_reduce(map, reduce)&lt;br /&gt;result.find.to_a.each do |r|&lt;br /&gt;  puts ['{', r['_id'], ':', r['value']['count'].to_i, '}'].join(" ")&lt;br /&gt;end&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This script takes 50 minutes to run using a mongo database on my MacBook laptop.&lt;br /&gt;&lt;br /&gt;Counting things in a collection makes at least &lt;span style="font-style: italic;"&gt;me&lt;/span&gt; think of &lt;span style="font-weight: bold;"&gt;mapreduce&lt;/span&gt;. And I had seen mapreduce mentioned on the mongodb website, so clearly wanted to give that a go. Unfortunately, you have to define the map and reduce functions in javascript, which is a bit unsightly within a ruby script, but so be it. 
For more info, see &lt;a href="http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/"&gt;this&lt;/a&gt; blog post by Kyle Banker.&lt;br /&gt;&lt;br /&gt;This shows us the following numbers of SNPs:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;CEU alone: 2,954,418&lt;/li&gt;&lt;li&gt;YRI alone: 6,901,757&lt;/li&gt;&lt;li&gt;JPTCHB alone: 4,101,691&lt;/li&gt;&lt;li&gt;CEU/YRI: 915,476&lt;/li&gt;&lt;li&gt;CEU/JPTCHB: 926,406&lt;/li&gt;&lt;li&gt;YRI/JPTCHB: 1,105,797&lt;/li&gt;&lt;li&gt;all three: 4,836,814&lt;/li&gt;&lt;/ul&gt;As Pierre would say: that's it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-9096713927330130616?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9096713927330130616'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9096713927330130616'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2010/01/tipping-my-toes-in-mongodb-with-ruby.html' title='Tipping my toes in mongodb with ruby'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-7495487871463271793</id><published>2009-09-18T12:31:00.006+02:00</published><updated>2009-09-18T13:11:51.771+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Trying out mapreduce - on the farm</title><content type='html'>Received an email this week from Sanger 
helpdesk that they installed a test hadoop system on the farm with 2 nodes. Thanks guys! First thing to do, obviously, was to repeat the &lt;span style="font-weight: bold;"&gt;streaming mapreduce&lt;/span&gt; exercise I did on my own machine (see my &lt;a href="http://saaientist.blogspot.com/2009/09/trying-out-mapreduce.html"&gt;previous post&lt;/a&gt;). The only difference from my local setup is that this time I had to handle HDFS.&lt;br /&gt;&lt;br /&gt;As a recap from my previous post: I'll be running an equivalent of the following:&lt;br /&gt;&lt;pre&gt;cat snps.txt | ruby snp_mapper.rb | sort | ruby snp_reducer.rb&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Setting up&lt;/span&gt;&lt;br /&gt;First thing the Sanger wiki told me was to format my HDFS space:&lt;pre&gt;hadoop namenode -format&lt;/pre&gt; This apparently only affects my own space... After that I could start playing with hadoop.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Where am I?&lt;/span&gt;&lt;br /&gt;It looks like hadoop installs its own complete filesystem: even though my path on the server is /home/users/a/aerts, a &lt;pre&gt;hadoop fs -lsr /&lt;/pre&gt; shows that in the HDFS system I'm at /user/aerts.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Preparing the run&lt;/span&gt;&lt;br /&gt;First off: create a directory to work in. Let's call that "locustree" with two subdirectories, called "input" and "output".&lt;br /&gt;&lt;pre&gt;hadoop fs -mkdir locustree&lt;br /&gt;hadoop fs -mkdir locustree/input&lt;br /&gt;hadoop fs -mkdir locustree/output&lt;/pre&gt;And copy your datafile to the input directory:&lt;br /&gt;&lt;pre&gt;hadoop fs -put snps.txt locustree/input/&lt;/pre&gt;&lt;span style="font-weight: bold;"&gt;Running the run&lt;/span&gt;&lt;br /&gt;The hadoop documentation mentions that any shebang line in the mapper and reducer scripts is unlikely to work, so you have to call ruby explicitly. 
Provide both scripts as "-file" arguments to hadoop. Finally, from your local directory containing the scripts, run the following &lt;span style="font-weight: bold;"&gt;command&lt;/span&gt; to run the mapreduce job:&lt;br /&gt;&lt;pre&gt;hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \&lt;br /&gt;  -input locustree/input/snps.txt \&lt;br /&gt;  -mapper "/usr/local/bin/ruby snp_mapper.rb" \&lt;br /&gt;  -reducer "/usr/local/bin/ruby snp_reducer.rb" \&lt;br /&gt;  -output locustree/output/snp_index \&lt;br /&gt;  -file snp_mapper.rb \&lt;br /&gt;  -file snp_reducer.rb&lt;/pre&gt;&lt;br /&gt;Et voila: a new directory snp_index now contains the file spit out by the snp_reducer.rb script.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-7495487871463271793?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/7495487871463271793'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/7495487871463271793'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/09/trying-out-mapreduce-on-farm.html' title='Trying out mapreduce - on the farm'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-2572684368689407602</id><published>2009-09-02T19:50:00.000+02:00</published><updated>2009-09-02T19:50:29.669+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><category 
scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>Trying out mapreduce</title><content type='html'>&lt;a href="http://www.flickr.com/photos/antichrist/3427853501/" title="learning to map/reduce by [niv], on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3605/3427853501_a06b608439.jpg" alt="learning to map/reduce" height="375" width="500" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;i&gt;Photo by niv available from Flickr&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I have long been interested in trying out &lt;span style="font-weight: bold;"&gt;mapreduce&lt;/span&gt; in my data pipelines. The Wellcome Trust Sanger Institute has several huge compute farms that I normally use, but they don't support mapreduce jobs. Quite understandable from IT's and the institute's point of view because it's a mammoth task to keep those things running. But it also means that I can't put a foot on the mapreduce path.&lt;br /&gt;&lt;br /&gt;There are options to run mapreduce on your own, however. One is to &lt;span style="font-weight: bold;"&gt;install &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;hadoop&lt;/span&gt;&lt;span style="font-weight: bold;"&gt; locally&lt;/span&gt; with the restriction that you can only use one node: your own machine. &lt;a href="http://www.raja-gopal.com/?p=42"&gt;This walkthrough&lt;/a&gt; will get you started with that. Another approach is to use &lt;a style="font-weight: bold;" href="http://aws.amazon.com/elasticmapreduce/"&gt;Amazon Elastic MapReduce&lt;/a&gt; (AWS EMR).&lt;br /&gt;&lt;br /&gt;First off, I must admit that I'm &lt;span style="font-weight: bold;"&gt;far from familiar with mapreduce&lt;/span&gt; at this point in time. As far as I understand, it's ideal for aggregating data when the algorithm can be run on a subset of the data as well as on the complete dataset. 
The &lt;span style="font-weight: bold;"&gt;mapping step&lt;/span&gt; of the framework will create different subsets of that data, run the required program on it and return partial results. These results should be in the form of &lt;span style="font-weight: bold;"&gt;key/value-pairs&lt;/span&gt; and will serve as the input for a &lt;span style="font-weight: bold;"&gt;reduce step&lt;/span&gt; that combines all intermediate results into one.&lt;br /&gt;&lt;br /&gt;What's held me back from using AWS EMR is the cost. It's practically nothing, but anything I run on their servers at this moment is paid for with my own money. So today I installed hadoop locally and started coding.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;The pilot&lt;/span&gt;&lt;br /&gt;One of the projects I'm working on is &lt;a style="font-weight: bold;" href="http://github.com/jandot/locustree"&gt;LocusTree&lt;/a&gt;: a way of indexing genomic features that allows for quick visualization at different resolutions. In this data structure the genome is first divided into bins of 4,000 bp (called level 0). The next step sees the creation of new "parent" bins that each hold 2 of the original ones (so each covering 8,000 bp; called level 1). You keep on grouping 2 bins into a parent bin until you end up with a single bin for a complete chromosome. For chromosome 1 that means 17 levels, I think.&lt;br /&gt;&lt;br /&gt;Each of the features - for example a list of 15 million SNPs or log2ratios for 42 million probes - is put in the smallest node in which it fits entirely. A feature ranging from position 10 to 20 will fit in the first bin on level 0; but a feature from position 3,990 to 4,050 will have to go in the first bin on level 1. 
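&lt;br /&gt;&lt;br /&gt;A minimal sketch of that placement rule in ruby, assuming 1-based inclusive positions. The method name is hypothetical and this is an illustration only, not the LocusTree code itself: it doubles the bin width until start and stop land in the same bin.

```ruby
BIN_SIZE = 4_000 # width of a level-0 bin in bp, as described above

# Return [level, bin_index] of the smallest bin that contains the whole
# interval start..stop (1-based, inclusive positions are assumed).
def smallest_enclosing_bin(start, stop)
  level = 0
  loop do
    width = BIN_SIZE * (2**level)      # bins double in width at every level
    start_bin = (start - 1) / width    # integer division gives the bin index
    stop_bin  = (stop - 1) / width
    return [level, start_bin] if start_bin == stop_bin
    level += 1
  end
end

p smallest_enclosing_bin(10, 20)        # prints [0, 0]: first bin on level 0
p smallest_enclosing_bin(3_990, 4_050)  # prints [1, 0]: first bin on level 1
```

The second call crosses the boundary between the first two level-0 bins, so it only fits once the bins are 8,000 bp wide.&lt;br /&gt;&lt;br /&gt;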
Subsequently a count and if applicable some aggregate value like sum, min or max is stored in that node as well as in every parent node.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/Sp6Xj5T90MI/AAAAAAAADJo/SaPGx2wkIRQ/s1600-h/locustree.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 300px;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/Sp6Xj5T90MI/AAAAAAAADJo/SaPGx2wkIRQ/s400/locustree.png" alt="" id="BLOGGER_PHOTO_ID_5376901648062730434" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;So for each SNP/probe the indexing script has to find the smallest enclosing node and update the aggregate values for each parent up to level 3.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;The setup&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;I'm not good at java so used the &lt;span style="font-weight: bold;"&gt;hadoop streaming&lt;/span&gt; approach for this indexing. The input file contains about 15 million SNPs and looks like this:&lt;pre&gt;snp6216929   3   177718011   177718021&lt;br /&gt;snp6216930   3   177718267   177718277&lt;br /&gt;snp6216931   3   177718268   177718278&lt;br /&gt;snp6216932   3   177718294   177718304&lt;br /&gt;snp6216933   3   177718612   177718622&lt;br /&gt;snp6216934   3   177718629   177718639&lt;br /&gt;snp6216935   3   177718956   177718966&lt;br /&gt;snp6216936   3   177719529   177719539&lt;br /&gt;snp6216937   3   177719623   177719633&lt;/pre&gt;I want to get a list of nodes with the count of SNPs within that node and all child-nodes. Each node should also list the SNP names for which that node is the smallest enclosing node. 
So the output looks like:&lt;pre&gt;3:0:44429   9   snp6216929,snp6216930,snp6216931,snp6216932,snp6216933,snp6216934,snp6216935,snp6216936,snp6216937&lt;br /&gt;3:0:44430   1   snp6216938&lt;br /&gt;3:1:22214   9&lt;br /&gt;3:1:22215   1&lt;br /&gt;3:2:11107   10&lt;br /&gt;3:3:5553    10&lt;br /&gt;3:4:2776    10&lt;br /&gt;3:5:1388    10&lt;br /&gt;3:6:694     10&lt;br /&gt;3:7:347     10&lt;br /&gt;3:8:173     10&lt;br /&gt;3:9:86      10&lt;br /&gt;3:10:43     10&lt;br /&gt;3:11:21     10&lt;br /&gt;3:12:10     10&lt;br /&gt;3:13:5      10&lt;br /&gt;3:14:2      10&lt;br /&gt;3:15:1      10&lt;br /&gt;3:16:0      10&lt;br /&gt;3:17:0      10&lt;/pre&gt;The first column is the node (chromosome:level number:bin number within that level), the second column is the number of SNPs covered by that node, and the third column lists the names of the SNPs for which that node is the smallest enclosing node.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;The scripts&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Code for the &lt;span style="font-weight: bold;"&gt;mapper&lt;/span&gt;:&lt;pre name="code" class="ruby"&gt;#!/usr/bin/ruby&lt;br /&gt;CHROMOSOME_LENGTHS = {'1' =&gt; 247249719,&lt;br /&gt;                    '2' =&gt; 242951149,&lt;br /&gt;                    '3' =&gt; 199501827,&lt;br /&gt;                    '4' =&gt; 191273063,&lt;br /&gt;                    '5' =&gt; 180857866,&lt;br /&gt;                    '6' =&gt; 170899992,&lt;br /&gt;                    '7' =&gt; 158821424,&lt;br /&gt;                    '8' =&gt; 146274826,&lt;br /&gt;                    '9' =&gt; 140273252,&lt;br /&gt;                    '10' =&gt; 135374737,&lt;br /&gt;                    '11' =&gt; 134452384,&lt;br /&gt;                    '12' =&gt; 132349534,&lt;br /&gt;                    '13' =&gt; 114142980,&lt;br /&gt;                    '14' =&gt; 106368585,&lt;br /&gt;                    '15' =&gt; 100338915,&lt;br /&gt;                    '16' =&gt; 88827254,&lt;br /&gt;            
        '17' =&gt; 78774742,&lt;br /&gt;                    '18' =&gt; 76117153,&lt;br /&gt;                    '19' =&gt; 63811651,&lt;br /&gt;                    '20' =&gt; 62435964,&lt;br /&gt;                    '21' =&gt; 46944323,&lt;br /&gt;                    '22' =&gt; 49691432,&lt;br /&gt;                    '23' =&gt; 154913754,&lt;br /&gt;                    '24' =&gt; 57772954&lt;br /&gt;                   }&lt;br /&gt;&lt;br /&gt;def enclosing_node(chr,start,stop)&lt;br /&gt;start = start.to_i&lt;br /&gt;stop = stop.to_i&lt;br /&gt;level_number = ((Math.log(CHROMOSOME_LENGTHS[chr]) - Math.log(4000)).to_f/Math.log(2)).floor + 1&lt;br /&gt;resolution_at_level = 4000*(2**level_number)&lt;br /&gt;previous_start_bin = 0&lt;br /&gt;previous_level_number = level_number&lt;br /&gt;start_bin = (start-1).div(resolution_at_level)&lt;br /&gt;stop_bin = (stop-1).div(resolution_at_level)&lt;br /&gt;&lt;br /&gt;ancestors = Array.new&lt;br /&gt;while start_bin == stop_bin and level_number &gt;= 0&lt;br /&gt;  ancestors.push([chr, level_number + 1, previous_start_bin].join(":"))&lt;br /&gt;  previous_start_bin = start_bin&lt;br /&gt;  previous_level_number = level_number&lt;br /&gt;  level_number -= 1&lt;br /&gt;  resolution_at_level = 4000*(2**level_number)&lt;br /&gt;  start_bin = (start-1).div(resolution_at_level)&lt;br /&gt;  stop_bin = (stop-1).div(resolution_at_level)&lt;br /&gt;end&lt;br /&gt;return [[chr, level_number + 1, previous_start_bin].join(":"), ancestors.reverse].flatten&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;STDIN.each do |line|&lt;br /&gt;name, chr, start, stop = line.chomp.split(/\t/)&lt;br /&gt;&lt;br /&gt;encl_node, *ancestors = enclosing_node(chr,start,stop)&lt;br /&gt;STDOUT.puts [encl_node, 1, name].join("\t")&lt;br /&gt;ancestors.each do |node|&lt;br /&gt;  STDOUT.puts [node, 1].join("\t")&lt;br /&gt;end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;Don't worry about the details. 
It basically takes the input from standard input and writes this to standard output:&lt;pre&gt;3:0:44429   1   snp6216929&lt;br /&gt;3:1:22214   1&lt;br /&gt;3:2:11107   1&lt;br /&gt;3:3:5553    1&lt;br /&gt;3:4:2776    1&lt;br /&gt;3:5:1388    1&lt;br /&gt;3:6:694     1&lt;br /&gt;3:7:347     1&lt;br /&gt;3:8:173     1&lt;br /&gt;3:9:86      1&lt;br /&gt;3:10:43     1&lt;br /&gt;3:11:21     1&lt;br /&gt;3:12:10     1&lt;br /&gt;3:13:5      1&lt;br /&gt;3:14:2      1&lt;br /&gt;3:15:1      1&lt;br /&gt;3:16:0      1&lt;br /&gt;3:17:0      1&lt;br /&gt;3:0:44429   1   snp6216930&lt;br /&gt;3:1:22214   1&lt;br /&gt;3:2:11107   1&lt;br /&gt;3:3:5553    1&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;The &lt;span style="font-weight: bold;"&gt;reducer script&lt;/span&gt; takes all this and summarizes it:&lt;pre name="code" class="ruby"&gt;#!/usr/bin/ruby&lt;br /&gt;nodes = Hash.new&lt;br /&gt;STDIN.each do |line|&lt;br /&gt;node, count, ids = line.chomp.split(/\t/)&lt;br /&gt;parsed_ids = ids.split(/,/) unless ids.nil?&lt;br /&gt;if nodes.has_key?(node)&lt;br /&gt;  nodes[node][:count] += count.to_i&lt;br /&gt;  nodes[node][:accessions].push(parsed_ids) unless ids.nil?&lt;br /&gt;else&lt;br /&gt;  nodes[node] = Hash.new&lt;br /&gt;  nodes[node][:count] = count.to_i&lt;br /&gt;  nodes[node][:accessions] = Array.new&lt;br /&gt;  nodes[node][:accessions].push(parsed_ids) unless ids.nil?&lt;br /&gt;end&lt;br /&gt;end&lt;br /&gt;nodes.keys.each do |locus|&lt;br /&gt;STDOUT.puts [locus, nodes[locus][:count], nodes[locus][:accessions].flatten.join(',')].join("\t")&lt;br /&gt;end&lt;/pre&gt;Before running this under hadoop it's useful to just manually pipe these together on the unix command line:&lt;pre&gt;cat snps.txt | ruby snp_mapper.rb | ruby snp_reducer.rb | sort&lt;/pre&gt;&lt;br /&gt;Because this was the output I expected I could run the pipeline with hadoop:&lt;pre&gt;$HADOOP_HOME/bin/hadoop&lt;br /&gt; jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar&lt;br 
/&gt; -input snps.txt&lt;br /&gt; -mapper snp_mapper.rb&lt;br /&gt; -reducer snp_reducer.rb&lt;br /&gt; -output ./snp_index/&lt;br /&gt; -file snp_mapper.rb&lt;br /&gt; -file snp_reducer.rb&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;...and lo-and-behold: the snp_index directory contains the result I want.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;Parting thoughts&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;This is the very first time I've used mapreduce so the code is far from optimized. Note that the code above is &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; a reflection of my programming skills :-) I just wanted to focus on the hadoop pipeline rather than the code. Actually, running this thing on my own laptop took almost 4 hours and nearly brought it to its knees. It certainly beat it up badly: that hash in the reducer script is just way too big... This little mapreduce experiment is finished, but I might have another go later and optimize the mapper and reducer scripts (getting rid of that &lt;span style="font-style: italic;"&gt;huge&lt;/span&gt; hash), or create some intermediate reducer step.&lt;br /&gt;&lt;br /&gt;Unfortunately it's only using one compute node (my own laptop) so I couldn't check how fast it can be. Running a small subset on Amazon Elastic MapReduce shows that that should work as well.&lt;br /&gt;&lt;br /&gt;I only kept track of the count as an aggregate in the nodes. 
In case I wanted to calculate sum, min and max, I'd probably use a json-formatted string in the key/value pairs:&lt;pre&gt;{"count": 4, "sum": 17, "min": 2, "max": 12}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I will definitely use mapreduce in the future, given the premise that my work would either install it on a cluster locally or we get an institute account on AWS.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-2572684368689407602?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2572684368689407602'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2572684368689407602'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/09/trying-out-mapreduce.html' title='Trying out mapreduce'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3605/3427853501_a06b608439_t.jpg' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-5829492768640081519</id><published>2009-07-08T19:18:00.015+02:00</published><updated>2009-07-10T10:51:09.863+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='pARP'/><title type='text'>First test release of circular genome browser</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SlXIA27X9_I/AAAAAAAACxg/qQs7YrYcE84/s1600-h/parp20090708.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 262px;" src="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SlXIA27X9_I/AAAAAAAACxg/qQs7YrYcE84/s400/parp20090708.png" alt="" id="BLOGGER_PHOTO_ID_5356407248897177586" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Worked a couple of days on &lt;a href="http://wiki.github.com/jandot/parp"&gt;pARP&lt;/a&gt;, the circular genome browser, and I think it's ready to be tested out by others. Consider this an alpha release: expect a &lt;span style="font-style: italic;"&gt;lot&lt;/span&gt; of issues. It's easy to create regions with a negative length, for example. Also, I haven't focused yet on user-friendliness or general input files. Ways of interaction are not made clear to new users yet and the input files still need to have fixed names and be stored in a particular folder.&lt;br /&gt;&lt;br /&gt;pARP is designed to be a &lt;span style="font-weight: bold;"&gt;genome browser for features that are linked to other features&lt;/span&gt; on a genome (e.g. readpair mappings). Using a circular display, lines can be drawn connecting these features.&lt;br /&gt;&lt;br /&gt;pARP always shows the &lt;span style="font-weight: bold;"&gt;whole genome&lt;/span&gt;. You can zoom into selected regions but the rest is still shown, albeit squeezed together a bit more. The reason for this is that I want to show the &lt;span style="font-weight: bold;"&gt;context&lt;/span&gt; at all times. Suppose you'd zoom into two regions A and B that are linked by a large number of readpairs. If the part of the genome that is &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; A or B is not shown, any readpair that has only one of its reads in A or B will simply not be shown. 
By showing the whole genome, even squeezed in a few pixels, you can at least see that some reads are linked outside of A and B.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SlXhj0O6ioI/AAAAAAAACxw/GsBBwg44iKQ/s1600-h/keeping_context.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 274px;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SlXhj0O6ioI/AAAAAAAACxw/GsBBwg44iKQ/s400/keeping_context.png" alt="" id="BLOGGER_PHOTO_ID_5356435337259944578" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I've put some information on the &lt;a href="http://wiki.github.com/jandot/parp"&gt;github wiki page&lt;/a&gt;, such as how to interact and what the datafiles should look like.&lt;br /&gt;&lt;br /&gt;For a little taste: here's a very brief screencast:&lt;br /&gt;&lt;object width="425" height="344"&gt;&lt;param name="movie" value="http://www.youtube.com/v/4aAuBTNuw1M&amp;amp;hl=en&amp;amp;fs=1&amp;amp;rel=0"&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;embed src="http://www.youtube.com/v/4aAuBTNuw1M&amp;amp;hl=en&amp;amp;fs=1&amp;amp;rel=0" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;A lot of things still need to happen:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Catch a lot of edge cases&lt;/li&gt;&lt;li&gt;Incorporate a library for fast loading of features (i.e. LocusTree, which doesn't exist yet)&lt;/li&gt;&lt;li&gt;Make interaction more straightforward: use mouse for panning/zooming for example&lt;/li&gt;&lt;li&gt;About 1,472 other things that I currently forget&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Also: I'm looking for a &lt;span style="font-weight: bold;"&gt;new name&lt;/span&gt; for pARP. 
pARP stands for "processing abnormal readpairs" (which is what it was meant for originally), but it's actually just a genome browser using a circular representation to show linked features. Suggestions I already got are &lt;span style="font-style: italic;"&gt;encircle&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;SqWheel&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;Squeal&lt;/span&gt; (the last two based on sequence-wheel; Squeal was my own idea, so I like that most at the moment :-) ).&lt;br /&gt;&lt;br /&gt;A very, very big thanks goes to &lt;span style="font-weight: bold;"&gt;Jeremy Ashkenas&lt;/span&gt;, the author of ruby-processing. With pARP I have been pushing the boundaries of what that library does, and he has adapted it for my needs as I went. See &lt;a href="http://github.com/jashkenas/ruby-processing"&gt;here&lt;/a&gt; for his ruby-processing library. Other thanks go to my colleagues Erin, Klaudia, Jon, Nelo and Chris for their ideas.&lt;br /&gt;&lt;br /&gt;pARP can be downloaded or cloned from &lt;a href="http://github.com/jandot/parp"&gt;github&lt;/a&gt;. 
Mac, Windows and linux are available there as well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-5829492768640081519?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='enclosure' type='video/mp4' href='http://www.blogger.com/video-play.mp4?contentId=bfe6bc95d2476da2&amp;type=video%2Fmp4' length='0'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5829492768640081519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5829492768640081519'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/07/first-test-release-of-circular-genome.html' title='First test release of circular genome browser'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SlXIA27X9_I/AAAAAAAACxg/qQs7YrYcE84/s72-c/parp20090708.png' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-6265994348738086145</id><published>2009-04-16T17:33:00.017+02:00</published><updated>2009-04-28T12:44:17.346+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='locustree'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>LocusTree - searching genomic loci</title><content type='html'>&lt;span style="font-weight: bold;"&gt;"Contigs should not know where 
they are."&lt;/span&gt; That's a phrase uttered by James Bonfield when presenting his work on gap5, the successor to &lt;a href="http://staden.sourceforge.net/manual/gap4_unix_toc.html"&gt;gap4&lt;/a&gt;, a much-used assembly software suite. So you think: "Wait a second: you're talking about &lt;span style="font-style: italic;"&gt;assembly&lt;/span&gt;, and the contigs should not store their position?"&lt;br /&gt;&lt;br /&gt;This statement addresses a problem that we encounter often when working with genomic data: &lt;span style="font-weight: bold;"&gt;how to handle features&lt;/span&gt;. The approach often used is to &lt;span style="font-weight: bold;"&gt;give the feature a 'chromosome', 'start position' and 'stop position'&lt;/span&gt;. Seems reasonable, right? So if you want all features on chromosome 1 between positions 6,124,627 and 6,827,197 you just loop over all features and check if their range overlaps with this query range. Indeed: seems reasonable. Unless your collection of features goes into the millions. Suppose you have arrayCGH data with a resolution of 1 datapoint per 50 basepairs (the green/red bars in the picture). If you want to search this region, you can focus on chromosome 1 (if you were smart enough to create different collections per chromosome) and start comparing the start and stop position of each feature from the beginning. Chromosome 1 is almost 250Mb so it would contain about 5 million array probes. 
That's a lot of features to check.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/SedSqX-1A7I/AAAAAAAACGk/8szTNF1NRs8/s1600-h/locustree_1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 54px;" src="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/SedSqX-1A7I/AAAAAAAACGk/8szTNF1NRs8/s400/locustree_1.png" alt="" id="BLOGGER_PHOTO_ID_5325315972334420914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;What James alluded to was that you should &lt;span style="font-weight: bold;"&gt;retain the position information at a higher level&lt;/span&gt;. He uses an adaptation of an &lt;a style="font-weight: bold;" href="http://en.wikipedia.org/wiki/R-tree"&gt;R-Tree&lt;/a&gt;, which is a data structure that bins all raw data into bins of a certain size (say 25 elements). Those bins themselves are binned again into larger bins, and so forth until only one bin remains: the root. Each bin knows its start and stop position relative to the bin it's contained in. To search for the same region as above, you start at the root bin which (by definition) overlaps with that query. So you go to the next level and check each of the child bins of root. You can reject those that don't overlap with your range (the red bins) and only focus on the ones that do. 
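&lt;br /&gt;&lt;br /&gt;A minimal sketch of that descent in ruby, under simplifying assumptions: bins hold absolute rather than relative coordinates, the fan-out is 2 instead of 25, and Feature, Bin, build_tree and search are hypothetical names (this is not gap5 or LocusTree code).

```ruby
Feature = Struct.new(:name, :start, :stop)
Bin     = Struct.new(:start, :stop, :children)

# Group the sorted features into bins of at most `size` members, then bin
# those bins again, until a single root bin remains.
def build_tree(features, size = 2)
  raise ArgumentError, "need at least one feature" if features.empty?
  level = features.sort_by { |f| f.start }
  loop do
    level = level.each_slice(size).map do |group|
      Bin.new(group.map { |g| g.start }.min, group.map { |g| g.stop }.max, group)
    end
    return level.first if level.size == 1
  end
end

# Walk down from the root, skipping every bin that does not overlap the
# query range from..to, and collect the features at the bottom.
def search(node, from, to)
  return [] unless (to >= node.start and node.stop >= from)
  return [node] if node.is_a?(Feature)
  node.children.flat_map { |child| search(child, from, to) }
end

# Eight made-up probes of 50 bp each: probe0 covers 1..50, probe1 51..100, ...
features = (0...8).map { |i| Feature.new("probe#{i}", i * 50 + 1, i * 50 + 50) }
root = build_tree(features, 2)
puts search(root, 120, 260).map { |f| f.name }.join(",")
# prints probe2,probe3,probe4,probe5
```

Whole subtrees are discarded with one comparison each, which is where the gain over a feature-by-feature scan comes from.&lt;br /&gt;&lt;br /&gt;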
Continue until you reach the bottom-most layer which contains your actual features.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SedSv8tYNqI/AAAAAAAACGs/q2NE4YrOptQ/s1600-h/locustree_2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 176px;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SedSv8tYNqI/AAAAAAAACGs/q2NE4YrOptQ/s400/locustree_2.png" alt="" id="BLOGGER_PHOTO_ID_5325316068092688034" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This has two advantages for my work on &lt;a href="http://github.com/jandot/parp"&gt;pARP&lt;/a&gt;, the visualization tool for aberrant read pair mappings. First of all, it's faster to get the features in a given region. A major feature of the pARP tool is that you can zoom into any region. With more than 6 million values for readdepth, &lt;span style="font-weight: bold;"&gt;speed&lt;/span&gt; is a big issue. But that's not all. Think about this: if I have a screen &lt;span style="font-weight: bold;"&gt;resolution&lt;/span&gt; of 1200 pixels wide, why would I try and plot 6 million points next to each other? Twelve-hundred would be enough, wouldn't it? R-Tree to the rescue again. In building the tree, each bin stores some &lt;span style="font-weight: bold;"&gt;aggregate data&lt;/span&gt; from its members; for example the average readdepth. To draw the readdepth across the whole genome, I only have to start at the root of the LocusTree and go down to the level that contains 1200 bins. No need to load the complete dataset. At the moment LocusTree bins only take the average of whatever value their members have, but this can be expanded to contain any type of aggregation (e.g. 
number of objects).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/Sedmhj1Ui_I/AAAAAAAACG0/kvsnSklPjtw/s1600-h/parp.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 257px;" src="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/Sedmhj1Ui_I/AAAAAAAACG0/kvsnSklPjtw/s400/parp.png" alt="" id="BLOGGER_PHOTO_ID_5325337811129502706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;LocusTree is not the fastest library around (being written in ruby, and by me), but I think it does what I need it to do. This means that it might not do exactly what &lt;span style="font-style: italic;"&gt;you&lt;/span&gt; need it to do. But the code is at &lt;a href="http://github.com/jandot/locustree"&gt;github&lt;/a&gt;, so feel free to fork and improve. I did add a list of to-dos at the bottom of the README file...&lt;br /&gt;&lt;br /&gt;UPDATE: I have now realized that - due to DataMapper - this library does not work under jruby, which is necessary for pARP which uses ruby-processing. 
A better approach than the above would be to have C-based algorithms combined with an API, so that we don't need DataMapper.&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-6265994348738086145?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/6265994348738086145/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2009/04/locustree-searching-genomic-loci.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6265994348738086145'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6265994348738086145'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/04/locustree-searching-genomic-loci.html' title='LocusTree - searching genomic loci'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_t6Ob1J7aZ0A/SedSqX-1A7I/AAAAAAAACGk/8szTNF1NRs8/s72-c/locustree_1.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4407078898608621623</id><published>2009-03-08T21:28:00.011+01:00</published><updated>2009-03-08T23:04:35.692+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='structural variation'/><category scheme='http://www.blogger.com/atom/ns#' term='deBruijn'/><title type='text'>The good and bad of genome viewers</title><content type='html'>Back before the human genome was fully sequenced and NCBI, UCSC and Ensembl started working on visualization, it made a lot of sense to go for linear representations and use tracks for annotation. After all: &lt;span style="font-weight: bold;"&gt;chromosomes are linear&lt;/span&gt;. Using different tracks to show different types of annotation is the next logical step.&lt;br /&gt;&lt;br /&gt;But there is not just one human genome on earth; according to Wikipedia there are about 6.76 billion copies as of March 2009. So instead of talking about "the human genome" in those browsers, we talk about "&lt;span style="font-weight: bold;"&gt;the reference genome&lt;/span&gt;". Each person on earth is different, and so is each human genome. (That's putting the reasoning on its head, but never mind.)&lt;br /&gt;&lt;br /&gt;Differences between humans such as SNPs and microsatellites can still be shown in the track-based browsers.&lt;br /&gt;&lt;br /&gt;Things get more difficult when you're looking at &lt;span style="font-weight: bold;"&gt;structural variation&lt;/span&gt;. 
Structural variation messes up the backbone of the linear genome browser: you can't show differences between individuals in one straight line. Suppose you want to investigate a copy-number variation (CNV) and consult UCSC. You'd find tracks such as this:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SbQyogQI33I/AAAAAAAABzk/023_gKAGBHA/s1600-h/chr17_cnv-1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 138px;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SbQyogQI33I/AAAAAAAABzk/023_gKAGBHA/s320/chr17_cnv-1.png" alt="" id="BLOGGER_PHOTO_ID_5310925532010438514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Although this does give you quite some information on the CNV in question, it's &lt;span style="font-weight: bold;"&gt;not an adequate representation&lt;/span&gt; of what the different alleles actually look like. It also highlights another issue: the concept of "the reference genome". As more and more genomes are getting sequenced, is the one that was picked first the best for visualization and indeed, the reference? 
To be able to handle the different MHC haplotypes in Ensembl, for example, the database contains a table called "&lt;span style="font-style: italic;"&gt;assembly_exceptions&lt;/span&gt;" that contains the alternative assemblies for each haplotype.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SbQ1CbLlHpI/AAAAAAAABzs/Etnleo5jUcI/s1600-h/chr6_mhc.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 29px;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SbQ1CbLlHpI/AAAAAAAABzs/Etnleo5jUcI/s320/chr6_mhc.png" alt="" id="BLOGGER_PHOTO_ID_5310928176348995218" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I believe that further down the line (although it might be quite a while) we might need to &lt;span style="font-weight: bold;"&gt;forget the whole notion of a reference genome&lt;/span&gt;. Two options come to mind. First of all, we could create an &lt;span style="font-weight: bold;"&gt;artificial reference&lt;/span&gt; that contains all sequence and let each real sequence we want to look at, well, reference that artificial assembly. That would mean that the different MHC haplotypes, for example, would all be in the same sequence. Similarly, copy-number variants containing, let's say, 3 to 8 copies would include all 8 in the mock assembly. Unfortunately this still cannot cover structural variation like inter-chromosomal translocations. We can't build a single artificial assembly that would incorporate those. So here's the alternative: &lt;span style="font-weight: bold;"&gt;deBruijn graphs&lt;/span&gt;. Instead of creating a single linear representation of a reference, let's just not. We could use building blocks to build up each individual. 
Take a look at this picture:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SbQ9nIjr0_I/AAAAAAAABz0/39jmQbnze_0/s1600-h/cnv_1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 254px;" src="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SbQ9nIjr0_I/AAAAAAAABz0/39jmQbnze_0/s400/cnv_1.png" alt="" id="BLOGGER_PHOTO_ID_5310937603098006514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Suppose that each block is a part of a chromosome and the red and blue lines represent the path to follow to build up the chromosome for a particular individual. In this picture the red individual misses a part of that chromosome that is present in the blue individual, and another part is inverted. Notice that we don't make any (arbitrary) decision on what is the reference sequence. By dragging the blocks we can either place all red connections on one line or all blue ones, making them look like a reference.&lt;br /&gt;&lt;br /&gt;If we'd then add annotations to this picture like genes, we'd be able to display &lt;span style="font-weight: bold;"&gt;fusion genes&lt;/span&gt;. Suppose that the densely-striped block is on chromosome 7 in the red individual but on chromosome 12 in the blue one. 
If there's a gene on the right breakpoints we end up with a fusion gene.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_t6Ob1J7aZ0A/SbQ_Wpc-lyI/AAAAAAAABz8/6k2ZOgtYEhM/s1600-h/cnv_2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 158px;" src="http://3.bp.blogspot.com/_t6Ob1J7aZ0A/SbQ_Wpc-lyI/AAAAAAAABz8/6k2ZOgtYEhM/s400/cnv_2.png" alt="" id="BLOGGER_PHOTO_ID_5310939518893725474" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Time permitting I'm going to investigate how useful this will be in projects like CNVs in the 1000genomes project.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4407078898608621623?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/4407078898608621623/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2009/03/good-and-bad-of-genome-viewers.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4407078898608621623'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4407078898608621623'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/03/good-and-bad-of-genome-viewers.html' title='The good and bad of genome viewers'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail 
xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SbQyogQI33I/AAAAAAAABzk/023_gKAGBHA/s72-c/chr17_cnv-1.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-9042852905641075566</id><published>2009-01-13T15:20:00.001+01:00</published><updated>2009-01-27T15:21:59.237+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='science'/><title type='text'>Who-o-o are you? Who who? Who who?</title><content type='html'>&lt;span class="zemanta-img" style="margin: 1em; float: right; display: block;"&gt;&lt;a href="http://www.flickr.com/photos/82096028@N00/2380572683"&gt;&lt;img src="http://farm4.static.flickr.com/3185/2380572683_c6b82099f1_m.jpg" alt="Identity Card - National Registration" style="border: medium none ; display: block;" height="240" width="153"&gt;&lt;/a&gt;&lt;span class="zemanta-img-attribution"&gt;Image by &lt;a href="http://www.flickr.com/photos/82096028@N00/2380572683"&gt;Danny McL&lt;/a&gt; via Flickr&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There’s been quite a lot of discussions going on lately about &lt;strong&gt;author identification&lt;/strong&gt;: Raf Aerts’ correspondence piece in Nature (doi:10.1038/453979b), &lt;a href="http://friendfeed.com/e/c1fd00ec-15f9-d894-4ea9-4ffeaac5ae28/A-specialist-OpenID-service-to-provide-unique/"&gt;discussions&lt;/a&gt; on FriendFeed, ... The issue is that it can be hard to identify who the actual author of a paper is if their name is very &lt;strong&gt;common&lt;/strong&gt;. If your name is Gudmundur Thorisson (“hi, mummi”) you’re in luck. But if you are a Li Y, Zhang L or even an Aerts J it’s a bit harder. Searching PubMed for “Aerts J” returns 299 papers. I surely don’t remember writing that many. 
I wish… So if a future employer searched PubMed for my name, they would not get a list of my papers, but a list of papers by authors who share my name. Also, some of my papers mention jan.aerts@bbsrc.ac.uk as the &lt;strong&gt;contact email&lt;/strong&gt;. Well: you’re out of luck, I’m afraid. That email address doesn’t exist anymore because I changed jobs.&lt;br /&gt;&lt;br /&gt;The idea is to create a unique ID for each author similar to the &lt;a href="http://www.crossref.org/"&gt;doi&lt;/a&gt; (“digital object identifier”) for a paper. Thomson Reuters have created &lt;a href="http://www.researcherid.com"&gt;ResearcherID&lt;/a&gt;, but because doi’s are handled through the not-for-profit CrossRef, let’s call the unique author ID a &lt;strong&gt;dsi&lt;/strong&gt; (&lt;strong&gt;“digital scientist identifier”&lt;/strong&gt;). This dsi can then be used by that scientist to identify themselves wherever they need to.&lt;br /&gt;&lt;br /&gt;Here I’ll try to explain how I think this could work.&lt;br /&gt;&lt;br /&gt;But first of all: what are the &lt;strong&gt;prerequisites&lt;/strong&gt; for a dsi-based environment? Obviously, the &lt;strong&gt;journals&lt;/strong&gt; would need to request the dsi of authors on submission rather than just their names and email addresses. They would be able to get names and email addresses through the dsi. And secondly, we need &lt;strong&gt;a service&lt;/strong&gt; that assigns dsi’s and where scientists can update their details and add information.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;The service/website&lt;/h2&gt;&lt;br /&gt;Let there be a website (for argument’s sake http://www.dsi.org) that &lt;strong&gt;assigns new dsi’s&lt;/strong&gt; to new authors (only one dsi per author). So I could for example be dsi.12345. This service should have additional functionality such as a list of contributions, curriculum vitae, contact details, network. 
It should also provide a homepage or profile page for each scientist listing at least the name, affiliation and literature list (i.e. what you would get from a PubMed search). So if you’d go to &lt;strong&gt;http://www.dsi.org/dsi.12345&lt;/strong&gt; you’d see at least my name, the address of the institute where I work and a list of papers that I co-authored.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Getting a dsi&lt;/h3&gt;&lt;br /&gt;It’s critical that &lt;strong&gt;one researcher only gets one dsi&lt;/strong&gt;. This is less than straightforward because I believe many researchers will not be interested enough in the whole identity story to even remember if they already had a dsi or not. So if I were to go to the dsi website and request an ID, the website would ask for my name first. It’d also ask if I used different names in author lists (e.g. I’m a woman, got married and started using my married name instead of my maiden name). Using that information the service would then search PubMed for papers that are authored by someone with my name (who might be me). It could present that list to me and ask if I’m actually that same person or not. This way we’d build up a &lt;strong&gt;minimal list of papers&lt;/strong&gt;. That minimal list would then be &lt;strong&gt;checked against the dsi database&lt;/strong&gt; to see if there isn’t already someone with my name who has claimed these papers. Logically that person would be me and it would appear that I already have a dsi. If no dsi has this name and these papers associated, the new dsi can be assigned.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Contributions&lt;/h3&gt;&lt;br /&gt;A central service like this would be ideal for collaborators and possible employers to find out about contributions of a specific researcher to science. Instead of asking for author names and emails (the latter change over time anyway), a journal would ask for the dsi of all authors. 
If the paper gets accepted that &lt;strong&gt;journal would notify the dsi service&lt;/strong&gt; to add that paper to the researcher’s publication list. But it goes further than just the papers. It’s a shame that researchers virtually only get marks for their published papers (&lt;strong&gt;Publish or Perish&lt;/strong&gt;) and not for other contributions to scientific research. What about people who submit data to genome annotation databases? What about contributions to discussion in comments to blog posts, FriendFeed, ...? Setting up public databases? Writing APIs for scientific data? Think of a browser-button with which you could &lt;strong&gt;sign certain contributions&lt;/strong&gt; – &lt;strong&gt;anywhere&lt;/strong&gt;. Signing a contribution would add a link in your list of contributions in the dsi system.&lt;br /&gt;&lt;br /&gt;It should obviously be possible to log into the dsi system and edit or remove contributions that you made. That one little &lt;span class="caps"&gt;API&lt;/span&gt; you wrote 5 years ago seemed so important then but you’ve come to see it as insignificant now, for example.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Contact details&lt;/h3&gt;&lt;br /&gt;People change employer, email, address and even name. So there’s a &lt;strong&gt;problem inherent in only listing email address and institute on a paper&lt;/strong&gt;. Using the unique dsi for the authors would always point to that researcher no matter how many times he or she changed jobs or contact information. When a researcher has his contact details changed he would log onto the dsi service (we’ll come to this later) and change those data. Other people would then see those details on the researcher’s dsi page (http://www.dsi.org/dsi.12345), or, if the researcher wants to keep them hidden, send a message through the dsi service itself. 
The researcher’s email address does not have to be visible to the outside world.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Network&lt;/h3&gt;&lt;br /&gt;Even though you might not want to make your email address visible for the whole world, you wouldn’t mind if the people you know would see it. Your network. I think that a dsi service should contain capabilities like those from &lt;a href="http://www.linkedin.com"&gt;LinkedIn&lt;/a&gt;. You should be able to &lt;strong&gt;build a trusted network&lt;/strong&gt; (with people that you know well). This network is another important pillar in your contribution to science.&lt;br /&gt;&lt;br /&gt;There would ideally be &lt;strong&gt;different personas&lt;/strong&gt; you could set for your profile. The default would for example be that your profile page would only show your name and papers. But you might also have a full profile that is only visible to researchers who are logged into the service and are not further than two steps away in your network. That extended profile might show your contact details (including email), contributions outside of papers (e.g. comments on blog posts) and curriculum vitae.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;OpenID&lt;/h3&gt;&lt;br /&gt;The above explains how I would like to see the issue of &lt;strong&gt;identification&lt;/strong&gt; solved. But there is also the problem of &lt;strong&gt;authentication&lt;/strong&gt;. How do I prove that I am dsi.12345? Ideally the dsi service would be an OpenID provider so that it lets me prove that I &lt;em&gt;own&lt;/em&gt; http://www.dsi.org/dsi.12345. Hopefully more and more websites (biomedcentral, nature, ...) 
would allow logging in using OpenID.&lt;br /&gt;&lt;br /&gt;Apart from serving as an OpenID provider, the dsi service should obviously also be an OpenID consumer so I don’t have to remember another username and password but can use http://jandot.myopenid.com or http://saaientist.blogspot.com to log in.&lt;br /&gt;&lt;br /&gt;I hope this gives a little bit of an idea of the environment I hope we’ll move to. Any comments welcome. Any progress even more…&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-9042852905641075566?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/9042852905641075566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2009/01/who-o-o-are-you-who-who-who-who.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9042852905641075566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9042852905641075566'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/01/who-o-o-are-you-who-who-who-who.html' title='Who-o-o are you? Who who? 
Who who?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3185/2380572683_c6b82099f1_t.jpg' height='72' width='72'/><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-2308164044722615397</id><published>2009-01-06T18:18:00.027+01:00</published><updated>2009-06-16T10:45:22.847+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='pARP'/><category scheme='http://www.blogger.com/atom/ns#' term='structural variation'/><title type='text'>To find structural variation, look at read pairs: introducing pARP</title><content type='html'>Nextgen sequencing is making a huge impact on how research is done in the genomics field. One of the ways to &lt;span style="font-weight: bold;"&gt;discover structural variants&lt;/span&gt; in a genome for example is to create a &lt;span style="font-weight: bold;"&gt;clone library&lt;/span&gt; for an individual, sequence the ends of those clones and then map those ends to the reference genome. Suppose that the clones in the library are all 150kb large, then we would expect the ends of each clone to be mapped about 150kb from each other on that reference genome, in a forward/reverse direction. Any read pair that does not follow this pattern, might indicate a structural variation. 
There are of course numerous spurious mapping results, so we need to ignore those.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_t6Ob1J7aZ0A/SWSNmgjehZI/AAAAAAAABsA/PvRRmXfmaZY/s1600-h/drawing.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 292px; height: 320px;" src="http://3.bp.blogspot.com/_t6Ob1J7aZ0A/SWSNmgjehZI/AAAAAAAABsA/PvRRmXfmaZY/s320/drawing.png" alt="" id="BLOGGER_PHOTO_ID_5288507555153085842" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Suppose that the resulting data look like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1    1016287    1    1025027     FF     10&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1    54809626   1    54814724    RR     20&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1    65970649   1    67123551    DIST   32&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1    143840263  1    143841351   RR     34&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;1    241524162  16   298176281   DIST   36&lt;/span&gt;&lt;br /&gt;&lt;/pre&gt;First two columns are the position of the first read from the pair; third and fourth columns refer to the second read from the pair. Fifth column is FF, RR or DIST: forward-forward, reverse-reverse or distance (i.e. &gt;&gt; 150kb). The last column is some arbitrary quality score assigned to the mapping of this read pair. Notice that the last of these lines shows a readpair where one end is mapped on chr1 and the other is mapped to chr16.&lt;br /&gt;&lt;br /&gt;We can do two things: analyze and then create a picture, or &lt;span style="font-weight: bold;"&gt;create a picture and then interpret&lt;/span&gt; (see also one of my &lt;a href="http://saaientist.blogspot.com/2008/11/visualize-or-summarize.html"&gt;previous post&lt;/a&gt;s). 
In the first approach, you'd run a statistical analysis to see if certain regions have a higher prevalence of abnormally mapped read pairs. In the second, you plot the raw data and try to identify abnormalities by eye. Of course ideally you switch between both approaches.&lt;br /&gt;&lt;br /&gt;To visualize raw read pair information I've written a tool called &lt;span style="font-weight: bold;"&gt;pARP&lt;/span&gt; (Processing Abnormal ReadPairs), which is available from &lt;a href="http://github.com/jandot/parp"&gt;github&lt;/a&gt;. It's very similar to the display used in [edited] &lt;a href="http://genome.cshlp.org/content/early/2008/12/09/gr.080259.108.abstract"&gt;this paper&lt;/a&gt; by Hampton et al. to display structural variation using &lt;a href="http://mkweb.bcgsc.ca/circos/"&gt;Circos&lt;/a&gt; (see picture, taken from the Circos website). But instead of just creating a static picture, pARP is meant to be an interactive tool to browse the data.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://mkweb.bcgsc.ca/circos/images/circos-conservation-small.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 800px; height: 693px;" src="http://mkweb.bcgsc.ca/circos/images/circos-conservation-small.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Below is a screenshot of pARP running on some test data. 
It doesn't look as nice as the above image, but remember that this is interactive and thus doesn't have minutes to calculate everything.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SWSGWJ_jJoI/AAAAAAAABr4/VkylpTq-miM/s1600-h/parp.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 207px;" src="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SWSGWJ_jJoI/AAAAAAAABr4/VkylpTq-miM/s400/parp.png" alt="" id="BLOGGER_PHOTO_ID_5288499577637512834" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Some of the &lt;span style="font-weight: bold;"&gt;features&lt;/span&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;pARP can display abnormal readpairs (forward/forward, reverse/reverse or wrong distance), read depth and other features (e.g. segmental duplications).&lt;/li&gt;&lt;li&gt;Circular display gives overview of between-chromosome mapped readpairs.&lt;/li&gt;&lt;li&gt;Chromosomes can be dragged from the circular display to the upper or lower linear display to show (a) more detail and (b) within-chromosome aberrant readpairs (note: none in the image above).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Visible readpairs can be filtered by quality score.&lt;/li&gt;&lt;li&gt;Readpairs that are close to the mouse position are highlighted.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Prefiltering&lt;/span&gt; of the data should be minimal, and only focussed on getting the amount of data down. 
For example, the readpair data file &lt;span style="font-style: italic;"&gt;could&lt;/span&gt; contain all normal readpair mappings, but getting rid of those just makes the display much more visually clear and reduces the amount of data to be loaded by several orders of magnitude (obviously...).&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;version just released&lt;/span&gt; (tagged &lt;span style="font-weight: bold;"&gt;v0.8&lt;/span&gt;) is workable, but not ready for prime time yet. At this moment the user has to run the tool using jruby instead of just loading it as an applet. Also the filenames to be loaded have to be changed in the parp.rb code itself. I hope to add functionality so that you can upload your own data into an applet, or use a URI to link to it. But I can't promise anything because other work is waiting. So here's also a &lt;span style="font-weight: bold;"&gt;call for help&lt;/span&gt;: if you're interested in contributing, please do! There's a "features-yet-to-be-implemented" list further down.&lt;br /&gt;&lt;span id="profile_name" rel="/users/jashkenas" class=""&gt;&lt;br /&gt;Features &lt;span style="font-weight: bold;"&gt;not yet implemented&lt;/span&gt;:&lt;br /&gt;&lt;/span&gt;&lt;ul&gt;&lt;li&gt;pARP should be available as an applet/application.&lt;/li&gt;&lt;li&gt;User should be able to point to files or URIs representing files instead of changing filenames in the code itself.&lt;/li&gt;&lt;li&gt;Saving an image to disk (also from the applet).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Further performance improvements.&lt;/li&gt;&lt;li&gt;Fixing of not-yet-identified-but-definitely-present bugs.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;And now for some &lt;span style="font-weight: bold;"&gt;technical stuff&lt;/span&gt;. To keep redrawing times low so that the interaction wouldn't suffer too much from the huge amount of data, I had to use a few tricks. 
First of all, pARP makes heavy use of &lt;span style="font-weight: bold;"&gt;buffers&lt;/span&gt;. Different parts of the image are stored on different buffers. When the user interacts with the display, only the relevant buffers are updated while the others are untouched. For more info, see the &lt;a href="http://wiki.github.com/jandot/parp/buffers"&gt;github wiki page&lt;/a&gt; on the subject. Secondly, I've found out how to use ruby threads to &lt;span style="font-weight: bold;"&gt;load some data asynchronously&lt;/span&gt;. In particular the readdepth data can be a huge hog on performance; there are &gt;6 million datapoints for a genome window size of 500bp. So what happens is that (a) readdepth data for a chromosome is only loaded when that chromosome is displayed in the linear part of the image, and (b) the readdepth data is drawn onto a separate buffer that is only displayed when the thread is finished.&lt;br /&gt;&lt;br /&gt;Many &lt;span style="font-weight: bold;"&gt;thanks&lt;/span&gt; to:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Ben Fry and Casey Reas for &lt;a href="http://processing.org/"&gt;Processing&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;span id="profile_name" rel="/users/jashkenas" class=""&gt;Jeremy Ashkenas for the &lt;a href="http://github.com/jashkenas/ruby-processing"&gt;ruby API&lt;/a&gt; to Processing&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span id="profile_name" rel="/users/jashkenas" class=""&gt;&lt;br /&gt;Update: reference changed for Circos picture&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-2308164044722615397?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/2308164044722615397/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://saaientist.blogspot.com/2009/01/to-find-structural-variation-look-at.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2308164044722615397'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2308164044722615397'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2009/01/to-find-structural-variation-look-at.html' title='To find structural variation, look at read pairs: introducing pARP'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_t6Ob1J7aZ0A/SWSNmgjehZI/AAAAAAAABsA/PvRRmXfmaZY/s72-c/drawing.png' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-8606252350076167110</id><published>2008-11-12T18:25:00.012+01:00</published><updated>2008-11-13T09:10:57.000+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='graphics'/><title type='text'>Visualize or summarize?</title><content type='html'>&lt;span class="zemanta-img"&gt;&lt;a href="http://www.flickr.com/photos/51035756584@N01/123614933"&gt;&lt;img src="http://farm1.static.flickr.com/28/123614933_b2daa0b4fe_m.jpg" alt="Visualization of my del.icio.us bookmarks" style="border: medium none ; display: block;" /&gt;&lt;/a&gt;&lt;span class="zemanta-img-attribution"&gt;Image by &lt;a 
href="http://www.flickr.com/photos/51035756584@N01/123614933"&gt;Kaeru&lt;/a&gt; via Flickr&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;I've recently started using raw visualizations to get an idea of what data looks like rather than writing scripts to summarize. And what I found is that presenting data visually in a raw format might be more useful than condensing everything down into just a few numbers. Trouble is that you need to know what you expect and make assumptions if you want to analyze the data. The &lt;b&gt;best tool you have for identifying trends or non-randomness is yourself&lt;/b&gt;, not R or a scripting language.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Bought the &lt;a href="http://www.amazon.co.uk/Visualizing-Data-Explaining-Processing-Environment/dp/0596514557/ref=sr_1_1?ie=UTF8&amp;s=books&amp;qid=1226513378&amp;sr=1-1"&gt;Visualizing Data&lt;/a&gt; book by Ben Fry to help me with this. It explains how to use the &lt;a href="http://processing.org"&gt;Processing&lt;/a&gt; language to present data in a meaningful way. As far as I understand, Processing is a wrapper around the java language so that it becomes much more intuitive to use for simple people like me. The language is so easy that there was only a very small learning curve for me, even though I didn't know anything about java other than that it's an island and a coffee. In several of my projects I now start with writing a simple processing script and then throw all my data at it. 
&lt;b&gt;No assumptions made.&lt;/b&gt; The fact that it's easy to &lt;b&gt;interact&lt;/b&gt; with a display using mouse or keyboard makes it even more useful.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The Processing code editor makes creating a java applet or application a matter of one click, so it's easy to make your displays available to other people.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's only &lt;i&gt;after&lt;/i&gt; having a look at the data that I write analysis scripts or help other people in deciding how to analyze/summarize. That analysis can then help, for example, to make a more &lt;b&gt;opinionated display&lt;/b&gt;. So it's display, analyze, display, analyze, ... as in a hermeneutic circle.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;One issue with this approach is that you have to be able to think of a &lt;b&gt;meaningful&lt;/b&gt; display. And I must say that's often (but not always) the more difficult bit. I started following the RSS feeds of some visualization blogs like &lt;a href="http://flowingdata.com"&gt;FlowingData&lt;/a&gt; as well as the &lt;a href="http://processing.org"&gt;Processing website&lt;/a&gt; itself to get exposed to different types of visualization, which does help.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update&lt;/b&gt;: Follow the discussion on &lt;a href="http://friendfeed.com/e/b92d006c-cf86-faaa-3af0-70377018443b/Visualize-or-summarize/"&gt;FriendFeed&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-8606252350076167110?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/8606252350076167110/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/11/visualize-or-summarize.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8606252350076167110'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8606252350076167110'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/11/visualize-or-summarize.html' title='Visualize or summarize?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm1.static.flickr.com/28/123614933_b2daa0b4fe_t.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-319916956286145562</id><published>2008-09-13T13:00:00.000+02:00</published><updated>2008-09-13T12:58:43.578+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><title type='text'>Data visualization</title><content type='html'>Today is "Data management, mining, curation and visualization" day at the Genome Informatics conference in Hinxton. It might be one of the more interesting ones for me, because that's what I do: manage, mine, curate and attempt to visualize. And I must say the last bit is the most difficult. It's not difficult to upload results into a genome browser, but is it the best way?&lt;br /&gt;&lt;br /&gt;I say we have to &lt;span style="font-weight: bold;"&gt;break free from the track&lt;/span&gt;. Ninety percent of all visualizations in genomics today are track-based (add DAS tracks to Ensembl, upload BED files to UCSC or run your own gbrowse). 
It's ideal for showing features on a chromosome, but it's used even if it's not the optimal tool (a feature shared with Microsoft Excel, but let's not go there). Why's that? Because that's what there is, and they &lt;span style="font-style: italic;"&gt;do&lt;/span&gt; provide very useful functionality. But at the same time, having them available &lt;span style="font-weight: bold;"&gt;tempers the search&lt;/span&gt; for new and innovative ways of visualizing data. Having a computer at hand doesn't help either, I think: it's just much easier in PowerPoint to draw a collection of squares than a rich multi-faceted picture. That's just more easily done by hand, but that's not what we do, is it?&lt;br /&gt;&lt;br /&gt;One of the &lt;a href="http://www.nature.com/nature/journal/v455/n7209/full/455030a.html"&gt;article&lt;/a&gt;s in Nature's &lt;a href="http://www.nature.com/nature/journal/v455/n7209/index.html"&gt;Big Data issue&lt;/a&gt; calls for artists and &lt;span style="font-weight: bold;"&gt;visualization experts&lt;/span&gt; to be involved before all data are gathered. This idea got quite a few comments on &lt;a href="http://friendfeed.com/e/671f5fbf-b097-7129-d0e0-7a89bd5b0099/We-propose-that-graphic-artists-communicators-and/"&gt;FriendFeed&lt;/a&gt; as well. I do agree with the idea of visualization experts being involved in many projects, but that &lt;span style="font-weight: bold;"&gt;visualization expert should be you&lt;/span&gt;. Well... you don't need to be an &lt;span style="font-style: italic;"&gt;expert&lt;/span&gt;, but still you should have an idea of how to show the gist of your results. I think that's one of the important things that's missing in MSc education (apart from a good introduction to data management): some course in visualization concepts. How do you visualize time-series? How do you visualize differences? 
And what about time-series of differences?&lt;br /&gt;&lt;br /&gt;Small example: I've been asked to think about how to visualize copy-number variations between individuals. The most obvious thing to do is what's used in the UCSC browser and any other track browser: show a box where the variation is. But it's a &lt;span style="font-style: italic;"&gt;variation&lt;/span&gt;, right? So what does this box mean? That some individuals miss that bit? That it's duplicated? Which individuals? Using a track-based genome browser, you &lt;span style="font-style: italic;"&gt;must&lt;/span&gt; make one individual the reference.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SMuZ2GPXJzI/AAAAAAAAAqc/jR9RX7wv__I/s1600-h/ucsc_cnv.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SMuZ2GPXJzI/AAAAAAAAAqc/jR9RX7wv__I/s320/ucsc_cnv.png" alt="" id="BLOGGER_PHOTO_ID_5245455345670104882" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This is not about tools (there's &lt;a href="http://processing.org/"&gt;Processing&lt;/a&gt;), but about a mind shift.&lt;br /&gt;&lt;br /&gt;As it happens, I hope to get hold of a small tablet in the next week or so to replace my mouse and relieve my RSI a bit, so that might be a good opportunity for me to at least explore a bit.&lt;br /&gt;&lt;br /&gt;Keep drawing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-319916956286145562?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/319916956286145562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/11/data-visualization.html#comment-form' title='1 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/319916956286145562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/319916956286145562'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/11/data-visualization.html' title='Data visualization'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_t6Ob1J7aZ0A/SMuZ2GPXJzI/AAAAAAAAAqc/jR9RX7wv__I/s72-c/ucsc_cnv.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-3098848678203533713</id><published>2008-08-26T20:55:00.004+02:00</published><updated>2008-08-26T21:27:05.220+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='git'/><category scheme='http://www.blogger.com/atom/ns#' term='data management'/><title type='text'>Using git to sync server with laptop</title><content type='html'>After &lt;a href="http://saaientist.blogspot.com/2008/06/bioruby-with-git-how-would-that-work.html"&gt;investigating&lt;/a&gt; &lt;span style="font-weight: bold;"&gt;git&lt;/span&gt; for the bioruby project, I started using it on basically every project I run. And what do I use it for? Two things: &lt;span style="font-weight: bold;"&gt;keeping track of changes &lt;/span&gt;(duh) and &lt;span style="font-weight: bold;"&gt;syncing between server and laptop&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;I normally try to persuade IT to let me &lt;span style="font-weight: bold;"&gt;mount&lt;/span&gt; my server Documents folder on my laptop when I'm at work. 
So ~/Documents actually points to my network drive. That's nice, because I don't have to bother with keeping track of several places to store my documents. If I change anything on my network drive, it looks like it's been changed locally. And vice versa.&lt;br /&gt;&lt;br /&gt;But: what if I'm at home (where I work just a bit more than the wife would like)? I can still SSH into the server and do some work, but I can't mount that network drive. So I started creating a &lt;span style="font-weight: bold;"&gt;~/LocalDocuments&lt;/span&gt; folder on my laptop in which I copied any files I needed. But that obviously feels wrong as I now have more than one place to put my files: either on my network drive (~/Documents) or locally on my laptop (~/LocalDocuments).&lt;br /&gt;&lt;br /&gt;...until I started using &lt;span style="font-weight: bold;"&gt;git&lt;/span&gt;...&lt;br /&gt;&lt;br /&gt;When I start a new project, I create a new folder on the server: ~/Documents/Projects/some_new_project. Within that folder, I run "git init" and commit a README file. This creates the git repository. 
Next thing: clone it on the laptop.&lt;br /&gt;&lt;br /&gt;On the server:&lt;br /&gt;&lt;pre name="code" class="bash"&gt;&lt;br /&gt;mkdir /path_to_directory/some_new_project&lt;br /&gt;cd /path_to_directory/some_new_project&lt;br /&gt;git init&lt;br /&gt;touch README&lt;br /&gt;# "git commit -a" skips untracked files, so stage README explicitly&lt;br /&gt;git add README&lt;br /&gt;git commit -m "First commit"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;On the laptop:&lt;br /&gt;&lt;pre name="code" class="bash"&gt;&lt;br /&gt;git clone ssh://my_name@network_server/path_to_directory/some_new_project&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Now I just work in ~/LocalDocuments, commit all changes in my local git repository and (very important:) &lt;span style="font-weight: bold;"&gt;push them back&lt;/span&gt; onto the server.&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="bash"&gt;&lt;br /&gt;git push&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;NB: This setup has already saved me from a not-so-small (well: medium) disaster. For some reason (no coffee yet?) the very first thing I did one morning was log in to the server, go to the project folder I had been working on for about a month, and do a "rm -r -f this_project". Aaargh! 
After wiping away all that cold sweat I realized I only had to clone the repository on my laptop back onto the server.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-3098848678203533713?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/3098848678203533713/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/08/using-git-to-sync-server-with-laptop.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/3098848678203533713'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/3098848678203533713'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/08/using-git-to-sync-server-with-laptop.html' title='Using git to sync server with laptop'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-9210086974877100178</id><published>2008-06-23T12:37:00.022+02:00</published><updated>2008-12-09T19:02:28.242+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioruby'/><category scheme='http://www.blogger.com/atom/ns#' term='git'/><title type='text'>Bioruby with git: how would that work?</title><content type='html'>&lt;em&gt;Disclaimer&lt;/em&gt;: This blog post is the result of several iterations of writing/discussion/rewriting from Anthony Underwood, Michael 
Barton, Matt Wood and myself, with additional help from Paul Thornthwaite.&lt;br /&gt;&lt;em&gt;Disclaimer nr 2&lt;/em&gt;: We are not yet git veterans ourselves, so if you see simpler ways of doing what we describe below (or spot any errors), please let us know so we can update this post and put it onto the bioruby wiki as well.&lt;br /&gt;&lt;em&gt;Disclaimer nr 3&lt;/em&gt;: This is a &lt;em&gt;proposal&lt;/em&gt;. Bioruby has &lt;em&gt;not&lt;/em&gt; moved to git yet. However, we are working on it and trying to get the support from the main developers. &lt;i&gt;&lt;b&gt;Update&lt;/b&gt;: bioruby has been converted to git (thanks, Anthony) and is now available on github. So you &lt;i&gt;can&lt;/i&gt; clone or fork now. However, the official development is still on CVS.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;b&gt;Update&lt;/b&gt;: I have discovered a very good presentation on how to work and collaborate with git. If you're interested in using git, have a look at &lt;a href="http://www.gitcasts.com/posts/railsconf-git-talk"&gt;this talk&lt;/a&gt;. You can fast forward to 1hr10min27sec where he starts talking about the practical use. Very strongly recommended.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;In this blog post, we try to give some &lt;span style="font-weight: bold;"&gt;guidelines on how people can contribute to the bioruby code&lt;/span&gt; if/when that code will become available on github. The rationale for what we describe here is very much based on the premise that the &lt;span style="font-weight: bold;"&gt;job for the maintainer(s) of bioruby should be as simple as possible&lt;/span&gt;. Their workload should be as light as possible; this means that there are some additional steps that any contributor has to go through.&lt;br /&gt;What follows is only a proposal. This is not a standard operating procedure; it’s only a guideline. Feel free to digress from it or use a completely different workflow. 
But remember: keep it simple for the maintainers.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Git&lt;/h2&gt;&lt;br /&gt;&lt;strong&gt;Distributed source control.&lt;/strong&gt; Git is a truly distributed source control system, and in contrast with &lt;span class="caps"&gt;CVS&lt;/span&gt; or &lt;span class="caps"&gt;SVN&lt;/span&gt;, there is &lt;strong&gt;no central repository&lt;/strong&gt;. With &lt;span class="caps"&gt;CVS&lt;/span&gt; or &lt;span class="caps"&gt;SVN&lt;/span&gt;, every time someone checks out or exports the repository, his own copy is so-to-speak subordinate to the central one. Not so with git: every single clone is equivalent; none is more important than another. In technical terms, the copy of bioruby on your laptop is as important as the one that for example Toshiaki maintains. One of the big advantages is that continued support is more likely should a key developer move on to pastures new (or github goes up in smoke), since the community can simply elect a new "blessed" repository (see below).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;A blessed repository.&lt;/strong&gt; Noticed that I said “in technical terms”? In some cases, like for bioruby, we would obviously like to have some repository that we would consider the ‘true’ one. Enter the notion of a &lt;strong&gt;“blessed” repository&lt;/strong&gt;. This is purely &lt;span style="font-weight: bold;"&gt;by convention&lt;/span&gt;: the community appoints one particular repository as the &lt;strong style="font-weight: normal;"&gt;main&lt;/strong&gt; one.&lt;br /&gt;A good place to put this repository is &lt;a href="http://github.com/"&gt;Github&lt;/a&gt;. For bioruby, this blessed repository will start out to be &lt;strong&gt;http://github.com/bioruby/bioruby&lt;/strong&gt;. Official bioruby builds will take place from there. 
However, development can take place in additional, personal repositories.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Forking.&lt;/strong&gt; Any development of bioruby would happen in clones of this blessed repository. Using the “fork” button on Github not only creates a clone, but it automatically puts that clone on Github itself as well. (Forking has the added value of the github social aspect where the network of changes can be viewed.) So if I wanted to contribute, I would fork from bioruby/bioruby (that is: username/projectname) which would automatically create http://github.com/jandot/bioruby.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Guidelines for contribution&lt;/h2&gt;&lt;br /&gt;There are several ways of contributing: you can either &lt;span style="font-weight: bold;"&gt;create a patch or use a fork/clone&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Here we’ll try to explain how contribution could work with forking for Bioruby, both from the individual contributor’s view and from the view of the person(s) managing the blessed repository. What follows is not a Standard Operating Procedure. You do not have to do it like this. However, it will make it easier on the blessed maintainers to merge your code.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;A. Using patches&lt;/h3&gt;&lt;br /&gt;&lt;h4&gt;A.1 The contributor&lt;/h4&gt;&lt;br /&gt;The simplest way to contribute is to send in patches. RailsCasts has a &lt;a href="http://railscasts.com/episodes/113"&gt;great screencast&lt;/a&gt; explaining this.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Creating a fork&lt;/strong&gt;&lt;br /&gt;Click the “fork” button on the bioruby/bioruby page. This will create a new repository in your own namespace: jandot/bioruby. It’s on this clone that you will be working; you will not touch bioruby/bioruby itself.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Making changes&lt;/strong&gt;&lt;br /&gt;To actually start making changes (e.g. 
you want to add functionality for Ensembl cigar format), you create a local clone on your own computer (step 2 in the picture):&lt;br /&gt;&lt;pre&gt;git clone git@github.com:jandot/bioruby.git&lt;br /&gt;&lt;/pre&gt;This will be your local &lt;em&gt;master&lt;/em&gt; branch. The first thing to do after cloning your own fork is to create an additional branch for the feature you want to work on: &lt;em&gt;add_cigar_format&lt;/em&gt; (step 3).&lt;br /&gt;&lt;br /&gt;The command to do this:&lt;br /&gt;&lt;pre&gt;git checkout -b add_cigar_format&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This will create the new branch and check it out so it becomes your active one. From the fluxbox wiki (http://fluxbox-wiki.org/index.php/Git_-_using): “Branching and merging is very powerful in git. You can create thousands of local branches, one for each bug you work on or feature you implement. It is good practice to do this because it saves you from accidentally pushing changes to another repository.”&lt;br /&gt;&lt;br /&gt;So you’ll end up with 2 branches (do a “git branch”):&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;em&gt;master&lt;/em&gt;: a reflection of the master branch of your remote repository&lt;/li&gt;&lt;li&gt;&lt;em&gt;add_cigar_format&lt;/em&gt;: where the actual work is done&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;The “git branch” output should have a star in front of &lt;em&gt;add_cigar_format&lt;/em&gt; because that’s your current branch. If &lt;em&gt;master&lt;/em&gt; is starred, do a “git checkout add_cigar_format” to change to this branch.&lt;br /&gt;&lt;br /&gt;Now you can edit and change to your heart’s content. Git maintains an index of the files that it is tracking. You can find the current status of the branch by typing&lt;br /&gt;&lt;pre&gt;git status&lt;br /&gt;&lt;/pre&gt;which will list the current status of all the files. 
Changes can be staged in the local index using the command&lt;br /&gt;&lt;pre&gt;git add file&lt;br /&gt;&lt;/pre&gt;The index is an intermediary between the working copy files you are editing and the changes committed to the repository. Changes can be committed from the index to your local repository using the command&lt;br /&gt;&lt;pre&gt;git commit&lt;br /&gt;&lt;/pre&gt;This command will also prompt you for a message describing the commit. Try not to do too much work before committing. &lt;strong&gt;A single commit should concern (part of) a single conceptual change with its tests&lt;/strong&gt;. It’s good practice to commit often (even several commits per conceptual change), but do try not to mix different changes into one commit. Mixing them makes it harder afterwards if a commit has to be reverted.&lt;br /&gt;&lt;br /&gt;Commits are applied only to the currently checked-out branch (i.e. &lt;em&gt;add_cigar_format&lt;/em&gt;), and do not affect any other branches or the original repository. Also, if you have to make site-specific changes (e.g. hard-coding a proxy server in one of the files), try to put those changes in one single commit. This will make it easier to remove them later.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SF-RQp5CIDI/AAAAAAAAAmw/RhmejZ9Nuls/s1600-h/git_guideline_patch.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SF-RQp5CIDI/AAAAAAAAAmw/RhmejZ9Nuls/s320/git_guideline_patch.png" alt="" id="BLOGGER_PHOTO_ID_5215046608827326514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Preparing the patch&lt;/strong&gt;&lt;br /&gt;When you think your change is ready for inclusion in the blessed repository (and you’ve included tests as well), you can &lt;strong&gt;create a patch file&lt;/strong&gt;. 
To make sure that the blessed repository maintainers will have no problem merging your version, you will want to make the patch reflect the latest version of the blessed repository (step 5).&lt;br /&gt;&lt;pre&gt;git remote add blessed git://github.com/bioruby/bioruby.git&lt;br /&gt;git fetch blessed&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So now you can check that the patch you will submit will only contain the changes that you want to be included in the blessed repository. One of the things to look out for is that there are no site-specific configurations in your branch (e.g. a hard-coded proxy or directory path, “STDERR.puts” statements, ...). Hopefully, you put all those site-specific changes in a separate commit as described above. To get rid of them, you just revert that commit. “git log” will show you the &lt;span class="caps"&gt;SHA1&lt;/span&gt; of that particular commit (the long crazy string), and you just run “git revert [that_SHA1]”. After that, check your changes:&lt;br /&gt;&lt;pre&gt;git log -p blessed/master..add_cigar_format&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When that’s done, you can create the actual patch (step 6):&lt;br /&gt;&lt;pre&gt;git format-patch blessed/master..add_cigar_format&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This creates one patch file per commit, which you can send to the maintainer (step 7). And you’re done…&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;A.2 The maintainer&lt;/h4&gt;&lt;br /&gt;The maintainer gets an email from someone containing a patch. The first thing to do is to create a new branch and apply the patch to that branch.&lt;br /&gt;&lt;pre&gt;git checkout -b feature_c&lt;br /&gt;git am &amp;lt;0001-feature_C_commit_message.patch&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Of course he would want to check the changes by comparing the new version of the code with the one that is in the blessed repository (i.e. 
the master).&lt;br /&gt;&lt;pre&gt;git log -p master..feature_c&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If everything looks OK, he can then merge the changes into master itself and push it up onto github.&lt;br /&gt;&lt;pre&gt;git checkout master&lt;br /&gt;git merge feature_c&lt;br /&gt;git push&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;And he’s done. The only thing left to do is remove the branch he created during the process.&lt;br /&gt;&lt;pre&gt;git branch -d feature_c&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;B. Using a pull request&lt;/h3&gt;&lt;br /&gt;&lt;h4&gt;B.1 The contributor&lt;/h4&gt;&lt;br /&gt;This type of contribution starts out exactly the same as the one with patches: you fork/clone, create a feature branch and hack away.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/SF-Rg8pHIzI/AAAAAAAAAm4/7qsyzB4IEZw/s1600-h/git_guideline.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/SF-Rg8pHIzI/AAAAAAAAAm4/7qsyzB4IEZw/s320/git_guideline.png" alt="" id="BLOGGER_PHOTO_ID_5215046888738726706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Preparing the pull&lt;/strong&gt;&lt;br /&gt;When you think your change is ready for inclusion in the blessed repository, you will create a branch specific for this pull (e.g. 
called &lt;em&gt;to_pull&lt;/em&gt;; step 5): “git checkout -b to_pull”.&lt;br /&gt;&lt;br /&gt;To make sure that the blessed repository maintainers will have no problem merging your version, you have to rebase your branch (steps 6 and 7).&lt;br /&gt;&lt;pre&gt;git remote add blessed git://github.com/bioruby/bioruby.git&lt;br /&gt;git fetch blessed&lt;br /&gt;git rebase blessed/master&lt;br /&gt;git checkout blessed/master fileA_for_user_environment_only&lt;br /&gt;git checkout blessed/master fileB_for_user_environment_only&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;At this point, a “git log -p blessed/master..to_pull” can help you check that the differences between your &lt;em&gt;to_pull&lt;/em&gt; branch and the blessed branch only contain the changes that you intend to be pulled (and no leftover “STDERR.puts” statements, for example).&lt;br /&gt;&lt;br /&gt;When you’re satisfied, you can put the &lt;em&gt;to_pull&lt;/em&gt; branch onto your remote repository so it becomes available for the maintainers of the blessed repository (step 8):&lt;br /&gt;&lt;pre&gt;git push origin to_pull:refs/heads/to_pull&lt;br /&gt;&lt;/pre&gt;and press the “Send pull request” button on github.&lt;br /&gt;&lt;br /&gt;After that, wait to hear whether your change is accepted or not. 
When your remote &lt;em&gt;to_pull&lt;/em&gt; branch becomes obsolete, you can remove it (step 10) with&lt;br /&gt;&lt;pre&gt;git push origin :to_pull&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;B.2 The maintainer&lt;/h4&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SF-RtgRSFXI/AAAAAAAAAnA/yrXHpvrQ60A/s1600-h/git_guideline_maintainer.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SF-RtgRSFXI/AAAAAAAAAnA/yrXHpvrQ60A/s320/git_guideline_maintainer.png" alt="" id="BLOGGER_PHOTO_ID_5215047104460887410" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The first thing the maintainer has to do is get the latest version of his own (i.e. the blessed) repository.&lt;br /&gt;&lt;pre&gt;git clone git@github.com:bioruby/bioruby.git&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Then he can get a copy of your to_pull branch:&lt;br /&gt;&lt;pre&gt;git remote add your_name git://github.com/your_name/bioruby.git&lt;br /&gt;git checkout -b your_name/to_pull&lt;br /&gt;git pull your_name to_pull&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;...and check what the change looks like.&lt;br /&gt;&lt;pre&gt;git log -p master..your_name/to_pull&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If he’s satisfied, he can switch back to master and merge your changes into the blessed master branch.&lt;br /&gt;&lt;pre&gt;git checkout master&lt;br /&gt;git merge your_name/to_pull&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If there are no conflicts, he can then push the new version up onto github:&lt;br /&gt;&lt;pre&gt;git push&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Useful links&lt;br /&gt;&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://tomayko.com/writings/the-thing-about-git"&gt;The thing about git&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://drnicwilliams.com/2008/02/03/using-git-within-a-team/"&gt;Using git within a team -&gt; must-read&lt;/a&gt;&lt;/li&gt;&lt;br 
/&gt;&lt;li&gt;&lt;a href="http://www-cs-students.stanford.edu/~blynn/gitmagic/"&gt;A large but very informative and simple document&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://oss.oracle.com/osswiki/GitRepositories/ForMaintainers"&gt;Git repositories for maintainers&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://wiki.samba.org/index.php/Using_Git_for_Samba_Development"&gt;Using git for samba development&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://linux.yyz.us/git-howto.html"&gt;Git howto&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.sourcemage.org/Git_Guide"&gt;GitGuide: intermediate and advanced git&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Who we are&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.linkedin.com/in/anthonyunderwood"&gt;Anthony&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.bioinformaticszen.com/"&gt;Mike&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.greenisgood.co.uk/"&gt;Matt&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://saaientist.blogspot.com/"&gt;jan.&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/that_sha1&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-9210086974877100178?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/9210086974877100178/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/06/bioruby-with-git-how-would-that-work.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9210086974877100178'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9210086974877100178'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/06/bioruby-with-git-how-would-that-work.html' title='Bioruby with git: how would that work?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_t6Ob1J7aZ0A/SF-RQp5CIDI/AAAAAAAAAmw/RhmejZ9Nuls/s72-c/git_guideline_patch.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-361717013607114189</id><published>2008-06-03T11:23:00.010+02:00</published><updated>2008-06-03T15:38:29.115+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='api'/><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='visualization'/><category scheme='http://www.blogger.com/atom/ns#' term='ucsc'/><category scheme='http://www.blogger.com/atom/ns#' term='data management'/><category scheme='http://www.blogger.com/atom/ns#' term='ensembl'/><title type='text'>Would you want to contribute to a small open-source project?</title><content type='html'>Just a quick plug to see if I can find people interested in helping me out in some of my projects.&lt;br /&gt;&lt;br /&gt;In the last 2 years, I started four open source projects (well: the last one was today...), each of which scratches my own itch and does what it needs to do for me. 
However, some features will have to be added and bugs fixed to make them more useful to others (you, that is...).&lt;br /&gt;&lt;br /&gt;If you are using one of these projects, please think about &lt;span style="font-weight: bold;"&gt;contributing&lt;/span&gt;. With that all-new fancy git version control system, it should be simpler than ever to get your own copy, tweak things a bit and send the improvements back.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1. Bio::Graphics&lt;/span&gt;&lt;br /&gt;The Bio::Graphics library "allows for drawing overviews of genomic regions, similar to the pictures drawn by gbrowse" (from the &lt;a href="http://bio-graphics.rubyforge.org"&gt;homepage&lt;/a&gt;). I believe &lt;a href="http://github.com/dgtized"&gt;dgtized&lt;/a&gt; (Charles Comstock) and I have not done a bad job of creating something that is really useful, but of course some features are still missing.&lt;br /&gt;&lt;br /&gt;Some things that come to mind that need help:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Find a good description of how to &lt;span style="font-style: italic;"&gt;install it on a Mac&lt;/span&gt; (I've just changed jobs and am trying to do that, resulting in a major headache and no bio-graphics still).&lt;/li&gt;&lt;li&gt;Add new features such as a type of track to show &lt;span style="font-style: italic;"&gt;continuous data&lt;/span&gt; (e.g. GC-content). We're thinking about ways to implement this, but additional ideas are welcome to get that done.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Is someone working on a &lt;span style="font-style: italic;"&gt;gbrowse-like application&lt;/span&gt; in ruby?&lt;/li&gt;&lt;li&gt;Although I haven't found it a bottleneck yet, it might be a good idea to look at the &lt;span style="font-style: italic;"&gt;performance&lt;/span&gt; of the thing.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;A full tutorial and more information can be found &lt;a href="http://bio-graphics.rubyforge.org/"&gt;here&lt;/a&gt;. 
The actual code repository has been moved from rubyforge to github and can be downloaded at &lt;a href="http://github.com/jandot/bio-graphics"&gt;http://github.com/jandot/bio-graphics&lt;/a&gt;. To get your copy:                &lt;code&gt;git clone git://github.com/jandot/bio-graphics.git&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2. Ruby API to the Ensembl database&lt;/span&gt;&lt;br /&gt;In late spring 2007, I started the ruby API to the &lt;a href="http://www.ensembl.org"&gt;Ensembl database&lt;/a&gt;. This API relies on ActiveRecord and very much tries to copy the functionality of the perl API (including transfer of coordinates between coordinate systems). Just recently I was glad to hear people at the Sanger here are looking/have looked into it as well. At the moment, only the core database is covered but it would be nice if the other ones (funcgen, variation) would be added as well. In addition, the API was developed based on Ensembl release 45. With a new release coming out every few months, the API has to be tested against those as well.&lt;br /&gt;&lt;br /&gt;What needs help:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Add an API for the &lt;span style="font-style: italic;"&gt;other databases&lt;/span&gt;, including variation and funcgen.&lt;/li&gt;&lt;li&gt;Keep the API testing going for each &lt;span style="font-style: italic;"&gt;new release&lt;/span&gt; of the database.
&lt;/li&gt;&lt;/ul&gt;The full tutorial can be found on &lt;a href="http://bioruby-annex.rubyforge.org"&gt;rubyforge&lt;/a&gt;, but the source control has also moved to github at &lt;a href="http://github.com/jandot/ruby-ensembl-api"&gt;http://github.com/jandot/ruby-ensembl-api&lt;/a&gt;. Get your copy using                &lt;code&gt;git clone git://github.com/jandot/ruby-ensembl-api.git&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3. Ruby API to the UCSC database&lt;/span&gt;&lt;br /&gt;I just started out the API for the &lt;a href="http://genome.ucsc.edu"&gt;UCSC database&lt;/a&gt;. There is data available in that database that cannot be found in Ensembl yet, for example copy number variations. So I started a new project (now solely on github) by copy-paste-modifying some code of the Ensembl API. Unfortunately, there are 3415 tables in the hg18 database (yes, that's three thousand four hundred and fifteen). Obviously, I only created interfaces for the tables that I will need at work.&lt;br /&gt;&lt;br /&gt;What needs help:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Add additional tables to the API.&lt;/li&gt;&lt;li&gt;Think about additional functionality that might be added to some of the models.&lt;/li&gt;&lt;li&gt;A tutorial.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Again, get your copy from &lt;a href="http://github.com/jandot/ruby-ucsc-api"&gt;http://github.com/jandot/ruby-ucsc-api&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;4. Simple Project Logger&lt;/span&gt;&lt;br /&gt;As mentioned in my &lt;a href="http://saaientist.blogspot.com/2008/05/keeping-track-of-things-using-labbook.html"&gt;previous post&lt;/a&gt;, I use a simple rails application to keep track of the things I work on; a digital labbook, basically. 
That application is called Simple Project Logger (or, by its unix name, sprolog). Its only task is to allow me to create tasks within projects and log what I've done for each task. Sprolog works well enough for me at the moment, but there is a nasty bug I can't get rid of. In addition, it just looks ugly.&lt;br /&gt;&lt;br /&gt;So if you're interested:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The authentication doesn't work yet. You &lt;span style="font-style: italic;"&gt;can&lt;/span&gt; log in using OpenID, but it is still possible to view anyone's projects and tasks by just typing the full URL. I know it will need a "before_filter :login_required", but that just breaks the thing.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Sprolog could definitely use some CSS-love.&lt;/li&gt;&lt;/ul&gt;You can download sprolog from &lt;a href="http://github.com/jandot/sprolog"&gt;http://github.com/jandot/sprolog&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I'll buy you a beer.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-361717013607114189?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/361717013607114189/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/06/would-you-want-to-contribute-to-small.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/361717013607114189'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/361717013607114189'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/06/would-you-want-to-contribute-to-small.html' title='Would you want to contribute to a small open-source project?'/><author><name>Jan 
Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-7208240780907157861</id><published>2008-05-20T10:55:00.012+02:00</published><updated>2009-10-16T01:57:50.725+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='data management'/><category scheme='http://www.blogger.com/atom/ns#' term='organization'/><title type='text'>Keeping track of things: using a labbook for bioinformatics</title><content type='html'>It's been a while since my last post. Left my last job, was unemployed for a month (while still chairing a session at a conference), and just started my new position here at the Sanger Institute.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.flickr.com/photos/protactinium/2335457878/" title="CS4 project - a page from my lab book by hapticflapjack, on Flickr"&gt;&lt;img src="http://farm4.static.flickr.com/3186/2335457878_866c7686fa.jpg" width="350" height="500" alt="CS4 project - a page from my lab book" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;On every job you're able to pick up some new things that can help you out later. One of the good ones from Roslin was how to keep a &lt;span style="font-weight: bold;"&gt;labjournal for bioinformatics&lt;/span&gt;. In the position &lt;span style="font-style: italic;"&gt;before&lt;/span&gt; Roslin (at Wageningen University in the Netherlands), I remember having trouble remembering what I did to my data. 
So I was really happy to see that they actually had thought about those things in Roslin...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;So what's the problem?&lt;/span&gt;&lt;br /&gt;There are very significant &lt;span style="font-weight: bold;"&gt;parallels between bench-based labwork and computer-based data mangling&lt;/span&gt;. In both, you take some &lt;span style="font-style: italic;"&gt;input&lt;/span&gt; (e.g. an eppendorf filled to the brim with DNA, or a data file downloaded from the internet), you perform some &lt;span style="font-style: italic;"&gt;actions&lt;/span&gt; on it (e.g. PCR at such-and-such temperatures, or a grep followed by a sort and uniq) to get some &lt;span style="font-style: italic;"&gt;output&lt;/span&gt; (e.g. an agarose-gel picture, or a number). In the wet-lab world, it's almost always mandatory to keep a lab journal in which you write down where you got the DNA from, which concentrations of which chemicals you used and what voltage you used for running the gel. However, for people doing a little bit of scripting to get some data out of a big set of files for example, there often is no such obligation. "I just played around with the data", you'll hear. But they will need a mighty good memory if they are to recall what they did after a couple of weeks. Bioinformaticians (i.e. those who manipulate data) have the same obligation as any other researcher: &lt;span style="font-weight: bold;"&gt;your work should be described in enough detail that other researchers can repeat the steps&lt;/span&gt; to get to the same result.&lt;br /&gt;&lt;br /&gt;Enter the &lt;span style="font-weight: bold;"&gt;SOP for bioinformatics&lt;/span&gt; written by my former PI (little wave to Andy). It has some really good suggestions for people involved in data handling, mangling and mining. In this post, I will try to highlight some of them. 
Note that this is &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; about application or API development, but about data. (I hope to post a new blog entry later about using svn and/or git for that.)&lt;br /&gt;&lt;br /&gt;The central tool used for recording bioinformatics work at my previous job was &lt;a href="http://bestpractical.com/rt/"&gt;RT (Request Tracker)&lt;/a&gt;, a web-based ticketing tool that is often used for keeping track of helpdesk tickets. However, I found it a bit too big, with too many features for my own purposes, and decided to write a little application myself that would do just what I need: the &lt;span style="font-weight: bold;"&gt;Simple Project Logger&lt;/span&gt; (&lt;a href="http://sprolog.rubyforge.org/"&gt;sprolog&lt;/a&gt;, I can plug this in my own blog, right?). Mind that although I use it at the moment it's still in alpha and full of bugs.&lt;br /&gt;&lt;br /&gt;The main &lt;span style="font-weight: bold;"&gt;requirements of the recording workflow&lt;/span&gt; are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If you get data from somewhere or someone other than yourself, record &lt;span style="font-weight: bold;"&gt;where/whom you got the data&lt;/span&gt; from. Of course, there might be updates of the files you downloaded from that FTP server a couple of months ago even though those files have the same name. To be able to tell afterwards, &lt;span style="font-weight: bold;"&gt;md5sums&lt;/span&gt; should be made of any downloaded files and files that were sent to you by email.&lt;/li&gt;&lt;li&gt;Any &lt;span style="font-weight: bold;"&gt;mangling of the data should be recorded&lt;/span&gt;. 
Stuff like "my_script.rb &amp;lt; input.txt &gt; output.txt" and "grep 'abc' input.txt &gt; output.txt".&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;So how does that work &lt;span style="font-weight: bold;"&gt;in practice&lt;/span&gt;?&lt;br /&gt;The sprolog application I wrote has the concepts of &lt;span style="font-weight: bold;"&gt;project, task and step&lt;/span&gt;. A project is a, well, a project. For example: "build my house" or "sprolog". A task is some distinct thing you have to do within the project, e.g. "place the roof" or "add authentication". Each task is then completed by a number of steps ("phoned contractor", "installed acts_as_authenticated").&lt;br /&gt;When starting a new project, I give that project its own subdirectory under ~/Documents/Projects/. In turn, each task gets its own subdirectory within that project, named using the following convention: date + sprolog ticket number + short description (e.g. "20080513_T4-5_GenerateOligosForNewArray"). All work for that task is performed within that directory.&lt;br /&gt;While performing the work, I copy/paste all necessary steps into sprolog. 
Typical steps look like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;Step -&gt; Recorded at Tue May 13 15:42:37 +0100 2008:&lt;br /&gt;&lt;br /&gt;Saved attachment of John Doe as #{project_dir}/his_data.xls&lt;br /&gt;&lt;br /&gt;Extracted tab-delimited version for each chromosome, changed newlines and added # before header.&lt;br /&gt;&lt;br /&gt;MD5 (his_data_chr1.txt) = e5c38a91d8e5a666488863099fc5ef1c&lt;br /&gt;MD5 (his_data_chr10.txt) = 9a702fb1f31bec42ec87089fc77efcc5&lt;br /&gt;MD5 (his_data_chr11.txt) = e1a9a63e5c016cf93cb08ea6a5e425e5&lt;br /&gt;MD5 (his_data_chr12.txt) = 8ab2bf7032f56df93b8b10c78bc2e1d4&lt;br /&gt;MD5 (his_data_chr13.txt) = c1b2d609956edcf80657ed5f90b9469c&lt;br /&gt;MD5 (his_data_chr14.txt) = fa6bfda1cd4e76f797ed8bd88d508448&lt;br /&gt;MD5 (his_data_chr15.txt) = 46dbd8de0916dd69e81c519ac05671fe&lt;br /&gt;MD5 (his_data_chr16.txt) = 302d920c6bef199a4bf40cfa2171348f&lt;br /&gt;MD5 (his_data_chr17.txt) = 9aee0113c96f919c0603da3ccb9fca44&lt;br /&gt;&amp;lt;snip&amp;gt;&lt;br /&gt;MD5 (his_data_chr8.txt) = e0d38c6804e39cae883dedfc648a2cda&lt;br /&gt;MD5 (his_data_chr9.txt) = 94dfc7abc08d2b143f4eb13f29cadbdb&lt;br /&gt;MD5 (his_data_chrUn.txt) = 5275595d7dfd4d4eb664e6bc9b08398c&lt;br /&gt;MD5 (his_data_chrX.txt) = d90ba7f40b1019e1bbf981d894268dbc&lt;br /&gt;MD5 (his_data_chrY.txt) = b991ff6f92869dcdc7b39da71d4d4b16&lt;br /&gt;&lt;br /&gt;Step -&gt; Recorded at Tue May 13 15:58:43 +0100 2008:&lt;br /&gt;&lt;br /&gt;Venter: email boss gives the conditions on how to select deletions in reference genome.&lt;br /&gt;&lt;br /&gt;Just to make sure I've understand correctly, if I want to identify&lt;br /&gt;features for which there is &gt;1kb of non-N sequence for which the&lt;br /&gt;reference sequence has the allele then I identify all sequences in the Excel file and filter on those that have &gt;1000 non-N bases.&lt;br /&gt;&lt;br /&gt;Wrote script filter_records.rb to run this filter.&lt;br /&gt;&lt;br /&gt;Step -&gt; Recorded at Wed May 14 10:41:53 +0100 2008:&lt;br /&gt;&lt;br /&gt;Renamed filter_records.rb to filter_records_on_non_n_bases.rb&lt;br /&gt;&lt;br /&gt;ruby ./filter_records_on_non_n_bases.rb &gt; filtered_records.csv&lt;br /&gt;&lt;br /&gt;Number of lines in output file: 4411&lt;br /&gt;&lt;br /&gt;Next step: repeatmasking&lt;br /&gt;&lt;br /&gt;Problem: we don’t have the HuRef sequences, so those have to be downloaded first.&lt;br /&gt;&lt;br /&gt;Downloaded assembled HuRef chromosomes from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/ into ~/Documents/DataRepository/HuRef/&lt;br /&gt;&lt;br /&gt;Step -&gt; Recorded at Wed May 14 11:05:26 +0100 2008:&lt;br /&gt;&lt;br /&gt;MD5 (hs_alt_HuRef_chr1.fa.gz) = 684e628536fa87b96343f1fea6219328&lt;br /&gt;MD5 (hs_alt_HuRef_chr10.fa.gz) = 02ec433e2b00811db98c77a2fff3d161&lt;br /&gt;MD5 (hs_alt_HuRef_chr11.fa.gz) = cca2a7098ed4d706dc8af7c58a2b9807&lt;br /&gt;MD5 (hs_alt_HuRef_chr12.fa.gz) = 244bfd0f3f26cc109132c5518b2a1fb3&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;
&lt;span style="font-weight: bold;"&gt;Lab journal&lt;/span&gt;&lt;br /&gt;Once a considerable number of steps is performed or a task is completed, I print them out and &lt;span style="font-weight: bold;"&gt;glue them into a paper lab journal&lt;/span&gt;. That might look like a waste of paper and completely unnecessary because you've got everything in electronic format anyway. I might change that behaviour later, but for the moment I just like to browse through physical pages when I need to know what I did rather than having to look at a screen. It's also easy to add annotations on those paper pages.&lt;br /&gt;&lt;br /&gt;Note: if anyone is interested in helping develop sprolog, please let me know.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE&lt;/span&gt;: sprolog is now hosted on github at &lt;a href="http://github.com/jandot/sprolog"&gt;http://github.com/jandot/sprolog&lt;/a&gt;. Development on rubyforge will stop. 
Get your own copy by cloning it:&lt;br /&gt;&lt;code&gt;git clone git://github.com/jandot/sprolog.git&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;" &gt;Note: picture taken from http://www.flickr.com/photos/cdnphoto/301083106/&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-7208240780907157861?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/7208240780907157861/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/05/keeping-track-of-things-using-labbook.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/7208240780907157861'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/7208240780907157861'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/05/keeping-track-of-things-using-labbook.html' title='Keeping track of things: using a labbook for bioinformatics'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3186/2335457878_866c7686fa_t.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-5106306692547061495</id><published>2008-03-12T17:57:00.010+01:00</published><updated>2008-03-12T19:48:26.425+01:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='data management'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Where did I get that data from?</title><content type='html'>Did you ever have data lying around that you couldn't figure out where you got it from?&lt;br /&gt;&lt;br /&gt;You downloaded and imported data from an FTP site into your database ages ago and you actually want to use it now. But if different records come from different sources, it can be really challenging to know what data to trust or how to retrieve additional information afterwards. Not keeping track of the source of the data breaks the chain of &lt;span style="font-weight: bold;"&gt;provenance&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;I've seen it happen.&lt;br /&gt;&lt;br /&gt;So what do you do? The obvious thing when you're working with a database is to have an &lt;span&gt;additional column&lt;/span&gt; for that table where you can store the FTP URL or the people who sent you the data. Easy peasy. Things get a lot more complicated if &lt;span style="font-weight: bold;"&gt;different cells in a single row can have different sources&lt;/span&gt;. For example: suppose one of your tables (called markers) contains information on STSs.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;id  name      fw_primer     rev_primer    PCR product length   gene&lt;br /&gt;---------------------------------------------------------------------&lt;br /&gt;1   marker_1  AACCGGACGA    GACCTCGGAGAC         241           CYP2D6&lt;br /&gt;2   marker_2  TCAATGGAGG    GATTCGCTGACTC        183           BRCA2&lt;br /&gt;3   marker_3  CGCTATGACTGC  AACTGCGTCATG         221           DAG1&lt;br /&gt;4    ...         ...          ...                ...           ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Let's say that you get the primers and STS length from a couple of inputs (e.g. 
dbSTS as well as designed by colleagues: marker_1 can come from dbSTS while the other two were created by your colleague Tim) and the gene information was added by two different colleagues (let's say Bert and Pat). In this case it becomes quite impossible to have that information stored in an additional column of the table ("primers: dbSTS; gene: Pat"??).&lt;br /&gt;&lt;br /&gt;I've discussed this issue with a database guy during the last Perl Programming course at CSHL, and guess what: there's no easy solution. Having it play in the back of my head for the last couple of months, I finally took the effort to actually draw some possibilities on a piece of paper. And here's what might do the trick: just add an &lt;span style="font-weight: bold;"&gt;additional table&lt;/span&gt;. Let's call it marker_sources. This table has exactly the same columns as the markers table (apart from a foreign key to the data table).&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;marker_id  name      fw_primer     rev_primer    PCR product length   gene&lt;br /&gt;---------------------------------------------------------------------------&lt;br /&gt;   1       dbSTS      dbSTS          dbSTS            dbSTS           Pat&lt;br /&gt;   2        Tim        Tim            Tim              Tim            Pat&lt;br /&gt;   3        Tim        Tim            Tim              Tim            Bert&lt;br /&gt;   4        ...         ...          ...                ...           ...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Result: the resolution of your provenance increases up to single cell level. This &lt;span style="font-style: italic;"&gt;does&lt;/span&gt; mean additional tables, but for typical use they will not have to be queried if you're just interrogating the data. 
But at least you can get to that information if needs be...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-5106306692547061495?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/5106306692547061495/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/03/where-did-i-get-that-data-from.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5106306692547061495'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/5106306692547061495'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/03/where-did-i-get-that-data-from.html' title='Where did I get that data from?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-3734939999959280508</id><published>2008-02-26T10:20:00.011+01:00</published><updated>2008-02-26T11:12:02.678+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='technical'/><category scheme='http://www.blogger.com/atom/ns#' term='testing'/><title type='text'>Testing small scripts</title><content type='html'>Seasoned programmers know this: &lt;span style="font-weight: bold;"&gt;testing&lt;/span&gt; should 
be an integral part of developing any script/program/software suite. Part and parcel of this is the &lt;span style="font-weight: bold;"&gt;unit test&lt;/span&gt;, where you test every little aspect of your program little by little.&lt;br /&gt;&lt;br /&gt;For larger projects using a bunch of library files, the setup for testing basically always looks the same: there's your /lib/ directory with your class definitions and your /test/unit/ directory which holds your tests. See &lt;a href="http://www.ruby-doc.org/stdlib/libdoc/test/unit/rdoc/classes/Test/Unit.html"&gt;here&lt;/a&gt; and &lt;a href="http://en.wikibooks.org/wiki/Ruby_Programming/Unit_Testing"&gt;here&lt;/a&gt; for introductions to full-blown unit tests.&lt;br /&gt;&lt;br /&gt;That's all nice and fine, but as a bioinformatician you often just write &lt;span style="font-weight: bold;"&gt;small scripts&lt;/span&gt; for which it would be way too much hassle to create those different directories and to separate the files containing the classes from those containing the tests. So what do we do?&lt;br /&gt;&lt;br /&gt;Often, you end up running your program and looking out for part of the output that you know should be correct. Let's take a very simple example. Suppose we have a file with just one column that has numbers in it. The same number can occur multiple times, and the ground-breaking script you'll write will just count the occurrences of each. Even though there are thousands of lines, you know from visual inspection that there are 7 1's and 15 2's. 
So a script could look like this:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;occurrences = Hash.new(0)&lt;br /&gt;File.open('data.txt').each do |number|&lt;br /&gt;  number.chomp!&lt;br /&gt;  occurrences[number.to_i] += 1&lt;br /&gt;end&lt;br /&gt;occurrences.keys.each do |k|&lt;br /&gt;  puts k.to_s + "\t" + occurrences[k].to_s&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;So you'd run the script and check if you get the expected values for 1 and 2. If not: revise, rerun and check again.&lt;br /&gt;&lt;br /&gt;But this looks like something that would be ideally suited for a unit test, if it weren't for the fact that it'd be too much hassle creating those different files and all. What if we could put the testing code in the script itself?&lt;br /&gt;&lt;br /&gt;Actually, with a few adjustments, that's not a problem. Look at the following version of the code.&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Parser&lt;br /&gt;  attr_accessor :occurrences&lt;br /&gt;  def run&lt;br /&gt;    @occurrences = Hash.new(0)&lt;br /&gt;    File.open('data.txt').each do |number|&lt;br /&gt;      number.chomp!&lt;br /&gt;      @occurrences[number.to_i] += 1&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;if ! 
$test&lt;br /&gt;  p = Parser.new&lt;br /&gt;  p.run&lt;br /&gt;  p.occurrences.keys.each do |k|&lt;br /&gt;    puts k.to_s + "\t" + p.occurrences[k].to_s&lt;br /&gt;  end&lt;br /&gt;else&lt;br /&gt;  require 'test/unit'&lt;br /&gt;  class TestSimple &amp;lt; Test::Unit::TestCase&lt;br /&gt;    def test_simple&lt;br /&gt;      p = Parser.new&lt;br /&gt;      p.run&lt;br /&gt;      # assert_equal takes the expected value first, then the actual one&lt;br /&gt;      assert_equal(7, p.occurrences[1])&lt;br /&gt;      assert_equal(15, p.occurrences[2])&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;So &lt;span style="font-weight:bold;"&gt;what happened&lt;/span&gt; here?&lt;br /&gt;&lt;br /&gt;The original script, like so many scripts we write, actually does two things: (1) it parses a file to extract information, and (2) it prints some things out. In these cases, we can take the approach I outlined here. The biggest change you have to make is to &lt;span style="font-weight:bold;"&gt;put your code in a class&lt;/span&gt;, so that the unit test can call the parsing step on its own. Secondly, the &lt;span style="font-style:italic;"&gt;if ! $test&lt;/span&gt; separates out the behaviour of the code depending on whether you want it tested or just run. I'll explain this line later. But if the &lt;span style="font-style:italic;"&gt;! $test&lt;/span&gt; condition is true, the script just dumps the same output as the first version. However, when that condition is false, the script loads test/unit and runs one test with two assertions: checking if the value for 1 is 7 and the value for 2 is 15.&lt;br /&gt;&lt;br /&gt;How does that &lt;span style="font-style:italic;"&gt;if ! $test&lt;/span&gt; work? If you call your script using &lt;br /&gt;&lt;pre name="code" class="bash"&gt;&lt;br /&gt;ruby -s my_script.rb -test&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;instead of&lt;br /&gt;&lt;pre name="code" class="bash"&gt;&lt;br /&gt;ruby my_script.rb&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;the -s switch makes ruby turn the -test that follows the script name into an extra (global) variable, $test, which is set to true. Without it, $test is simply nil. 
See man ruby for more information.&lt;br /&gt;&lt;br /&gt;So with this approach you can use test-driven development also in your teenie weenie scripts and not just in your mammoth software suites.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-3734939999959280508?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/3734939999959280508/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/02/testing-your-code.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/3734939999959280508'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/3734939999959280508'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/02/testing-your-code.html' title='Testing small scripts'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4086365412754962297</id><published>2008-02-06T12:39:00.000+01:00</published><updated>2008-02-06T12:48:53.887+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='technical'/><category scheme='http://www.blogger.com/atom/ns#' term='graphics'/><title type='text'>Making Bio::Graphics extendable</title><content type='html'>One of the issues in a library like &lt;a 
href="http://bio-graphics.rubyforge.org/"&gt;Bio::Graphics&lt;/a&gt; is the plethora of glyph types that users will want. Here's a little showcase of what's provided by the library:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bio-graphics.rubyforge.org/images/glyph_showcase.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px;" src="http://bio-graphics.rubyforge.org/images/glyph_showcase.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Features on a DNA sequence can be represented as filled boxes, open boxes, boxes with arrows, lines, triangles, ... In this post, I'll show you (and remind myself) how I came to a version of the Bio::Graphics code that makes adding glyphs straightforward for both myself and the user. &lt;i&gt;WARNING&lt;/i&gt;: this post is going to be rather technical... Sorry about that.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;First pass&lt;/b&gt;&lt;br /&gt;Suppose we want to make it possible to create a picture like this one:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bio-graphics.rubyforge.org/images/example_labels.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px;" src="http://bio-graphics.rubyforge.org/images/example_labels.png" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;You basically have to tell your script that marker features should be drawn as triangles, and both scaffold and clone features as coloured boxes. 
The initial version of doing the actual drawing looked like this (only taking the relevant bits):&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Feature&lt;br /&gt;  def initialize(glyph = :generic)&lt;br /&gt;    @glyph = glyph&lt;br /&gt;  end&lt;br /&gt;  attr_accessor :glyph&lt;br /&gt;&lt;br /&gt;  def draw&lt;br /&gt;    case @glyph&lt;br /&gt;    when :generic&lt;br /&gt;      drawing.rectangle(left, top, width, height).fill&lt;br /&gt;    when :line&lt;br /&gt;      drawing.move_to(left,top)&lt;br /&gt;      drawing.line_to(right,top)&lt;br /&gt;      drawing.stroke&lt;br /&gt;    when :triangle&lt;br /&gt;      # code to draw triangle&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This &lt;i&gt;does&lt;/i&gt; work, but you see the issue, right? Whenever I or someone else comes up with another idea on how to represent a particular feature, the library code itself has to be changed. So far from extendable, that is...&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Second pass: extracting the glyphs&lt;/b&gt;&lt;br /&gt;To handle this issue for perl's Bio::Graphics, Lincoln Stein uses the Factory pattern. Which means that he creates a single GlyphFactory object that spits out different Glyph objects for each feature based on the configuration set at the Feature level. As I didn't know a thing about Design Patterns (i.e. &lt;i&gt;before&lt;/i&gt; Russ Olsen's "&lt;a href="http://www.amazon.com/Design-Patterns-Ruby-Addison-Wesley-Professional/dp/0321490452/"&gt;Design Patterns in Ruby&lt;/a&gt;" arrived here at work) I had no idea how to set something up like that and just started coding away. As it turns out, I actually implemented it using a Strategy pattern.&lt;br /&gt;&lt;br /&gt;What I basically wanted, is to delegate the actual drawing of a feature to a glyph. The Design Patterns in Ruby book gives a good example for formatting text. 
Here's the code:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class XMLFormatter&lt;br /&gt;  def output_report(title, text)&lt;br /&gt;    puts('&amp;lt;xml&amp;gt;')&lt;br /&gt;    # double quotes are needed here: single-quoted strings do not interpolate #{...}&lt;br /&gt;    puts("  &amp;lt;title&amp;gt;#{title}&amp;lt;/title&amp;gt;")&lt;br /&gt;    puts("  &amp;lt;text&amp;gt;#{text}&amp;lt;/text&amp;gt;")&lt;br /&gt;    puts('&amp;lt;/xml&amp;gt;')&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;class PlainTextFormatter&lt;br /&gt;  def output_report(title, text)&lt;br /&gt;    puts("***** #{title} *****")&lt;br /&gt;    puts text&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This can then be used in e.g. a Report class like this (also from the same book):&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Report&lt;br /&gt;  attr_reader :title, :text&lt;br /&gt;  attr_accessor :formatter&lt;br /&gt;&lt;br /&gt;  def initialize(formatter)&lt;br /&gt;    @title = 'Monthly Report'&lt;br /&gt;    @text = 'Things are going pretty well.'&lt;br /&gt;    @formatter = formatter&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def output_report&lt;br /&gt;    @formatter.output_report(@title, @text)&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Looks a lot like what we need, doesn't it? 
Translating this to our purposes, the library code could look like this:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Glyph::Common&lt;br /&gt;  def initialize(caller)&lt;br /&gt;    @caller = caller  # the feature that asks to be drawn&lt;br /&gt;  end&lt;br /&gt;  attr_accessor :caller&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;class Glyph::Generic &amp;lt; Glyph::Common&lt;br /&gt;  def draw(left, top, width, height)&lt;br /&gt;    @caller.drawing.rectangle(left, top, width, height).fill&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;class Glyph::Line &amp;lt; Glyph::Common&lt;br /&gt;  def draw(left, top, width, height)&lt;br /&gt;    @caller.drawing.move_to(left, top)&lt;br /&gt;    @caller.drawing.line_to(left + width, top)&lt;br /&gt;    @caller.drawing.stroke&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;And use it in the Feature class like this:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Feature&lt;br /&gt;  def initialize(glyph_class = Glyph::Generic)&lt;br /&gt;    @glyph_object = glyph_class.new(self)&lt;br /&gt;  end&lt;br /&gt;  attr_accessor :glyph_object&lt;br /&gt;&lt;br /&gt;  def draw&lt;br /&gt;    # left, top, width and height are the feature's own coordinates (not shown)&lt;br /&gt;    @glyph_object.draw(left, top, width, height)&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;At least this approach splits out the actual drawing into different simple classes. But the extendability still isn't there: the user still has to open the library file containing all glyph definitions and hack away in there.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Third pass: loading glyphs automatically&lt;/b&gt;&lt;br /&gt;It'd be nice if we could add new glyph types on the fly just by creating a little file containing the code for that glyph's class. 
&lt;b&gt;Convention over configuration&lt;/b&gt; to the rescue...&lt;br /&gt;&lt;br /&gt;What I did was create a folder (/lib/bio/graphics/glyphs/) that contains the description of all glyphs in separate files:&lt;br /&gt;&lt;i&gt;generic.rb&lt;/i&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Glyph::Generic &amp;lt; Glyph::Common&lt;br /&gt;  def draw(left, top, width, height)&lt;br /&gt;    @caller.drawing.rectangle(left, top, width, height).fill&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;line.rb&lt;/i&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Glyph::Line &amp;lt; Glyph::Common&lt;br /&gt;  def draw(left, top, width, height)&lt;br /&gt;    @caller.drawing.move_to(left, top)&lt;br /&gt;    @caller.drawing.line_to(left + width, top)&lt;br /&gt;    @caller.drawing.stroke&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;So ideally, the only thing needed to make a script work that asks for a feature to be drawn as an empty box (&lt;i&gt;feature = Feature.new(:empty_box)&lt;/i&gt;) would be to add a file called 'empty_box.rb' to that directory. Several things have to be taken care of to make that happen:&lt;br /&gt;* loading the new file&lt;br /&gt;* translating the :empty_box to EmptyBox&lt;br /&gt;&lt;br /&gt;To load all files in that directory is easy enough. 
Adding the following code to the main bio-graphics.rb file (which loads the whole library) does the trick:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;glyph_dir = File.join(File.dirname(__FILE__), 'bio', 'graphics', 'glyphs')&lt;br /&gt;# common.rb defines the superclass, so it has to be loaded first&lt;br /&gt;require File.join(glyph_dir, 'common.rb')&lt;br /&gt;full_pattern = File.join(glyph_dir, '*.rb')&lt;br /&gt;Dir.glob(full_pattern).each do |file|&lt;br /&gt;  require file&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;To translate the :empty_box symbol into the EmptyBox class takes a little more work: we need to convert the snake_case symbol into a CamelCase string, and then create an object of the class that has that name. To do that, I extended the String class a bit with these additional methods:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class String&lt;br /&gt;  def snake_case&lt;br /&gt;    return self.to_s.gsub(/::/, '/').gsub(/([A-Z]+)([A-Z][a-z])/,'\1_\2').gsub(/([a-z\d])([A-Z])/,'\1_\2').tr("-", "_").downcase&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def camel_case&lt;br /&gt;    return self.to_s.gsub(/\/(.?)/) { "::" + $1.upcase }.gsub(/(^|_)(.)/) { $2.upcase }&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def to_class&lt;br /&gt;    # walk down the namespace one constant at a time&lt;br /&gt;    parts = self.split(/::/)&lt;br /&gt;    klass = Kernel&lt;br /&gt;    parts.each do |part|&lt;br /&gt;      klass = klass.const_get(part)&lt;br /&gt;    end&lt;br /&gt;    return klass&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Now what happens here? The snake_case and camel_case methods should not be that difficult to understand and are not really where the magic happens. The String#to_class method however is a different story. As it happens, every class in ruby is also represented by a constant (the class name always starts with a capital). 
To get to the class that has the name MyClass, all you have to do is retrieve the constant with that name: &lt;i&gt;Kernel.const_get("MyClass")&lt;/i&gt;. Unfortunately, having namespaces (Bio::Graphics::Glyph::Generic) makes things a bit difficult. You can't just do &lt;i&gt;Kernel.const_get("Bio::Graphics::Glyph::Generic")&lt;/i&gt;. To get to the Generic class, you have to call the const_get method on the Bio::Graphics::Glyph class, which we first have to look up in the same way. Therefore we have to loop through all parts of the namespace and build up the class as we go.&lt;br /&gt;&lt;br /&gt;With this code in place, I rewrote the Feature class to use this functionality:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class Feature&lt;br /&gt;  def initialize(glyph = :generic)&lt;br /&gt;    @glyph = glyph&lt;br /&gt;  end&lt;br /&gt;  attr_accessor :glyph&lt;br /&gt;&lt;br /&gt;  def draw&lt;br /&gt;    glyph_name = 'Bio::Graphics::Glyph::' + @glyph.to_s.camel_case&lt;br /&gt;    glyph_class = glyph_name.to_class&lt;br /&gt;    # left, top, width and height are the feature's own coordinates (not shown)&lt;br /&gt;    glyph_class.new(self).draw(left, top, width, height)&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Now all a user has to do to add a new glyph type to his application, is:&lt;br /&gt;* create a file in the lib/bio/graphics/glyphs/ directory that defines the glyph&lt;br /&gt;* make sure that the name he gives to that class is the CamelCase version of the symbol he wants to use (which should be snake_case)&lt;br /&gt;&lt;br /&gt;There you go. As I warned at the start: technical. At the moment this setup works for what I need the Bio::Graphics library to do. The approach might change in the future as we need to handle subfeatures, subsubfeatures, subsubsubfeatures, ... more elegantly. 
But that's something for another post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4086365412754962297?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/4086365412754962297/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2008/02/making-biographics-extendable.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4086365412754962297'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4086365412754962297'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2008/02/making-biographics-extendable.html' title='Making Bio::Graphics extendable'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-6175920613140986429</id><published>2007-11-26T04:37:00.000+01:00</published><updated>2007-11-26T21:21:44.675+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><title type='text'>What makes code beautiful?</title><content type='html'>Saw &lt;a href="http://rubyhoedown2007.confreaks.com/session09.html"&gt;this webcast&lt;/a&gt; a couple of weeks ago where Marcel Molina explains the notion of &lt;span style="font-weight: bold;"&gt;beautiful code&lt;/span&gt;. 
And I really recommend that anyone writing code have a look at it (totally irrespective of the fact he uses a ruby example...). If you look at the progress bar of the presentation, it looks like it's really long, but only the first half is the actual presentation, followed by a lengthy discussion. (The last few minutes have a nice wrap-up by Chad Fowler.)&lt;br /&gt;&lt;br /&gt;The talk is basically split into two parts: (a) what is beauty, and (b) how does that apply to coding?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;What is beauty?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After going into the history of what philosophers like Plato, Rousseau and others found beautiful, Molina focuses on the three rules set out by Thomas Aquinas: proportion, integrity and clarity. Things that are beautiful comply with these rules. The examples that Molina uses:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;proportion&lt;/span&gt;: suppose the fingers on your hand were twice or half as long. Wouldn't really be beautiful, would it? Think '&lt;a href="http://goldennumber.net/hand.htm"&gt;golden ratio&lt;/a&gt;' here.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;integrity&lt;/span&gt;: a thing has to be fit for its purpose. Take for example a hammer made of crystal. It might &lt;span style="font-style: italic;"&gt;look&lt;/span&gt; beautiful, but can't do what it's supposed to. The appearance of beauty is not necessarily the same as real beauty.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;clarity&lt;/span&gt;: it should be clear what something is/does.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;How does it apply to coding?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It's straightforward to apply these principles to programming code. First, let's take a look at &lt;span style="font-weight: bold;"&gt;proportion&lt;/span&gt;. The code for simple functionality should be short. Not long. 
Adding two numbers should be a single call. Molina gives an example here of how that had to be done in assembly. Really &lt;span style="font-style: italic;"&gt;way&lt;/span&gt; too much coding going on to just do that. Here at work, the phrase "Jan can do that in 2 lines of ruby" is often heard (fortunately a little bit less, lately (pfew)). I feel that's not because I'm a code hacker, but because Ruby makes it simple to do simple things. Perl has the same characteristic, but at a price (I'll come back to that later).&lt;br /&gt;Secondly, there's &lt;span style="font-weight: bold;"&gt;integrity&lt;/span&gt;. Does your code do what it's supposed to do? That's the most widely-used notion of good code: do the tests run? But code that does what it has to do can still be ugly if it doesn't comply with the other principles. Spaghetti, anyone?&lt;br /&gt;Thirdly, there's &lt;span style="font-weight: bold;"&gt;clarity&lt;/span&gt;. How much of your code can your fellow programmers understand by looking at it for less than a day? The very fact that you have to explain your code might sometimes indicate that you're violating this rule.&lt;br /&gt;&lt;br /&gt;One quirk of applying these principles to programming code is that there's a clear &lt;span style="font-weight: bold;"&gt;trade-off&lt;/span&gt; between them. Some hacked-together code might be very proportionate to what it's supposed to do, but in the process it might become obfuscated and therefore violate the clarity principle. I must say that I've often found it difficult to understand my own Perl code after a couple of weeks of not looking at it. 
Most of this can be attributed to the fact that the language itself pushes you to write less clear code ($_, the way objects are implemented, ...).&lt;br /&gt;&lt;br /&gt;When you see &lt;span style="font-style: italic;"&gt;smelly code&lt;/span&gt;, chances are that one of these three is the problem.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Why does this matter?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To paraphrase Rousseau: &lt;span style="font-weight: bold;"&gt;Good code is beautiful code in action&lt;/span&gt;. Creating beautiful code (by the above principles) is no guarantee of good code, but it's a pretty good start...&lt;br /&gt;&lt;br /&gt;Note: There's a &lt;a href="http://www.oreilly.com/catalog/9780596510046/"&gt;book on Beautiful Code&lt;/a&gt; as well, and apparently there's a chapter by Lincoln Stein on &lt;a href="http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Graphics.html"&gt;his Bio::Graphics library&lt;/a&gt;. Better get my hands on that book for ideas for the ruby version.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-6175920613140986429?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/6175920613140986429/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/10/what-makes-code-beautiful.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6175920613140986429'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6175920613140986429'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/10/what-makes-code-beautiful.html' title='What makes code 
beautiful?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-1959665191405054990</id><published>2007-11-05T17:02:00.000+01:00</published><updated>2007-11-06T12:02:38.869+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><title type='text'>Named arguments in ruby</title><content type='html'>One of the main disadvantages of using ruby that I bump into is the absence of &lt;span style="font-weight: bold;"&gt;named arguments&lt;/span&gt; (or &lt;span style="font-weight: bold;"&gt;keyword parameters&lt;/span&gt;). That's no problem for methods taking just two or three arguments, but it does get confusing when you have to be able to pass more than that.&lt;br /&gt;&lt;br /&gt;For example, the &lt;a href="http://bio-graphics.rubyforge.org/classes/Bio/Graphics/Panel/Track/Feature.html#M000011"&gt;Bio::Graphics::Panel::Track::Feature#new&lt;/a&gt; method takes six arguments: the track it belongs to, a label, a type, the location, an array of subfeatures and a url. (Note: there's only four in the released version. Label and subfeatures are worked on at the moment...) When you get to this number of arguments, deciding what the &lt;span style="font-weight: bold;"&gt;order&lt;/span&gt; should be is almost more difficult than writing the actual code... You really have to start thinking about how the class will be used most often. 
If the user only has a track and a url, it's a bit unsightly to make him/her create a new feature by typing&lt;br /&gt;&lt;pre name="code" class="ruby:nocontrols"&gt;&lt;br /&gt;my_feature = Feature.new(my_track, nil, nil, nil, [], 'http://www.google.com')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Compare that to&lt;br /&gt;&lt;pre name="code" class="ruby:nocontrols"&gt;&lt;br /&gt;my_feature = Feature.new(:track =&gt; my_track, :url =&gt; 'http://www.google.com')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The way to get this to work is to &lt;span style="font-weight:bold;"&gt;accept a hash&lt;/span&gt; in the method definition. There are a few gotchas, however. For instance, you can't define the &lt;span style="font-weight:bold;"&gt;defaults&lt;/span&gt; in the argument list directly.&lt;br /&gt;&lt;br /&gt;The following code snippet almost works, but not quite: as soon as the caller passes any hash at all, it replaces the default hash completely, so every key that was left out ends up as nil instead of its default.&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class MyClass&lt;br /&gt;  def initialize(options = {:name =&gt; 'unknown', :size =&gt; 0})&lt;br /&gt;    @name = options[:name]&lt;br /&gt;    @size = options[:size]&lt;br /&gt;  end&lt;br /&gt;  attr_accessor :name, :size&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;But you can set the defaults like this:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;class MyClass&lt;br /&gt;  def initialize(options = {})&lt;br /&gt;    @name = options[:name] || 'unknown'&lt;br /&gt;    @size = options[:size] || 0&lt;br /&gt;  end&lt;br /&gt;  attr_accessor :name, :size&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In addition, you have to check if all passed keys &lt;span style="font-weight:bold;"&gt;make sense&lt;/span&gt;. What if a user called the MyClass#initialize method above with an :age key? It's easy enough to catch that in your #initialize definition, but you have to remember to do that...&lt;br /&gt;&lt;br /&gt;Another major disadvantage is the very fact that you need a workaround in the first place. 
This makes it a no-no if you want to integrate your classes into a bigger framework. There's no way I should use this approach for &lt;a href="http://bio-graphics.rubyforge.org"&gt;Bio::Graphics&lt;/a&gt; if it were to be integrated with &lt;a href="http://www.bioruby.org"&gt;bioruby&lt;/a&gt; later. That would result in inconsistency in calling different classes within bioruby, which is the &lt;span style="font-style:italic;"&gt;last&lt;/span&gt; thing you want...&lt;br /&gt;&lt;br /&gt;Let's just hope ruby 2.0 will allow for using keyword parameters. (Pretty please...)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-1959665191405054990?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/1959665191405054990/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/11/named-arguments-in-ruby.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1959665191405054990'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1959665191405054990'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/11/named-arguments-in-ruby.html' title='Named arguments in ruby'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-2866940296759492348</id><published>2007-10-10T17:49:00.000+02:00</published><updated>2007-10-15T10:49:13.256+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioruby'/><title type='text'>The state of bioruby (or: how can bioruby grow?)</title><content type='html'>A number of people asked me recently about the usability of ruby/bioruby and if it would be worthwhile for them to take the plunge and investigate &lt;a href="http://www.bioruby.org/"&gt;bioruby&lt;/a&gt; more. So I thought writing up here would be a good idea...&lt;br /&gt;&lt;br /&gt;First a disclaimer: this is &lt;span style="font-weight: bold;"&gt;my own personal view&lt;/span&gt; on bioruby, based on experiences in the last year-and-a-half. In addition, this is about the &lt;span style="font-weight: bold;"&gt;bioruby project,&lt;/span&gt; &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; the code or the people.&lt;br /&gt;&lt;br /&gt;Let's first see what bioruby does. It's a library of ruby classes and modules that can be used in biological -omics research. Just like ruby itself, bioruby's origins lie in Japan. Version 0.5.0 was released in 2003 and we're at 1.1.0 now. A brief and incomplete overview of what's covered by the library:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;sequences&lt;/li&gt;&lt;li&gt;locations&lt;/li&gt;&lt;li&gt;pathways&lt;/li&gt;&lt;li&gt;alignments&lt;/li&gt;&lt;li&gt;trees&lt;/li&gt;&lt;li&gt;databases: GenBank, RefSeq, Ensembl, KEGG, ...&lt;br /&gt;&lt;/li&gt;&lt;li&gt;applications: fasta, BLAST, HMMER, clustalw, sim4, spidey, ...&lt;/li&gt;&lt;li&gt;...&lt;/li&gt;&lt;/ul&gt;Now why this post? Because I believe adoption and use of bioruby could be much improved. Is bioruby dead? Far from it. 
I think it's more like it is growing out of its clothes as any toddler does when it's getting older.&lt;br /&gt;&lt;br /&gt;So &lt;span style="font-weight: bold;"&gt;what's the problem&lt;/span&gt;? It's not the quality of the code. It's not that too much stuff is missing. It's a sub-optimal level of communication (between users, from users to core-developers and from core-developers to users) and the low visibility of the project.&lt;br /&gt;&lt;br /&gt;How can bioruby be taken forward? Somewhere in March, I sent some suggestions to the bioruby &lt;a href="http://lists.open-bio.org/pipermail/bioruby/2007-March/thread.html"&gt;mailing list&lt;/a&gt; in response to a post by a very frustrated Trevor Wennblom. What it basically boils down to is to get organized and bring bioruby much more to the forefront. So what options exist?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; font-weight: bold;"&gt;Getting bioruby organized&lt;/span&gt;&lt;br /&gt;First of all, it wouldn't be bad if there were a (mixed American/European/Japanese) &lt;span style="font-weight: bold;"&gt;board&lt;/span&gt;-like little group of people (3 or 4) who would be able to take the executive decisions on releases and what new modules should be incorporated in the bioruby library (after discussions on the mailing list, obviously). This would take a lot of weight off the shoulders of Toshiaki Katayama who now has almost sole responsibility (and stress) for this. Having this done by a small &lt;span style="font-style: italic;"&gt;group&lt;/span&gt; of people would relieve him of some of that stress.&lt;br /&gt;&lt;br /&gt;Secondly, we need something of a playground for &lt;span style="font-weight: bold;"&gt;experimental modules&lt;/span&gt;. Call it bioruby-edge or whatever. These could be any modules that are not ready to go into the core bioruby, but are already really useful. When they reach a good enough quality, they can be moved to bioruby itself. 
There is already a &lt;a href="http://rubyforge.org/projects/bioruby-annex"&gt;bioruby-annex&lt;/a&gt; project at rubyforge, but according to its description it's only meant to hold rails plugins. (However, I was told to put my &lt;a href="http://saaientist.blogspot.com/2007/09/graphics-genomics-and-ruby.html"&gt;Ensembl API&lt;/a&gt; there as well...)&lt;br /&gt;&lt;br /&gt;In addition, it would be good if the &lt;a href="http://rubyforge.org/projects/bioruby/"&gt;rubyforge project website&lt;/a&gt; were used for &lt;span style="font-weight: bold;"&gt;feature requests and bug reports&lt;/span&gt;. This would then be the one-stop shop for the development.&lt;br /&gt;&lt;br /&gt;And of course there is the &lt;span style="font-weight: bold;"&gt;documentation&lt;/span&gt;. I think we did a good thing in that big push to document the API in 2006, but the community needs more. The bioruby website already hosts the &lt;a href="http://dev.bioruby.org/wiki/en/?BioRuby+in+Anger"&gt;BioRuby in Anger&lt;/a&gt; documentation written by Toshiaki and Pjotr Prins. That's great stuff for quick lookups and I often use that information (especially the sequence IO; I can never remember it. Must be an APOE4 mutation.). It would be nice, though, if it were worked out a bit more. Take a look at the BioPerl documentation. I've always found the &lt;a href="http://www.bioperl.org/wiki/HOWTOs"&gt;howto&lt;/a&gt;s really helpful: they go a bit deeper into how the code works as well.&lt;br /&gt;The rubyforge system provides &lt;span style="font-weight: bold;"&gt;wiki&lt;/span&gt; functionality for its projects, which is apparently not activated for bioruby. 
There &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; &lt;a href="http://bioruby-doc.org/"&gt;bioruby-doc&lt;/a&gt; maintained by Trevor, but I think it would be good to keep the core things together: put the wiki on rubyforge.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Letting people know about bioruby&lt;/span&gt;&lt;br /&gt;Tremendous things are happening in the bioruby code (e.g. the rails thing), but &lt;span style="font-style: italic;"&gt;we just don't know about it&lt;/span&gt;. Let alone what might be in store for the future. In addition, we know that the library exists, but the community who uses it has up till now not been really talkative about what they used it for. What we need here is &lt;span style="font-weight: bold;"&gt;communication&lt;/span&gt; in all directions.&lt;br /&gt;First of all: a &lt;span style="font-weight: bold;"&gt;paper&lt;/span&gt; in a medium/high-profile journal. And sooner rather than later. This could serve as the starting block for building a wider bioruby community.&lt;br /&gt;&lt;br /&gt;Secondly, we need to let each other know &lt;span style="font-style: italic;"&gt;what&lt;/span&gt; we're doing with bioruby and &lt;span style="font-style: italic;"&gt;how&lt;/span&gt; we're using it. I'm talking &lt;span style="font-weight: bold;"&gt;blogs&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;social networks&lt;/span&gt; here. I hope the blog you're reading at the moment might be a small contribution. Both the end-users and the core-developers should get their thoughts and work out in the open. Reports by &lt;span style="font-weight: bold;"&gt;end-users&lt;/span&gt; will keep the developers a bit on their toes and can highlight things that can be improved in bioruby. &lt;span style="font-weight: bold;"&gt;Core-developers&lt;/span&gt; could on the other hand shed a light on what's in store for bioruby in the future. How do you yourself use the code? 
Are you contemplating something great? We'd like to know what you're planning... I got a reply from Toshiaki about writing a blog, and he mentioned that it's not straightforward to do that in English. I do understand that that's a hurdle, but what I'd say to everyone having that issue: no problem. So let it be English "with hair on" (ooh, hair-rising-on-my-back translated literally from Dutch, but you hopefully get the idea). It's about us getting the big picture. Not about reading poetry. If we get the meaning, that's the main point.&lt;br /&gt;&lt;br /&gt;The core-developers have done and are doing a great job. Respect. The only thing is that this toddler is now grown enough to want to play outside and will need additional clothes for that.&lt;br /&gt;&lt;br /&gt;Ruby has so much to offer for bioinformatics as it has tremendous functionality and is yet so simple to code in. It would be a shame if the bioinformatics community can not capitalize on that.&lt;br /&gt;&lt;br /&gt;Of course, I'd be very interested in your comments. Let's start talking! Especially about how to start that social network.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Note: while writing this entry, there were actually two messages sent to the bioruby mailing list asking for more documentation and easier access to new users (October 10, 2007). One of these messages stated "...BioRuby docs should have a version of more readable/easy-to-use format for beginners apart from the API stuff". Quod erat demonstrandum.&lt;br /&gt;&lt;br /&gt;Second note: there's been some comments on the mailing list about the fact that this post was too much of a criticism to the original contributors to bioruby. That's not what I intended to do. Instead, it was my intention to look at options on how to take bioruby forward and let it grow from its small niche today to a more widely accepted toolkit. 
I've changed some phrasing in the text to hopefully make sure that that intention is clear (including changing the title).&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-2866940296759492348?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/2866940296759492348/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/10/state-of-bioruby.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2866940296759492348'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/2866940296759492348'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/10/state-of-bioruby.html' title='The state of bioruby (or: how can bioruby grow?)'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-6811666915741818848</id><published>2007-10-09T11:28:00.000+02:00</published><updated>2007-11-06T21:31:33.739+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='productivity'/><category scheme='http://www.blogger.com/atom/ns#' term='organization'/><title type='text'>Using rake to manage your software project</title><content type='html'>Do you have some of those projects where you have to be sure that 
you jump through the same hoops every time you edit some code? Take a look at the &lt;a href="http://saaientist.blogspot.com/2007/09/graphics-genomics-and-ruby.html"&gt;bio-graphics&lt;/a&gt; code. Every time I change anything in the code, I have to do the following things:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;regenerate the RDoc documentation&lt;/li&gt;&lt;li&gt;regenerate the ruby gem&lt;/li&gt;&lt;li&gt;check SVN status&lt;/li&gt;&lt;li&gt;do an SVN update&lt;/li&gt;&lt;li&gt;perform the SVN commit&lt;/li&gt;&lt;li&gt;upload the new documentation to the website&lt;/li&gt;&lt;/ol&gt;That's a prime candidate for rake. &lt;a style="font-weight: bold;" href="http://rake.rubyforge.org/"&gt;Rake&lt;/a&gt; does the same job as &lt;a href="http://www.gnu.org/software/make/manual/make.html"&gt;GNU make&lt;/a&gt;, which is &lt;span style="font-weight: bold;"&gt;dependency-based programming&lt;/span&gt;. The major advantage for us over GNU make is of course that it uses ruby syntax. By dependency-based programming, I mean that some tasks rely on other ones. GNU make is best known for managing the compilation of source files. But you can do other stuff with it as well: if I want to commit to SVN, I want to make sure that the latest RDoc has been generated as well as a new gem. Therefore, you can have the 'SVN commit' task &lt;span style="font-style: italic;"&gt;depend&lt;/span&gt; on the 'generate RDoc' and 'generate gem' tasks. And the task 'generate RDoc' will &lt;span style="font-style: italic;"&gt;depend&lt;/span&gt; on the freshness of the actual library files.&lt;br /&gt;&lt;br /&gt;How does this work? You basically create a file containing tasks, the &lt;span style="font-weight: bold;"&gt;Rakefile&lt;/span&gt;, and tell rake to execute one or more of them. 
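&lt;br /&gt;&lt;br /&gt;To make the idea of tasks depending on other tasks concrete, here is a minimal sketch. It is not the Bio::Graphics Rakefile: the task names (docs, gem, commit) are made up for illustration, and each task only records that it ran instead of calling rdoc, gem or svn. It assumes the rake gem is installed and that the code is run as a plain ruby script rather than through the rake command.

```ruby
require 'rake'

# Keeps track of the order in which the hypothetical tasks run.
order = []

task :docs do
  order.push(:docs)     # stands in for "generate RDoc"
end

task :gem do
  order.push(:gem)      # stands in for "generate gem"
end

# :commit names its prerequisites; rake runs those before the task body.
task :commit => [:docs, :gem] do
  order.push(:commit)   # stands in for "SVN commit"
end

Rake::Task[:commit].invoke
puts order.inspect      # prints [:docs, :gem, :commit]
```

Asking for commit runs docs and gem first, in the order listed, and rake runs each task at most once per session, so invoking commit again does nothing.&lt;br /&gt;&lt;br /&gt;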
There are several good tutorials on rake, like the one from &lt;a href="http://martinfowler.com/articles/rake.html"&gt;Martin Fowler&lt;/a&gt; and from the &lt;a href="http://www.railsenvy.com/2007/6/11/ruby-on-rails-rake-tutorial"&gt;Rails Envy&lt;/a&gt; guys. I'm not going into the nitty-gritty of how they're written. These tutorials are much better at that. What I &lt;span style="font-style: italic;"&gt;will&lt;/span&gt; do here, is describe the Rakefile I use for Bio::Graphics. (Someone already asked in the comments on my post on using &lt;a href="http://saaientist.blogspot.com/2007/06/databases-and-ruby-without-rails.html"&gt;ActiveRecord outside of rails&lt;/a&gt; what the Rakefile was that I used. Actually, the one mentioned in that post was empty and just a place holder.)&lt;br /&gt;&lt;br /&gt;Without further ado, here it is:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;# &lt;br /&gt;# Rakefile.rb&lt;br /&gt;# &lt;br /&gt;# Copyright (C)::   Jan Aerts &lt;jan.aerts@bbsrc.ac.uk&gt;&lt;br /&gt;# License::         The Ruby License&lt;br /&gt;# &lt;br /&gt;require 'rake'&lt;br /&gt;require 'rake/testtask'&lt;br /&gt;require 'rake/rdoctask'&lt;br /&gt;&lt;br /&gt;task :default =&gt; :svn_commit&lt;br /&gt;&lt;br /&gt;file_list = Dir.glob("lib/**/*.rb")&lt;br /&gt;&lt;br /&gt;desc "Create RDoc documentation"&lt;br /&gt;file 'doc/index.html' =&gt; file_list do&lt;br /&gt;  puts "######## Creating RDoc documentation"&lt;br /&gt;  system "rdoc --title 'Bio::Graphics documentation' -m TUTORIAL TUTORIAL README.DEV lib/"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;desc "An alias for creating the RDoc documentation"&lt;br /&gt;task :rdoc do&lt;br /&gt;  Rake::Task['doc/index.html'].invoke&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;desc "Create a new gem"&lt;br /&gt;file 'bio-graphics-1.0.gem' =&gt; file_list do&lt;br /&gt;  puts "######## Creating new gem"&lt;br /&gt;  system "gem build bio-graphics.gemspec"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;desc "An alias for 
creating the gem"&lt;br /&gt;task :create_gem do&lt;br /&gt;  Rake::Task['bio-graphics-1.0.gem'].invoke&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;desc "Check SVN status"&lt;br /&gt;task :check_svn_status do&lt;br /&gt;  puts "######## Checking SVN status"&lt;br /&gt;  message = String.new&lt;br /&gt;  message &lt;&lt; "# SVN status requires manual intervention\n"&lt;br /&gt;  message &lt;&lt; "# For items with '?': either svn add or svn propedit svn:ignore\n"&lt;br /&gt;  message &lt;&lt; "# For items with '~': don't know yet\n"&lt;br /&gt;  message &lt;&lt; "# Please see http://svnbook.red-bean.com/en/1.4/svn-book.html#svn.ref.svn.c.status"&lt;br /&gt;&lt;br /&gt;  output = `svn status`&lt;br /&gt;  puts output&lt;br /&gt;&lt;br /&gt;  allowed_status = ['A','D','M','R','X','I'] # See http://svnbook.red-bean.com/en/1.4/svn-book.html#svn.ref.svn.c.status&lt;br /&gt;&lt;br /&gt;  # each_line instead of each: String#each is gone as of ruby 1.9&lt;br /&gt;  output.each_line do |line|&lt;br /&gt;    status = line.slice(0,1)&lt;br /&gt;    if ! allowed_status.include?(status)&lt;br /&gt;      raise message&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;desc "Check if SVN updates available"&lt;br /&gt;task :check_svn_update do&lt;br /&gt;  puts "######## Checking SVN update"&lt;br /&gt;  output = `svn update`&lt;br /&gt;  puts output&lt;br /&gt;  if output !~ /^At revision [0-9]/&lt;br /&gt;    raise "Please update your working copy first"&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;desc "Commit to SVN repository"&lt;br /&gt;task :svn_commit =&gt; [:check_svn_update, :check_svn_status, :create_gem, :rdoc] do&lt;br /&gt;  puts "######## Doing SVN commit"&lt;br /&gt;  system 'svn commit'&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;rake -T&lt;/span&gt; lists all &lt;span style="font-weight: bold;"&gt;available tasks&lt;/span&gt;:&lt;br /&gt;&lt;pre&gt;rake bio-graphics-1.0.gem  # Create a new gem&lt;br /&gt;rake check_svn_status      # Check SVN status&lt;br 
/&gt;rake check_svn_update      # Check if SVN updates available&lt;br /&gt;rake create_gem            # An alias for creating the gem&lt;br /&gt;rake doc/index.html        # Create RDoc documentation&lt;br /&gt;rake rdoc                  # An alias for creating the RDoc documentation&lt;br /&gt;rake svn_commit            # Commit to SVN repository&lt;/pre&gt;The &lt;span style="font-style: italic;"&gt;file_list&lt;/span&gt; at the top contains all files in the library itself, and will be used to check all &lt;span style="font-weight: bold;"&gt;timestamps&lt;/span&gt;. The &lt;span style="font-style: italic;"&gt;file 'doc/index.html'&lt;/span&gt; task looks at the timestamp of the index.html file and if it's older than any of the files in &lt;span style="font-style: italic;"&gt;file_list&lt;/span&gt;, it will regenerate the documentation. If it's newer, nothing happens. Same goes for bio-graphics-1.0.gem.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-style: italic;"&gt;check_svn_update&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;check_svn_status&lt;/span&gt; tasks just check if subversion needs some manual intervention before you can commit. This should catch conflicts between the working copy and the repository, or files that you forgot to add to SVN.&lt;br /&gt;&lt;br /&gt;Note: why didn't I use the special &lt;span style="font-weight: bold;"&gt;Rake::RDocTask&lt;/span&gt; instead of the one I use here? 
Because the built-in RDoc task first removes your whole &lt;span style="font-style: italic;"&gt;doc&lt;/span&gt; directory, also deleting the subversion metadata contained in it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-6811666915741818848?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/6811666915741818848/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/10/using-rake-to-manage-your-software.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6811666915741818848'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6811666915741818848'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/10/using-rake-to-manage-your-software.html' title='Using rake to manage your software project'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-9172254439157746022</id><published>2007-09-27T10:57:00.000+02:00</published><updated>2007-12-17T15:00:58.212+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rails'/><category scheme='http://www.blogger.com/atom/ns#' term='graphics'/><category scheme='http://www.blogger.com/atom/ns#' term='ensembl'/><title type='text'>Bio::Graphics and rails</title><content type='html'>As a follow up to my post on &lt;a 
href="http://saaientist.blogspot.com/2007/09/graphics-genomics-and-ruby.html"&gt;Bio::Graphics&lt;/a&gt;, I tried integrating this library in a &lt;span style="font-weight: bold;"&gt;rails&lt;/span&gt; application. After all, you'd get your data either from a file (like GFF) or a database. And let me tell you: it took me just 30 minutes or so to get a proof-of-concept running. This included installing rails itself, creating the rails app, creating the database, loading dummy data, and doing the coding itself. That 30 minutes was interrupted for a couple of hours, because I needed some advice from Kouhei Sutou, the author of &lt;a href="http://cairographics.org/rcairo/"&gt;rcairo&lt;/a&gt;, on how to write PNG images in memory instead of to a file.&lt;br /&gt;&lt;br /&gt;So how do you do it? The proof-of-concept little &lt;span style="font-weight: bold;"&gt;database&lt;/span&gt; I created contained 3 tables:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;chromosomes (columns: id, name, length)&lt;/li&gt;&lt;li&gt;tracks (columns: id, name, glyph, colour)&lt;/li&gt;&lt;li&gt;features (columns: id, chromosome_id, track_id, name, location, url)&lt;/li&gt;&lt;/ul&gt;Create some features for a couple of different tracks for a particular chromosome.&lt;br /&gt;&lt;br /&gt;In &lt;span style="font-weight: bold;"&gt;views/chromosomes/show.rhtml&lt;/span&gt;, add the following line:&lt;br /&gt;&lt;pre name="code" class="rails"&gt;&lt;br /&gt;&lt;%= @chromosome.to_png %&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;My &lt;span style="font-weight: bold;"&gt;models/chromosome.rb&lt;/span&gt; looks like this:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require 'stringio'&lt;br /&gt;require 'base64'&lt;br /&gt;require_gem 'bio-graphics'&lt;br /&gt;&lt;br /&gt;class Chromosome &lt; ActiveRecord::Base&lt;br /&gt;has_many :features&lt;br /&gt;has_many :tracks, :through =&gt; :features&lt;br /&gt;&lt;br /&gt;def to_png(width = 800, start = 1, stop = self.length)&lt;br /&gt; return %{&lt;img 
src="data:image/png;base64,#{Base64.encode64(self.draw(width,start,stop))}"&gt;}&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;def draw(width, start, stop)&lt;br /&gt; panel = Bio::Graphics::Panel.new(self.length, width, false, start, stop)&lt;br /&gt; track_container = Hash.new&lt;br /&gt; self.tracks.each do |track|&lt;br /&gt;   if ! track_container.has_key?(track.name)&lt;br /&gt;     track_container[track.name] = panel.add_track(track.name, track.colour.split(',').collect{|i| i.to_i}, track.glyph)&lt;br /&gt;   end&lt;br /&gt; end&lt;br /&gt;&lt;br /&gt; self.features.each do |feature|&lt;br /&gt;   track_container[feature.track.name].add_feature(feature.name, feature.location)&lt;br /&gt; end&lt;br /&gt;&lt;br /&gt; output = StringIO.new&lt;br /&gt; panel.draw(output)&lt;br /&gt; return output.string&lt;br /&gt;end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;UPDATE&lt;/span&gt;: Apparently, Blogger does not allow me to paste the correct code above. In the to_png method, replace the following ascii codes:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;%7B with {&lt;/li&gt;&lt;br /&gt;&lt;li&gt;%28 with (&lt;/li&gt;&lt;br /&gt;&lt;li&gt;%29 with )&lt;/li&gt;&lt;br /&gt;&lt;li&gt;%7D with }&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;And that's it. I leave the integration of my ensembl-api, bio-graphics and rails as an exercise for the reader. We could make a ruby version of the &lt;a href="http://www.ensembl.org/Homo_sapiens/contigview?seq_region_right=247249719&amp;amp;seq_region_name=1&amp;amp;click_right=490&amp;amp;click_left=40&amp;amp;seq_region_left=1&amp;amp;seq_region_width=100000&amp;amp;vclick.x=113&amp;amp;vclick.y=155"&gt;Ensembl browser&lt;/a&gt;... and then: world domination. 
Mwahaha.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-9172254439157746022?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/9172254439157746022/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/09/biographics-and-rails.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9172254439157746022'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9172254439157746022'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/09/biographics-and-rails.html' title='Bio::Graphics and rails'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-1970622132054307180</id><published>2007-09-20T14:03:00.001+02:00</published><updated>2008-12-15T09:51:54.919+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='rails'/><category scheme='http://www.blogger.com/atom/ns#' term='graphics'/><category scheme='http://www.blogger.com/atom/ns#' term='ensembl'/><title type='text'>Graphics, genomics and ruby</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/RvO_Ksv9XNI/AAAAAAAAAC8/8JhXCs0YYJk/s1600-h/my_panel.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_t6Ob1J7aZ0A/RvO_Ksv9XNI/AAAAAAAAAC8/8JhXCs0YYJk/s400/my_panel.png" alt="" id="BLOGGER_PHOTO_ID_5112640192527555794" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Having known and used the &lt;span style="font-weight: bold;"&gt;Generic Genome Browser&lt;/span&gt; (aka gbrowse, see &lt;a href="http://www.gmod.org/wiki/index.php/Gbrowse"&gt;here&lt;/a&gt;) for years now, it occurred to me a while ago that it should be oh so simple to create the same functionality with a much easier setup if we could use ruby instead of perl.&lt;br /&gt;&lt;br /&gt;Gbrowse depends on &lt;a href="http://www.bioperl.org/"&gt;bioperl&lt;/a&gt;'s &lt;span style="font-weight: bold;"&gt;Bio::Graphics&lt;/span&gt; module. Although gbrowse has been instrumental for many people's research, it does take a bit of work to get it installed. Apart from bioperl, it depends on Apache for showing the results in a browser. Compare that to any &lt;span style="font-weight: bold;"&gt;Rails&lt;/span&gt; application, where you basically just need ruby and a "gem install rails". I've created rails applications in the past that contain exactly the kind of data that would typically be visualized by something like gbrowse. Takes no time at all to set up and you can even get away with writing virtually no code. And no Apache to be installed, or configuration files that you can't access because you're not root.&lt;br /&gt;&lt;br /&gt;Such a rails application makes it possible to browse, edit and delete the data. The problem comes with the visualization bit. There's &lt;span style="font-weight: bold;"&gt;no bioruby graphics library&lt;/span&gt; (yet?) that automatically parses features on a reference and creates a nice picture of where your genes are on that chromosome. 
Of course, the genes should be clickable so you can link through to NCBI or Ensembl.&lt;br /&gt;&lt;br /&gt;I've spent some time in the last year creating such a Bio::Graphics thing for ruby. I wanted it to behave the same as the one from bioperl: there is one &lt;span style="font-weight: bold;"&gt;panel&lt;/span&gt; that has one or more &lt;span style="font-weight: bold;"&gt;track&lt;/span&gt;s, and each track has &lt;span style="font-weight: bold;"&gt;feature&lt;/span&gt;s on it. Even though it was quite easy to create a proof-of-concept library, the most difficult part was actually finding the right &lt;span style="font-weight: bold;"&gt;backend&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;What should I use to create the pictures themselves? As I'd worked with SVG before, that seemed the right way to go. Downloaded a library from http://raa.ruby-lang.org/project/ruby-svg/ and got a prototype running quite easily. Problem: I needed an SVG viewer or firefox to actually view the picture, and zooming in/out screwed up all text. So after weeks of digging around, I found rcairo, a ruby binding to &lt;span style="font-weight: bold;"&gt;Cairo&lt;/span&gt;. Migrating to this backend was easy peasy and the pictures look really nice (see at the top). Unfortunately, it's impossible to create clickable glyphs using Cairo itself, but that can be easily worked around by creating an HTML file with the image map. 
That's exactly what gbrowse does as well, isn't it?&lt;br /&gt;&lt;br /&gt;The picture at the top has been created using the following simple script:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;g = BioExt::Graphics::Panel.new(800, 1200, true, 1, 610)&lt;br /&gt;&lt;br /&gt;track1 = g.add_track('generic')&lt;br /&gt;track2 = g.add_track('directed',[0,1,0],'directed_generic')&lt;br /&gt;track3 = g.add_track('triangle',[0.5, 0.5, 0.5],'triangle')&lt;br /&gt;track4 = g.add_track('spliced',[1,0,0],'spliced')&lt;br /&gt;track5 = g.add_track('directed_spliced',[1,0,1],'directed_spliced')&lt;br /&gt;&lt;br /&gt;track1.add_feature('bla1','250..375', 'http://www.newsforge.com')&lt;br /&gt;track1.add_feature('bla2','54..124', 'http://www.thearkdb.org')&lt;br /&gt;track1.add_feature('bla3','100..449', 'http://www.google.com')&lt;br /&gt;&lt;br /&gt;track2.add_feature('bla4','50..60', 'http://www.google.com')&lt;br /&gt;track2.add_feature('bla5','complement(80..120)', 'http://www.sourceforge.net')&lt;br /&gt;&lt;br /&gt;track3.add_feature('piep','56')&lt;br /&gt;track3.add_feature('bla','103', 'http://digg.com')&lt;br /&gt;&lt;br /&gt;track4.add_feature('gene1','join(34..52,109..183)','http://news.bbc.co.uk')&lt;br /&gt;track4.add_feature('gene2','complement(join(170..231,264..299,350..360,409..445))')&lt;br /&gt;track4.add_feature('gene3','join(134..152,209..283)')&lt;br /&gt;&lt;br /&gt;track5.add_feature('gene1','join(34..52,109..183)', 'http://www.vrtnieuws.net')&lt;br /&gt;track5.add_feature('gene2','complement(join(170..231,264..299,350..360,409..445))','http://www.roslin.ac.uk')&lt;br /&gt;track5.add_feature('gene3','join(134..152,209..283)')&lt;br /&gt;&lt;br /&gt;g.draw('my_panel.png')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What happens here?&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Line 1&lt;/span&gt;: Create a new panel for a sequence of 800 bp, with the picture being 1200 points wide. 
Make all glyphs clickable if a URL is defined (the &lt;span style="font-style: italic;"&gt;true&lt;/span&gt;), and zoom into the region from 1 to 610 bp.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Lines 3-6&lt;/span&gt;: Create different tracks, each with a name, a colour (in RGB at the moment) and a type.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Lines 8-24&lt;/span&gt;: Add features to those tracks, each with a name, a locus and an optional URL to link out to external websites. Notice how it handles spliced features and features on the reverse strand?&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Line 26&lt;/span&gt;: Create the PNG (and in this case: also HTML) file.&lt;br /&gt;&lt;br /&gt;Here's a nicer way to produce the same type of output:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;#Initialize graphic for a nucleotide sequence of 600 bp&lt;br /&gt;my_panel = BioExt::Graphics::Panel.new(1000, 1200, false, 1, 600)&lt;br /&gt;&lt;br /&gt;#Create and configure tracks&lt;br /&gt;track_SNP = my_panel.add_track('SNP')&lt;br /&gt;track_gene = my_panel.add_track('gene')&lt;br /&gt;track_transcript = my_panel.add_track('transcript')&lt;br /&gt;&lt;br /&gt;track_SNP.feature_colour = [1,0,0]&lt;br /&gt;track_SNP.feature_glyph = 'triangle'&lt;br /&gt;track_gene.feature_glyph = 'directed_spliced'&lt;br /&gt;track_transcript.feature_glyph = 'spliced'&lt;br /&gt;track_transcript.feature_colour = [0,0.5,0]&lt;br /&gt;&lt;br /&gt;# Add data to tracks&lt;br /&gt;DATA.each do |line|&lt;br /&gt; line.chomp!&lt;br /&gt; ref, type, name, location, link = line.split(/\s+/)&lt;br /&gt; if link == ''&lt;br /&gt;   link = nil&lt;br /&gt; end&lt;br /&gt; if type == 'SNP'&lt;br /&gt;   track_SNP.add_feature(name, location, link)&lt;br /&gt; elsif type == 'gene'&lt;br /&gt;   track_gene.add_feature(name, location, link)&lt;br /&gt; elsif type == 'transcript'&lt;br /&gt;   track_transcript.add_feature(name, location, link)&lt;br /&gt; end&lt;br 
/&gt;end&lt;br /&gt;&lt;br /&gt;# And draw&lt;br /&gt;my_panel.draw('my_panel.png')&lt;br /&gt;&lt;br /&gt;__END__&lt;br /&gt;chr1  gene        CYP2D6      complement(80..120)&lt;br /&gt;chr1  gene        ALDH        100..449&lt;br /&gt;chr1  SNP         rs1234      107&lt;br /&gt;chr1  gene        bla         complement(400..430)&lt;br /&gt;chr1  SNP         rs9876      44&lt;br /&gt;chr1  gene        some_gene   complement(join(170..231,264..299,350..360,409..445))&lt;br /&gt;chr1  transcript  transcript1 join(250..300,390..425)&lt;br /&gt;chr1  transcript  transcript2 253..330&lt;br /&gt;chr1  transcript  transcript3 266..344&lt;br /&gt;chr1  transcript  transcript4 complement(join(410..430,239..286,129..151))&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;If someone would actually be interested in getting the library behind this, just let me know. It should be really easy to incorporate this in a rails application where the data are actually stored in a database.&lt;br /&gt;&lt;br /&gt;I wonder what if any role _why's &lt;a href="http://hackety.org/2007/08/02/mashingInSomeGraphics.html"&gt;Shoes&lt;/a&gt; thing would/could play...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE&lt;/span&gt;: This library has now been improved a bit and is hosted on &lt;span style="font-weight: bold;"&gt;rubyforge&lt;/span&gt;. You can find a tutorial and the whole API documentation at &lt;a href="http://bio-graphics.rubyforge.org/"&gt;http://bio-graphics.rubyforge.org&lt;/a&gt;. You can find instructions on how to install and use it over there.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE TWO&lt;/span&gt;: Forget the previous update. I have moved the bio-graphics code to &lt;span style="font-weight: bold;"&gt;github&lt;/span&gt;. See &lt;a href="http://github.com/jandot/bio-graphics"&gt;http://github.com/jandot/bio-graphics&lt;/a&gt;. 
That should make it much easier to fork the code and get more input from other developers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-1970622132054307180?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/1970622132054307180/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/09/graphics-genomics-and-ruby.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1970622132054307180'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1970622132054307180'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/09/graphics-genomics-and-ruby.html' title='Graphics, genomics and ruby'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_t6Ob1J7aZ0A/RvO_Ksv9XNI/AAAAAAAAAC8/8JhXCs0YYJk/s72-c/my_panel.png' height='72' width='72'/><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-6410148109417932019</id><published>2007-09-04T15:08:00.000+02:00</published><updated>2007-11-06T14:27:23.655+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='ActiveRecord'/><title type='text'>ActiveRecord - all vs all relationships</title><content type='html'>&lt;span 
style="font-weight: bold;"&gt;Modeling genetics or genomics data&lt;/span&gt; presents its own challenges. One of the issues is that the actual &lt;span style="font-weight: bold;"&gt;definition&lt;/span&gt; of things changes over time. A database system can only be based on the scientific knowledge at the time of conception. The prime example of course is the definition of a &lt;span style="font-style: italic;"&gt;gene&lt;/span&gt; over the years. Before 1997, it was believed that the vast majority of genes encoded proteins. As a result, 'genes' tables in databases typically had columns to store information on the start and stop codon. However, it became clear that many genes actually do not encode proteins, forcing the remodeling of biological databases. But that's not the topic of this post.&lt;br /&gt;&lt;br /&gt;What &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; the topic here, is how &lt;span style="font-weight: bold;"&gt;relationships&lt;/span&gt; can be stored in a database. Suppose I want to store mapping data: markers mapped to linkage groups, clones mapped to physical maps, ... 
Markers are stored in a &lt;span style="font-style: italic;"&gt;markers&lt;/span&gt; table, clones are stored in a &lt;span style="font-style: italic;"&gt;clones&lt;/span&gt; table, linkage groups in a &lt;span style="font-style: italic;"&gt;linkage_groups&lt;/span&gt; table; you get the point.&lt;br /&gt;&lt;br /&gt;The database that I'm working with at the moment (and only have read-access to), stores the mappings in a &lt;span style="font-style: italic;"&gt;mappings&lt;/span&gt; table which includes the following columns:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;map_type&lt;/li&gt;&lt;li&gt;map_name&lt;/li&gt;&lt;li&gt;mapped_object_type&lt;/li&gt;&lt;li&gt;mapped_object_name&lt;/li&gt;&lt;/ul&gt;So records could look like:&lt;br /&gt;&lt;pre&gt; map_type       map_id  map_name      mapped_object_type  mapped_object_id  mapped_object_name&lt;br /&gt;--------------+-------+-------------+-------------------+-----------------+------------------&lt;br /&gt;chromosome     1       chromosome_1  marker              1                 marker_A&lt;br /&gt;chromosome     1       chromosome_1  marker              2                 marker_B&lt;br /&gt;physical_map   2       ctg1          clone               1                 clone_A&lt;br /&gt;physical_map   3       ctg2          clone               2                 clone_B&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;To make things worse, markers can also be mapped to clones. This means that any clone can act as a marker, but also as a map at the same time.&lt;br /&gt;&lt;pre&gt; map_type       map_id  map_name      mapped_object_type  mapped_object_id  mapped_object_name&lt;br /&gt;--------------+-------+-------------+-------------------+-----------------+------------------&lt;br /&gt;clone          1       clone_A       marker              1                 marker_A&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;How can I model this in ActiveRecord? 
There's the concept of &lt;span style="font-weight: bold;"&gt;polymorphic associations&lt;/span&gt; in ActiveRecord, which could solve this relationship nightmare if there were only one thing in the mappings table that's polymorphic. But as it happens, there are &lt;span style="font-style: italic;"&gt;two&lt;/span&gt;... Evan Weaver wrote this rails plugin &lt;a href="http://blog.evanweaver.com/files/doc/fauna/has_many_polymorphs/files/README.html"&gt;has_many_polymorphs&lt;/a&gt;, which should do the trick (see &lt;a href="http://m.onkey.org/2007/8/14/excuse-me-wtf-is-polymorphs"&gt;here&lt;/a&gt; for a tutorial and background if it's unclear what I'm talking about). Unfortunately, as it is focussed on rails and not on ActiveRecord in general, it doesn't handle &lt;span style="font-weight: bold;"&gt;namespaces&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;So here's what I've come up with:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;module MyNameSpace&lt;br /&gt;  class Mapping &lt; ActiveRecord::Base&lt;br /&gt;    # Relationships to feature-like things&lt;br /&gt;    belongs_to :marker, :foreign_key =&gt; 'mapped_object_id', :conditions =&gt; ["mapped_object_type = 'marker'"]&lt;br /&gt;    # A clone can be a feature or a map, so each role gets its own association name&lt;br /&gt;    belongs_to :clone_as_feature, :class_name =&gt; 'Clone', :foreign_key =&gt; 'mapped_object_id', :conditions =&gt; ["mapped_object_type = 'clone'"]&lt;br /&gt;&lt;br /&gt;    # Relationships to map-like things&lt;br /&gt;    belongs_to :chromosome, :foreign_key =&gt; 'map_id', :conditions =&gt; ["map_type = 'chromosome'"]&lt;br /&gt;    belongs_to :physical_map, :foreign_key =&gt; 'map_id', :conditions =&gt; ["map_type = 'physical_map'"]&lt;br /&gt;    belongs_to :clone_as_map, :class_name =&gt; 'Clone', :foreign_key =&gt; 'map_id', :conditions =&gt; ["map_type = 'clone'"]&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  class Marker &lt; ActiveRecord::Base&lt;br /&gt;    has_many :mappings_as_feature, :class_name =&gt; 'Mapping', :foreign_key =&gt; 'mapped_object_id', :conditions =&gt; "mapped_object_type = 'marker'"&lt;br /&gt;    has_many :chromosomes, :through =&gt; :mappings_as_feature&lt;br /&gt;    has_many :clones, :through =&gt; :mappings_as_feature, :source =&gt; :clone_as_map&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  class Chromosome &lt; ActiveRecord::Base&lt;br /&gt;    has_many :mappings_as_map, :class_name =&gt; 'Mapping', :foreign_key =&gt; 'map_id', :conditions =&gt; "map_type = 'chromosome'"&lt;br /&gt;    has_many :markers, :through =&gt; :mappings_as_map&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  class PhysicalMap &lt; ActiveRecord::Base&lt;br /&gt;    has_many :mappings_as_map, :class_name =&gt; 'Mapping', :foreign_key =&gt; 'map_id', :conditions =&gt; "map_type = 'physical_map'"&lt;br /&gt;    has_many :clones, :through =&gt; :mappings_as_map, :source =&gt; :clone_as_feature&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  class Clone &lt; ActiveRecord::Base&lt;br /&gt;    # Relationships where the clone is the feature&lt;br /&gt;    has_many :mappings_as_feature, :class_name =&gt; 'Mapping', :foreign_key =&gt; 'mapped_object_id', :conditions =&gt; "mapped_object_type = 'clone'"&lt;br /&gt;    has_many :physical_maps, :through =&gt; :mappings_as_feature&lt;br /&gt;&lt;br /&gt;    # Relationships where the clone is the map&lt;br /&gt;    has_many :mappings_as_map, :class_name =&gt; 'Mapping', :foreign_key =&gt; 'map_id', :conditions =&gt; "map_type = 'clone'"&lt;br /&gt;    has_many :markers, :through =&gt; :mappings_as_map&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The key here is to distinguish between mappings_as_feature and mappings_as_map. 
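&lt;br /&gt;&lt;br /&gt;To make the two roles concrete without a database, here is a small plain-Ruby sketch. The Row struct and the rows_as_feature / rows_as_map helpers are made up for this illustration (they are not part of ActiveRecord or of the models above); the rows simply repeat the example records shown earlier in this post.

```ruby
# Plain-Ruby stand-in for the mappings table; no ActiveRecord or database needed.
# Row and both helper methods are hypothetical names used only in this sketch.
Row = Struct.new(:map_type, :map_id, :mapped_object_type, :mapped_object_id)

ROWS = [
  Row.new('chromosome',   1, 'marker', 1),
  Row.new('chromosome',   1, 'marker', 2),
  Row.new('physical_map', 2, 'clone',  1),
  Row.new('physical_map', 3, 'clone',  2),
  Row.new('clone',        1, 'marker', 1)
]

# Rows in which the given object is the thing being mapped (the feature role).
def rows_as_feature(rows, type, id)
  rows.select { |r| r.mapped_object_type == type and r.mapped_object_id == id }
end

# Rows in which the given object is the thing mapped onto (the map role).
def rows_as_map(rows, type, id)
  rows.select { |r| r.map_type == type and r.map_id == id }
end

# clone_A (id 1) shows up in both roles; marker_A (id 1) only as a feature.
puts rows_as_feature(ROWS, 'clone', 1).map { |r| r.map_type }.inspect
puts rows_as_map(ROWS, 'clone', 1).map { |r| r.mapped_object_type }.inspect
puts rows_as_map(ROWS, 'marker', 1).size
```
&lt;br /&gt;&lt;br /&gt;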
A marker object can only have mappings where it acts as a feature, while a clone can both have mappings where it acts as a feature and where it acts as a map.&lt;br /&gt;&lt;br /&gt;Using this code, it's now possible to do:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;clone = Clone.find_by_name('clone_A')&lt;br /&gt;puts clone.mappings_as_map.to_yaml&lt;br /&gt;puts clone.mappings_as_feature.to_yaml&lt;br /&gt;puts clone.markers.to_yaml&lt;br /&gt;puts clone.physical_maps.to_yaml&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Voila (until further notice...).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE&lt;/span&gt;: Pratik blogged about has_many_polymorphs and lists the generated associations &lt;a href="http://m.onkey.org/2007/8/14/excuse-me-wtf-is-polymorphs"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-6410148109417932019?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/6410148109417932019/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/09/activerecord-all-vs-all-relationships.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6410148109417932019'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6410148109417932019'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/09/activerecord-all-vs-all-relationships.html' title='ActiveRecord - all vs all relationships'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' 
height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-1508905487735265409</id><published>2007-08-14T14:36:00.001+02:00</published><updated>2008-12-11T17:54:06.149+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='ActiveRecord'/><category scheme='http://www.blogger.com/atom/ns#' term='ensembl'/><title type='text'>A ruby API to the Ensembl database</title><content type='html'>"Joy to the world, lalaa la laaaa." I can finally announce that I've released the &lt;span style="font-weight: bold;"&gt;ruby API to the Ensembl core database&lt;/span&gt; under the bioruby-annex umbrella. Go &lt;a href="http://bioruby-annex.rubyforge.org/"&gt;here&lt;/a&gt; for the release.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;What is it?&lt;/span&gt;&lt;br /&gt;The &lt;a style="font-weight: bold;" href="http://www.ensembl.org/"&gt;Ensembl&lt;/a&gt; database stores &lt;span style="font-weight: bold;"&gt;genetic and genomic data&lt;/span&gt; on a variety of species: sequences of chromosomes and positions of features such as genes and polymorphisms. This data is browseable using their genome browser, but is also directly accessible if you connect to their mysql database. A perl API to that database has been available from the start and is used by the ensembl people themselves to handle the data. A java implementation (called Ensj) is also available, but I don't know the status of that one. The ruby version should provide similar functionality to the perl API, albeit for querying only and not for writing to the database.&lt;br /&gt;&lt;br /&gt;This API is aimed at the core database. 
Ensembl also provides the variation and compara databases, but these are not the focus of the current API implementation.&lt;br /&gt;&lt;br /&gt;A minimal interface to the data of Ensembl was already available through Mitsuteru Nakao's ensembl.rb library in the &lt;a href="http://www.bioruby.org/"&gt;bioruby&lt;/a&gt; project, and is based on the exportview functionality of Ensembl's web interface. Although &lt;span style="font-style: italic;"&gt;very&lt;/span&gt; useful, it does not give the full functionality that can be achieved by accessing the database directly.&lt;br /&gt;&lt;br /&gt;The ruby API basically provides two things: access to the data in the database, and transformations of those data.&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Access to the data&lt;/span&gt;. (Virtually) all tables of the database are available through ActiveRecord, with all the automated query methods associated with that ('find_by_anything_you_like'). Say you want to get the object of a transcript with stable_id "ENST00000380593", you'd do&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;transcript = Ensembl::Core::Transcript.find_by_stable_id('ENST00000380593')&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Transformations of the data&lt;/span&gt;. You might have the coordinates of a gene on the chromosome, but actually want them on a contig or supercontig. This is where the Sliceable#transform and Slice#project methods come in. In contrast to the perl API, there is no Sliceable#transfer method, because my interpretation of a 'slice' is slightly different from the perl implementation. 
Read the &lt;a href="http://bioruby-annex.rubyforge.org/"&gt;tutorial&lt;/a&gt; for more information.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Minimal script&lt;/span&gt;&lt;br /&gt;Any script using the API has to go through these steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;require the library&lt;/li&gt;&lt;li&gt;include the Ensembl::Core namespace (not strictly necessary, but saves typing)&lt;/li&gt;&lt;li&gt;connect to the database&lt;/li&gt;&lt;li&gt;start doing stuff&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;So for example:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;#!/usr/bin/ruby&lt;br /&gt;require 'rubygems'&lt;br /&gt;require_gem 'ensembl-api'&lt;br /&gt;&lt;br /&gt;include Ensembl::Core&lt;br /&gt;&lt;br /&gt;CoreDBConnection.connect('homo_sapiens')&lt;br /&gt;&lt;br /&gt;transcript = Transcript.find_by_stable_id('ENST00000380593')&lt;br /&gt;puts "5'UTR: " + transcript.five_prime_utr_seq&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;How to install&lt;/span&gt;&lt;br /&gt;The API has been released as a gem file, which you can either download from the &lt;a href="http://rubyforge.org/projects/bioruby-annex/"&gt;website&lt;/a&gt; and install using the command&lt;br /&gt;&lt;pre&gt;gem install ensembl-api-0.9.gem&lt;/pre&gt;&lt;br /&gt;or export from the Subversion repository using the command&lt;br /&gt;&lt;pre&gt;svn export svn://rubyforge.org/var/svn/bioruby-annex/ensembl-api&lt;/pre&gt;This gem depends on bioruby and ActiveRecord.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE: The code has been moved from rubyforge to github. Get it from http://github.com/jandot/ruby-ensembl-api&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Documentation&lt;/span&gt;&lt;br /&gt;Check the &lt;a href="http://bioruby-annex.rubyforge.org/"&gt;website&lt;/a&gt; at rubyforge, which will show the tutorial (based on the perl version) and the rdoc documentation. 
In addition, there are the tests in your gem directory, plus a sample script that shows all functionality of the perl-version of the API called examples_perl_tutorial.rb.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Credits&lt;/span&gt;&lt;br /&gt;I owe a lot to the Ensembl core team for helping me out when I was at the Ensembl site as a "Geek for a Week"...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Call for help&lt;/span&gt;&lt;br /&gt;If anyone would be interested in improving the API, don't hesitate to contact me. At the moment, for example, projections between coordinate systems only work if they're directly linked in the assembly table, and projections of the haplotype assembly_exceptions will now raise a NotImplementedError error. In addition, it would be very useful if we could add the variation and compara databases to the API.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-1508905487735265409?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/1508905487735265409/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/08/ruby-api-to-ensembl-database.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1508905487735265409'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/1508905487735265409'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/08/ruby-api-to-ensembl-database.html' title='A ruby API to the Ensembl database'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-8903766013340969076</id><published>2007-07-30T14:05:00.000+02:00</published><updated>2007-11-06T14:32:55.455+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='ActiveRecord'/><category scheme='http://www.blogger.com/atom/ns#' term='ensembl'/><title type='text'>ActiveRecord and mysql: show my databases</title><content type='html'>Working on a ruby API for the Ensembl databases, I bumped into the issue of having to &lt;span style="font-weight: bold;"&gt;connect to a database without knowing its name&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The ensembl database server hosts databases for each species. Every two months or so, there's a new release which means a new database for every single species. 
To see what databases are there, you can log in to the ensembl server with mysql:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;mysql -h ensembldb.ensembl.org -u anonymous&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(for more information, see &lt;a href="http://www.ensembl.org/info/software/registry/index.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;The command "&lt;span style="font-style: italic;"&gt;show databases;&lt;/span&gt;" on the mysql prompt lists a total of 1035 databases at the moment; a short selection looks like this:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;bos_taurus_core_41_2d&lt;br /&gt;bos_taurus_core_42_2e&lt;br /&gt;bos_taurus_core_43_3&lt;br /&gt;bos_taurus_core_44_3a&lt;br /&gt;bos_taurus_core_45_3b&lt;br /&gt;bos_taurus_est_36_2&lt;br /&gt;bos_taurus_est_37_2a&lt;br /&gt;homo_sapiens_core_36_35i&lt;br /&gt;homo_sapiens_core_37_35j&lt;br /&gt;homo_sapiens_core_38_36&lt;br /&gt;homo_sapiens_core_39_36a&lt;br /&gt;homo_sapiens_core_40_36b&lt;br /&gt;homo_sapiens_core_41_36c&lt;br /&gt;homo_sapiens_core_42_36d&lt;br /&gt;homo_sapiens_core_43_36e&lt;br /&gt;homo_sapiens_core_44_36f&lt;br /&gt;homo_sapiens_core_45_36g&lt;br /&gt;homo_sapiens_core_expression_est_34_35g&lt;br /&gt;homo_sapiens_core_expression_est_45_36g&lt;br /&gt;homo_sapiens_core_expression_gnf_34_35g&lt;br /&gt;homo_sapiens_core_expression_gnf_45_36g&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To connect to the homo_sapiens_core_45_36g database, type "&lt;span style="font-style: italic;"&gt;use homo_sapiens_core_45_36g;&lt;/span&gt;" at the mysql prompt. Since all 'core' databases have the same database schema, the API applies to all of these species and just has to connect to different databases. But how do you go about doing that? What you &lt;span style="font-style: italic;"&gt;could&lt;/span&gt; do, is provide the full database name in the &lt;span style="font-style: italic;"&gt;establish_connection&lt;/span&gt; statement. 
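&lt;br /&gt;&lt;br /&gt;As a rough sketch of that first option, the connection settings with a hard-coded database name could look like the hash below. The constant FULL_NAME_SETTINGS and the helper release_of are made up for this illustration; with ActiveRecord loaded, a hash like this is what would be handed to establish_connection. The helper assumes the species_core_release_assembly naming pattern seen in the listing above.

```ruby
# Connection settings with the full database name spelled out.
# FULL_NAME_SETTINGS is a hypothetical name used only in this sketch.
FULL_NAME_SETTINGS = {
  :adapter  => 'mysql',
  :host     => 'ensembldb.ensembl.org',
  :database => 'homo_sapiens_core_45_36g',
  :username => 'anonymous',
  :password => ''
}

# Hypothetical helper: pull the release number out of a core database name,
# assuming names of the form species_core_release_assembly.
# Returns nil when the name does not follow that pattern.
def release_of(db_name)
  match = /_core_(\d+)_/.match(db_name)
  match ? match[1].to_i : nil
end

puts FULL_NAME_SETTINGS[:database]
puts release_of(FULL_NAME_SETTINGS[:database])
```

The release number is baked into the name, so a script written this way has to be edited at every Ensembl release.&lt;br /&gt;&lt;br /&gt;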
But having to memorize these full names, or having to open mysql connections prior to writing your scripts is, to say the least, far from optimal. But how do you query a database system without connecting to a particular database?&lt;br /&gt;&lt;br /&gt;Basically, you make a connection to the host without specifying a database, and send the raw sql query "show databases;" over that connection. The code below does just that.&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;ENSEMBL_RELEASE = 45&lt;br /&gt;DB_ADAPTER = 'mysql'&lt;br /&gt;DB_HOST = 'ensembldb.ensembl.org'&lt;br /&gt;DB_USERNAME = 'anonymous'&lt;br /&gt;DB_PASSWORD = ''&lt;br /&gt;&lt;br /&gt;class DummyDBConnection &lt; ActiveRecord::Base&lt;br /&gt;  self.abstract_class = true&lt;br /&gt;&lt;br /&gt;  establish_connection(&lt;br /&gt;                      :adapter =&gt; DB_ADAPTER,&lt;br /&gt;                      :host =&gt; DB_HOST,&lt;br /&gt;                      :database =&gt; '',&lt;br /&gt;                      :username =&gt; DB_USERNAME,&lt;br /&gt;                      :password =&gt; DB_PASSWORD&lt;br /&gt;                    )&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;class CoreDBConnection &lt; ActiveRecord::Base&lt;br /&gt;  self.abstract_class = true&lt;br /&gt;&lt;br /&gt;  def self.connect(species)&lt;br /&gt;    db_name = DummyDBConnection.connection.select_values('show databases').select{|v| v =~ /#{species}_core_#{ENSEMBL_RELEASE.to_s}/}[0]&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;    if db_name.nil?&lt;br /&gt;      warn "WARNING: No connection to database established. 
Check that the species is in snake_case (was: #{species})."&lt;br /&gt;    else&lt;br /&gt;      establish_connection(&lt;br /&gt;                          :adapter =&gt; DB_ADAPTER,&lt;br /&gt;                          :host =&gt; DB_HOST,&lt;br /&gt;                          :database =&gt; db_name,&lt;br /&gt;                          :username =&gt; DB_USERNAME,&lt;br /&gt;                          :password =&gt; DB_PASSWORD&lt;br /&gt;                        )&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;And then just have your classes (e.g. CoordSystem, SeqRegion, Gene) inherit from CoreDBConnection instead of ActiveRecord::Base.&lt;br /&gt;&lt;br /&gt;To make the actual connection, start your script with:&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;CoreDBConnection.connect('bos_taurus')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I'm currently at Ensembl for a week ("Geek for a Week") to work on the full-blown ruby API, and am planning to give an introduction on how to use it in one of the later posts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-8903766013340969076?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/8903766013340969076/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/07/activerecord-and-mysql-show-my.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8903766013340969076'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8903766013340969076'/><link rel='alternate' type='text/html' 
href='http://saaientist.blogspot.com/2007/07/activerecord-and-mysql-show-my.html' title='ActiveRecord and mysql: show my databases'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-8621471167292493501</id><published>2007-07-25T21:23:00.000+02:00</published><updated>2007-07-26T14:26:50.925+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='productivity'/><category scheme='http://www.blogger.com/atom/ns#' term='organization'/><category scheme='http://www.blogger.com/atom/ns#' term='literature'/><title type='text'>How do you process literature?</title><content type='html'>A quick glance at the side of my desk reveals two stacks of manuscripts to read; each stack about 20cm high. Sounds familiar? There seems to be a major task in front of me to process all that.&lt;br /&gt;First thing to do is to identify what caused those piles in the first place. The answer: no system that I'm satisfied with for reference management. Of course, there is software like Reference Manager and EndNote as well as websites like Connotea and CiteULike. But they all have one major flaw: they are not suited to store the knowledge gained from those papers. Entering a reference to those papers in the software is not the same as going through them and extracting useful information. Sure, they do have a notes field where you can jot down some short remarks, but often knowledge is much easier recorded and remembered in little graphs and pictures than in words. 
There's &lt;span style="font-weight: bold;"&gt;reference management&lt;/span&gt;, and there's &lt;span style="font-weight: bold;"&gt;knowledge management&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;What do I want my system to look like? First of all, it should be &lt;span style="font-weight: bold;"&gt;searchable&lt;/span&gt;. The tagging system provided by CiteULike/Connotea seems good for that. Also (and this might seem illogical for a bioinformatician), the system should not be fully automatic or even electronic, but &lt;span style="font-weight: bold;"&gt;analog&lt;/span&gt;. Why? Just pressing a button to for example add the abstract from a paper to the system gives a sense of... what's the word in English: volatility? For some things you should use the help of a computer, and for some you shouldn't. There's a difference between using Excel to repeat the same calculation 50 times, and trying to use a pc for storing knowledge. It's &lt;span style="font-style: italic;"&gt;me&lt;/span&gt; who needs to store that knowledge, not the computer. If that was the case, I could always go back to Google instead of making the effort of using a reference manager in the first place. I've played around with &lt;a href="http://www.zotero.org/"&gt;zotero&lt;/a&gt; and personal &lt;a href="http://www.tiddlywiki.com/"&gt;wikis&lt;/a&gt; in the past, and they just didn't do the trick: I still ended up just copy-pasting the information instead of absorbing it.&lt;br /&gt;&lt;br /&gt;Another advantage of using an analog system, is that when you feel your productivity behind your computer is suboptimal, you can always take your cards, find yourself a quiet place, put your feet on a desk, and flick through the things you wrote down. Slippers and pipe are optional.&lt;br /&gt;&lt;br /&gt;During my PhD a few years ago, I used a system that was exclusively based on &lt;span style="font-weight: bold;"&gt;index cards&lt;/span&gt;. 
The inspiration came from Umberto Eco's book "&lt;span style="font-style: italic;"&gt;Come si fa una tesi di laurea&lt;/span&gt;" (Or "&lt;span style="font-style: italic;"&gt;How to make a doctoral thesis&lt;/span&gt;") (1977) in which he explains how he handles the knowledge for his book research. For each manuscript, I'd make a new card. The front contained an identifier, the paper title, full reference and keywords. On the back I'd write down what I had to remember from that paper, including little graphs, schemas and stuff. I've got to admit that a drawback of using these cards was that they were not easily searchable, but linking them worked quite well with a bit of discipline.&lt;br /&gt;During those years, I used the index card system both as reference manager and as knowledgebase. Although it did work to satisfaction, the role of reference manager should be fulfilled by a better tool.&lt;br /&gt;&lt;br /&gt;Now how could I implement something like that into a &lt;span style="font-weight: bold;"&gt;workflow&lt;/span&gt;? Basically, any new paper to be read should be entered in &lt;span style="font-weight: bold;"&gt;CiteULike&lt;/span&gt; and tagged as 'to_read'. When I've got time to read it: see if it's necessary to print out, or preferably read from the screen (we want to be nice to the trees, don't we?). When I've read the manuscript and there is interesting information to remember, &lt;span style="font-style: italic;"&gt;only &lt;/span&gt;&lt;span style="font-style: italic;"&gt;then&lt;/span&gt; create an &lt;span style="font-weight: bold;"&gt;index card&lt;/span&gt;. 
In case it's a landmark paper and/or I've been adding a lot of comments and markings in the text: keep the &lt;span style="font-weight: bold;"&gt;printout&lt;/span&gt; as well, and mark the index card that I've got that printout as well.&lt;br /&gt;Let's try this out for a few weeks and see where it goes...&lt;br /&gt;&lt;br /&gt;BTW: for a knowledgebase system based on index cards taken to the extreme, see &lt;a href="http://pileofindexcards.org/blog/cluster/"&gt;PoIC&lt;/a&gt; (Pile of Index Cards).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-8621471167292493501?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/8621471167292493501/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/07/how-do-you-process-literature.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8621471167292493501'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8621471167292493501'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/07/how-do-you-process-literature.html' title='How do you process literature?'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-7176484152588783389</id><published>2007-07-18T12:39:00.000+02:00</published><updated>2007-12-13T14:25:31.756+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='data management'/><title type='text'>Documenting one-off parsers</title><content type='html'>A lot of day-to-day work consists of &lt;span style="font-weight: bold;"&gt;parsing&lt;/span&gt; data files to transform the contents from one format into another or to create statistics. However, when you have to get back to those scripts at a later stage - you need something similar in another project or you notice that something along the way must have gone horribly wrong - it can often be quite hard to figure out what the script actually did. Having meaningless script filenames like &lt;span style="font-style: italic;"&gt;ParseBlast.rb&lt;/span&gt; doesn't help either. (Parse BLAST into &lt;span style="font-style: italic;"&gt;what&lt;/span&gt;?)&lt;br /&gt;&lt;br /&gt;I must say that things improved a lot when I switched from &lt;span style="font-weight: bold;"&gt;perl&lt;/span&gt; to &lt;span style="font-weight: bold;"&gt;ruby&lt;/span&gt;. Trying to understand a perl script that I wrote a couple of weeks earlier was a real pain, while I normally have no problems understanding ruby code that I wrote months ago... Has a lot to do with the simple syntax and expressiveness (is that a word?) of that language.&lt;br /&gt;&lt;br /&gt;But just understanding &lt;span&gt;the code&lt;/span&gt; is not enough. 
To be able to assess if you could use a script from an earlier project in a new one, you often also have to get hold of information in that script that is not inherently encoded in its code: what was the project? What did the input look like? What did the output have to look like? That's where the &lt;span style="font-weight: bold;"&gt;script documentation&lt;/span&gt; comes in, because having that information easily available greatly reduces the time you need to assess if you should copy-paste-adapt the script, or just start from scratch.&lt;br /&gt;&lt;br /&gt;What I try to do, is always use the &lt;span style="font-weight: bold;"&gt;same template&lt;/span&gt; when starting a new parsing script. Even when the parsing itself would only take 5 lines, I try to use this whole template. A quick walk-through:&lt;br /&gt;&lt;br /&gt;Lines 2-32: The &lt;span&gt;documentation&lt;/span&gt;, consisting of:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;script name&lt;/li&gt;&lt;li&gt;short &lt;span&gt;usage&lt;/span&gt; message&lt;br /&gt;&lt;/li&gt;&lt;li&gt;a description of what the script does&lt;/li&gt;&lt;li&gt;list of &lt;span&gt;arguments &lt;/span&gt;that can be used&lt;/li&gt;&lt;li&gt;input format&lt;/li&gt;&lt;li&gt;output format&lt;/li&gt;&lt;li&gt;contact information&lt;/li&gt;&lt;/ul&gt;Lines 38-40: Definitions of &lt;span&gt;classes &lt;/span&gt;that are used later on in the script itself&lt;br /&gt;Lines 42-51: Parsing the &lt;span&gt;arguments &lt;/span&gt;to the script. If the user (i.e. &lt;span style="font-style: italic;"&gt;me&lt;/span&gt; when I run it) uses &lt;span style="font-family:Courier;"&gt;--help&lt;/span&gt; or an argument that does not exist, he automatically gets the documentation of the script. 
If I used the undefined &lt;span style="font-family:Courier;"&gt;-a&lt;/span&gt; option here, the output to the screen would be&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;USAGE&lt;br /&gt;-----&lt;br /&gt;./this_script.rb [ -h | --help ]&lt;br /&gt;                [ -i | --infile | &lt; ] your_input.txt&lt;br /&gt;                [ -o | --outfile | &gt; your_output.txt ]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Lines 53-58: Logging of the options that were used in running the script.&lt;br /&gt;Lines 60-70: Create the input and output streams.&lt;br /&gt;Lines 72-75: The actual parsing code. This is the bit that does the work (using the classes described in lines 38-40).&lt;br /&gt;Lines 77-79: Clean up.&lt;br /&gt;&lt;br /&gt;Here's the complete template.&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;#!/usr/bin/ruby&lt;br /&gt;# == NAME&lt;br /&gt;# this_script.rb&lt;br /&gt;#&lt;br /&gt;# == USAGE&lt;br /&gt;#  ./this_script.rb [ -h | --help ]&lt;br /&gt;#                   [ -i | --infile | &lt; ] your_input.txt&lt;br /&gt;#                   [ -o | --outfile | &gt; your_output.txt ]&lt;br /&gt;#&lt;br /&gt;# == DESCRIPTION&lt;br /&gt;# Description of what this script does...&lt;br /&gt;#&lt;br /&gt;# == OPTIONS&lt;br /&gt;#  -h,--help::                 Show help&lt;br /&gt;#  -i,--infile=INFILE::        Name of input file. STDIN if not defined.&lt;br /&gt;#  -o,--outfile=OUTFILE::      Name of output file. 
STDOUT if not defined.&lt;br /&gt;#&lt;br /&gt;# == FORMAT INPUT&lt;br /&gt;#   &gt;gi|4531835|bla&lt;br /&gt;#   ACTTACCGACCGACTGACTACTTATGCCA&lt;br /&gt;#   &gt;gi|4861534|blabla&lt;br /&gt;#   CTACCCCATCTACCGGGGCTCGACT&lt;br /&gt;#   ...&lt;br /&gt;#&lt;br /&gt;# == FORMAT OUTPUT&lt;br /&gt;#   4531835    29&lt;br /&gt;#   4861534    25&lt;br /&gt;#   ...&lt;br /&gt;#&lt;br /&gt;# == AUTHOR&lt;br /&gt;#   my full contact information&lt;br /&gt;&lt;br /&gt;require 'rdoc/usage'&lt;br /&gt;require 'optparse'&lt;br /&gt;require 'ostruct'&lt;br /&gt;require 'logger'&lt;br /&gt;&lt;br /&gt;### Define classes here&lt;br /&gt;class MyClass&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;### Get the script arguments and open relevant files&lt;br /&gt;options = OpenStruct.new()&lt;br /&gt;opts = OptionParser.new()&lt;br /&gt;opts.on("-h","--help",&lt;br /&gt;       "Display the usage information") {RDoc::usage}&lt;br /&gt;opts.on("-i","--infile", "=INFILE",&lt;br /&gt;       "Input file name") {|argument| options.infile = argument}&lt;br /&gt;opts.on("-o","--outfile", "=OUTFILE",&lt;br /&gt;       "Output file name") {|argument| options.outfile = argument}&lt;br /&gt;opts.parse! 
rescue RDoc::usage('usage')&lt;br /&gt;&lt;br /&gt;log = Logger.new(File.new('this_script.log', File::WRONLY | File::TRUNC | File::CREAT))&lt;br /&gt;log.level = Logger::INFO # or: DEBUG, WARN, ERROR, FATAL, UNKNOWN&lt;br /&gt;&lt;br /&gt;log.info('Script this_script.rb started')&lt;br /&gt;log.info('Options:')&lt;br /&gt;log.info(options.to_yaml)&lt;br /&gt;&lt;br /&gt;if options.infile&lt;br /&gt; input_stream = File.open(options.infile)&lt;br /&gt;else&lt;br /&gt; input_stream = $stdin&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;if options.outfile&lt;br /&gt; output_stream = File.new(options.outfile,'w')&lt;br /&gt;else&lt;br /&gt; output_stream = $stdout&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;### Actually do some stuff&lt;br /&gt;input_stream.each_line do |line|&lt;br /&gt; output_stream.puts line&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;### Wrap everything up&lt;br /&gt;output_stream.close&lt;br /&gt;input_stream.close&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(If I wanted this a bit DRYer, I'd have a constant with the script name at the top so I wouldn't have to repeat that filename over and over. To be done...)&lt;br /&gt;&lt;br /&gt;If I wanted to over-organize, I'd create a little database with the descriptions of those scripts so I could search through them. Even though that would have its use from time to time, that would be taking it too far. Having that documentation and a standard way of providing the command line arguments is enough for me for now.&lt;br /&gt;&lt;br /&gt;I must admit I let this slip in the last couple of months, which doesn't mean it didn't work the months before that. 
That's just what happens when you go on holiday and completely forget the habit afterwards.&lt;br /&gt;&lt;br /&gt;UPDATE: InfiniteRed blogged about a &lt;a href="http://www.infinitered.com/blog/?p=21"&gt;similar approach&lt;/a&gt; later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-7176484152588783389?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/7176484152588783389/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/07/documenting-one-off-parsers.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/7176484152588783389'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/7176484152588783389'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/07/documenting-one-off-parsers.html' title='Documenting one-off parsers'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-22712799284333136</id><published>2007-07-11T15:26:00.000+02:00</published><updated>2007-07-11T15:35:43.881+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>A choice of databases, or PostGres vs SQLite</title><content type='html'>When &lt;span style='font-weight: bold;'&gt;managing data&lt;/span&gt; in bioinformatics, one of the 
main tools you use is &lt;b&gt;databases&lt;/b&gt; to store the stuff. In many cases, flatfiles are sufficient, but sometimes you need the flexibility and reliability of a nice relational database. (Note: Excel is &lt;i&gt;not&lt;/i&gt; a database.) Actually, in quite a few cases it &lt;i&gt;does &lt;/i&gt;make sense to use flat text files instead of databases, but that's the subject of another post.&lt;br /&gt;&lt;br /&gt;As there is a plethora of RDBMS (relational database management systems) available, the first thing you have to ask yourself is which system to use. There's &lt;a href='http://www.mysql.org'&gt;mysql&lt;/a&gt;, &lt;a href='http://www.oracle.com'&gt;oracle&lt;/a&gt; and others. But for the purpose of this post, I'll focus on &lt;span style='font-weight: bold;'&gt;PostgreSQL &lt;/span&gt;(&lt;a href='http://www.postgresql.org'&gt;http://www.postgresql.org&lt;/a&gt;) and &lt;span style='font-weight: bold;'&gt;SQLite3 &lt;/span&gt;(&lt;a href='http://www.sqlite.org'&gt;http://www.sqlite.org&lt;/a&gt;), because that's what I normally use.&lt;br /&gt;&lt;br /&gt;For the big projects with a lot of data, I normally use postgres, while smaller projects are typically served by sqlite databases. Both, however, have their strengths and weaknesses. Which one to choose depends on a number of factors:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Installation&lt;/b&gt;. The main difference between postgres and sqlite3 is that postgres runs as a server and sqlite3 does not. To be able to use postgres, you'll have to install the software, create a postgres user on your system and have that user start the postgres server. The next thing to do is create a user within the database system and grant it the right permissions (e.g. to create other users or databases). That's a bit more complex than installing sqlite3, which has no server or users to set up.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Database creation.&lt;/b&gt; To create a new database in PostgreSQL, you issue a &lt;font face='Courier'&gt;createdb MyDatabase&lt;/font&gt;. 
This will add the database to your RDBMS. You can get into the database with &lt;font face='Courier'&gt;psql MyDatabase&lt;/font&gt;. In contrast, sqlite3 databases are just files like any other (a single binary file per database). Create a new database by passing its filename to the command: &lt;font face='Courier'&gt;sqlite3 my_database.s3db&lt;/font&gt;. This will create a file in your current directory called &lt;font face='Courier'&gt;my_database.s3db&lt;/font&gt;. Note: you don't have to use the &lt;font face='Courier'&gt;.s3db&lt;/font&gt; extension, but that makes it recognizable for the SQLiteAdmin tool (see below).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Use.&lt;/b&gt; Once you've got your databases created, their use is really similar. On the command line, use the &lt;font face='Courier'&gt;psql&lt;/font&gt; or &lt;font face='Courier'&gt;sqlite3&lt;/font&gt; command. On Windows, you can use e.g. &lt;a href='http://www.pgadmin.org/'&gt;pgAdmin&lt;/a&gt; and &lt;a href='http://sqliteadmin.orbmu2k.de/'&gt;SQLiteAdmin&lt;/a&gt;. &lt;br/&gt;&lt;br /&gt;I must say that &lt;font face='Courier'&gt;psql&lt;/font&gt; is a bit easier on day-to-day typing, because it's got tab-completion, which I haven't been able to activate yet in &lt;font face='Courier'&gt;sqlite3&lt;/font&gt;. I should spend some time one day to configure sqlite exactly the way I want it.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Locking.&lt;/b&gt; This is what bit me in the ankles with sqlite3... According to the website: "SQLite version 2.8 allowed multiple simultaneous readers or a single writer but not both. SQLite version 3.0 allows one process to begin writing the database while other processes continue to read. The writer must still obtain an exclusive lock on the database for a brief interval in order to commit its changes, but the exclusive lock is no longer required for the entire write operation." Indeed, I could load hundreds/thousands of records while simultaneously querying the database. 
But at what looked like a random moment, after hours of loading, it would choke, leaving me with a partially populated database. Any small child can tell you that this is &lt;span style="font-style:italic;"&gt;not&lt;/span&gt; what you want to happen. I've never encountered something like that in postgres, and probably never will.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Backup.&lt;/b&gt; Because sqlite is file-based, making a backup of a database is nothing more than taking a copy and moving that copy to another hard disk or burning it on a CD-ROM. To make backups of postgres databases, we typically use &lt;font face='Courier'&gt;pg_dump&lt;/font&gt;, which dumps the contents (and, if you want, the schema as well) of a database into a big-ass text file. So the backup is &lt;i&gt;outside&lt;/i&gt; of the RDBMS itself.&lt;br /&gt;&lt;br /&gt;So both RDBMS have their advantages and disadvantages. For huge projects, I tend to use PostGres, although I've started to use sqlite more and more lately. If only it wouldn't choke on the locking...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-22712799284333136?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/22712799284333136/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/07/choice-of-databases-or-postgres-vs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/22712799284333136'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/22712799284333136'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/07/choice-of-databases-or-postgres-vs.html' title='A choice of databases, or PostGres 
vs SQLite'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-9116095459192816519</id><published>2007-07-05T22:11:00.001+02:00</published><updated>2007-07-06T10:40:32.297+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bioinformatics'/><category scheme='http://www.blogger.com/atom/ns#' term='organization'/><title type='text'>Bioinformatics and labbooks</title><content type='html'>What I've really struggled with in the first years after I ended up in bioinformatics (still in denial back then), was how to record what I'd done. Researchers in a wet-lab environment typically use a &lt;b&gt;labbook&lt;/b&gt; in which they write down what protocols they used for an experiment, together with the results and nice pictures of their PCR electrophoresis gels. This is a bit more difficult for bioinformaticians. Why? I think because the physical actions that are required to do the research are less pronounced. Filling out a PCR plate, putting it into a PCR machine and loading the result on a gel is a bit more effort than pressing &lt;i&gt;Enter&lt;/i&gt;. When we code, what we actually do, is create the PCR machine hardware. The programming language acts as the nuts and bolts used to create that machine. The effort for the bioinformatician is in building the hardware, not running the tests themselves (that's the sweat and tears of the servers, not ours).&lt;br /&gt;&lt;br /&gt;So I was pleasantly surprised when I arrived at my present job 2-and-a-bit years ago when they told me they even had an &lt;b&gt;SOP&lt;/b&gt; on how to perform and record bioinformatics work. 
Now SOPs can be restrictive and more about paperwork than actually helping you out, but coming from an environment where everything (or rather: nothing) goes regarding logging bioinformatics work, this piece of paper was a big relief.&lt;br /&gt;&lt;br /&gt;The main principle of a bioinformatics SOP is that you have to be able to record all steps you followed to &lt;b&gt;transform one piece of information into another&lt;/b&gt;. Same as what a lab-oriented SOP does: "how do you get from a BAC library and a pair of PCR primers to knowing which BAC is positive for that marker". But of course there are other things to record than in a lab. The key requirement is that we have to be able to prove that the data we use are what we say they are. Same goes for the scripts.&lt;br /&gt;&lt;br /&gt;The &lt;b&gt;input&lt;/b&gt; data. The starting point for most of my dry-lab experiments is data that I downloaded from the internet. So what do I (have to) record? The URL of the file and the &lt;b&gt;md5sum &lt;/b&gt;of the file once it's downloaded. That md5sum makes sure that, in case I have to reanalyze, I can check if that big sequencing centre has changed the contents of their files or not (of course without telling us).&lt;br /&gt;&lt;br /&gt;The &lt;b&gt;scripts&lt;/b&gt;. The same reasoning applies here. Apart from the fact that it's always a good thing to document your scripts, you also have to be able to prove afterwards that that actually was the script you used, and not just some other piece of code with the same name. This can be done with md5sums again.&lt;br /&gt;It does happen that I log the md5sum and still change the script afterwards (e.g. setting constants to another value). As long as that is logged as well, we can reconstruct the original script. Of course it is better to circumvent that problem by creating &lt;b&gt;generic code&lt;/b&gt;, but sometimes the effort of doing just that outweighs the benefits gained. (Might be a subject of a later post.) 
Logging the md5sum of the script is like logging the manufacturer and type of PCR machine you used.&lt;br /&gt;&lt;br /&gt;The &lt;b&gt;output&lt;/b&gt;. The combination of identity-checked input and scripts should &lt;i&gt;always &lt;/i&gt;produce the same output. For completeness' sake, the SOP tells us to also log the md5sum again...&lt;br /&gt;&lt;br /&gt;The system around it. So how do we actually do the &lt;b&gt;recording itself&lt;/b&gt;? Where I work, the end product has to be a filled-in labbook. What we do is use a task tracker (&lt;a href="http://bestpractical.com/rt/"&gt;RT Request Tracker&lt;/a&gt;) to, well, track our tasks (duh). The moment we start a new project (let's define that here the GTD-way as anything that consists of more than one physical action), we create a new ticket (e.g. "Identify syntenic region between species A and B") and log everything in there: the project directory, background information, md5sums, workflows, interpretation of the results. When all is finished, we make a hard copy (well: print it out) and glue it into our labbooks.&lt;br /&gt;&lt;br /&gt;In some settings it might be more sensible to log things in a wiki, as explained by Mike at bioinformaticszen (&lt;a href="http://www.bioinformaticszen.com/2007/04/use-a-hyperlinked-document-as-a-bioinformatics-lab-book/"&gt;here&lt;/a&gt;), where he talks about using a &lt;b&gt;hyperlinked document&lt;/b&gt; or a &lt;b&gt;wiki&lt;/b&gt; to track what you've done. Of course it can make sense in many environments (i.e. if you don't have to care about audits and are tracking stuff merely for yourself), but the moment you have to be able to present stuff for audits or patent applications, it's critical that the documentation you generated is immutable. Don't get me wrong: I'm a big fan of the wiki-way (and use one at work as well), but not as a labbook. 
What I do find a wiki useful for is &lt;b&gt;operating guidelines&lt;/b&gt;, which are allowed to change over time.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-9116095459192816519?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/9116095459192816519/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/07/bioinformatics-and-labbooks.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9116095459192816519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/9116095459192816519'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/07/bioinformatics-and-labbooks.html' title='Bioinformatics and labbooks'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-8746127397352645110</id><published>2007-07-03T14:43:00.000+02:00</published><updated>2008-12-09T19:02:29.094+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='productivity'/><category scheme='http://www.blogger.com/atom/ns#' term='GTD'/><title type='text'>Six months of Getting Things Done</title><content type='html'>It's the start of July and about 6 months after I started implementing the Getting Things Done meme. And, man, did it change things... 
In this post, I'll share some of my experiences and the obstacles I encountered.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;What is Getting Things Done again?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I'm not going to explain the whole thing here. There are better resources for that (for a list of URLs, see the bottom of this post). Most importantly: get the book "&lt;span style="font-weight: bold;"&gt;Getting Things Done: The Art of Stress-Free Productivity&lt;/span&gt;" by David Allen. Basically, GTD is a way of turning the overwhelming amount of 'stuff' you have to remember to do into a manageable system, and of getting it out of your head.&lt;br /&gt;&lt;br /&gt;Implementing the GTD system has helped me a lot in the last half year. For starters, I got things done (more than I would normally have, I think) while at the same time it allowed me to have less on my mind. I no longer have that dreaded nagging feeling of "I should remember to do this" while knowing that I would forget something else by remembering it. The time that I couldn't get to sleep because I had to remember this and this and that and the other is now a distant memory. Another effect was that it became easier to switch quickly between the different jobs I had to do. I'm not yet at the stage of what the book calls a "mind like water" (i.e. neither over- nor underreacting to anything), but it's started to get a bit fluid. Stuff like responding inappropriately to email, projects, thoughts about what I need to do (the over- or underreacting) leads to less effective results.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;* What works / my system&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;What parts of the GTD system work for me? 
Up till now, it looks like I mainly focussed on the day-to-day tasks rather than the long-term goals and someday/maybe.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;The hardware:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A good &lt;span style="font-weight: bold;"&gt;pen&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;A &lt;span style="font-weight: bold;"&gt;filofax&lt;/span&gt;. After trying different low and high-tech ways to implement GTD, I finally ended up using a filofax. It's a black pocket-size (8x12cm) one called &lt;a href="http://www.filofax.co.uk/store/SEURLF/ASP/SFS/DISPLAY./SIZEID.2/RANGEID.18/DSIZEID.2/SFE/organiser.htm"&gt;Identity&lt;/a&gt;. I renamed the different tabs in it to 'Next Actions', 'Projects', 'Waiting For', 'Someday', 'Calendar' and 'Lists'.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Project folders&lt;/span&gt;. I started creating project folders both at work and at home. I used to take the same notebook to all my meetings to take my minutes. That's now shifted to taking a single piece of paper to them, taking my notes and then archiving that paper in the project folder. It's always good to just be able to take a single project folder to a meeting containing all minutes and other supporting information.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;font-size:100%;" &gt;How I use that hardware:&lt;/span&gt;&lt;span style="font-size:85%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Next Actions and Projects&lt;/span&gt;. Always having the filofax with me, I instantly write down anything that comes into my head that I have to remember to do. At that moment, I don't make the decision if I have to put it in next actions or projects or someday/maybe. I just jot it down on my next actions list. The moment that I come to actually doing that thing, it might be sensible to convert it into a project or a someday/maybe item. 
At any one moment, I have 40 or so items on my next actions list.&lt;br /&gt;David Allen talks about having different next action lists for different &lt;span style="font-weight: bold;"&gt;contexts&lt;/span&gt;: a list for stuff to do at home (@home), a list for stuff to do at the office (@office), while at a computer (@pc), while with a phone (@phone) or when going shopping (@shop). That definitely didn't work for me. I ended up having just &lt;span style="font-style: italic;"&gt;one &lt;/span&gt;next actions list, using the type of bullet to distinguish between the different contexts. The different bullet point types I use are: a triangle (i.e. a stylized tear-drop for the tears of sweat I shed at work), a square with its bottom line missing (a stylized version of a house), a little phone and a dollar-symbol.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/RopKgbMW1EI/AAAAAAAAACo/DuBLEPjgidU/s1600-h/next_actions.png"&gt;&lt;img style="cursor: pointer;" src="http://4.bp.blogspot.com/_t6Ob1J7aZ0A/RopKgbMW1EI/AAAAAAAAACo/DuBLEPjgidU/s200/next_actions.png" alt="" id="BLOGGER_PHOTO_ID_5082957050356880450" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I noticed that it's really important to make a real distinction in &lt;span style="font-weight: bold;"&gt;how you phrase&lt;/span&gt; the next actions and projects. Projects are basically things you have to do that will require more than 1 step, while next actions are the indivisible counterpart. It really helps if you phrase the next actions as verbs ("write unit test"), while the projects are phrased as end-points rather than verbs ("API published" instead of "create API").&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Calendar&lt;/span&gt;. I've been relatively successful in putting all my meetings in the filofax calendar. 
To remember to do stuff with a given deadline, I write a note in my calendar, for example a fortnight earlier, that says "add to next actions: write poster abstract". I do something similar for recurring events: I'll fill in the next ten or so, and add a note with the last occurrence to fill in the next ten.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Templates&lt;/span&gt;. The filofax came with a bunch of empty templates to use. As it didn't take too long to get through those, I created a new template in OpenOffice with a grid and showing where to punch the holes. Just printing it out and making double-sided copies gives me all the empty sheets I want.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;* What kinda works&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-weight: bold;"&gt;weekly review&lt;/span&gt;. This is one of the cornerstones of the system, where you take a little time once a week to go over your next actions, projects and other lists. Until recently, I've neglected this quite a few times. However, it's easy to pick up again and that's what I did. What I do is purge the next actions list (taking the time to rewrite the actions on virgin sheets), check that I have at least one next action related to each project, and ask myself if I have to chase people on my waiting_for list.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Empty inbox&lt;/span&gt;. One of the things you _can_ do, is keep an empty inbox. I got to the stage of the empty inbox for a few weeks, but let it slip again. At the moment there are 1060 mails in it, but I know it will be relatively straightforward to do the big reorganize/purge exercise again.&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;* What doesn't work (yet)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Tickler file&lt;/span&gt;. 
The tickler file is a set of folders to organize paperwork that has to be done by a certain date (see the Wikipedia page for GTD). Although I won't implement this at work, we haven't really decided yet if this is really useful or not at home.&lt;br /&gt;&lt;br /&gt;GTD &lt;span style="font-weight: bold;"&gt;software&lt;/span&gt;. Even though I'm a bioinformatician, I noticed that using software to keep track of my lists doesn't do it for me. I've tried ThinkingRock, BackPack (from 37signals) as well as a host of other little applications. In the end, not being able to use that system while I'm on the bus, in a shop or just sitting in the living room made it clear that I had to go for the analog version.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Literature&lt;/span&gt;. This is what I'm annoyed about the most at the moment. Even though it's not really part of GTD, reading literature should fit into the bigger system in some way or another. I haven't found out how yet. I like to print out the papers and make notes directly on them, but that's no way of organizing the information contained in them. I'm thinking about using CiteULike as a reference manager (don't like Reference Manager or Endnote) and making notes using Adobe Acrobat on the PDF. To be continued...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;* The future&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I noticed that the elements of GTD that I'm using are focused on day-to-day work. I'm not really happy with how I use the someday/maybe list in my filofax, do not have a list of big goals, ... So while I will continue to use this system in the next couple of months, I'll probably try working on the big picture as well. I might keep a couple of lists at home to do that. 
It's not necessary to carry the list with things you want to do one day or your major goals with you all the time.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:100%;" &gt;Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Based on what I've experienced, it is really worthwhile to implement the Getting Things Done strategy to get to stress-free productivity. Start with reading the book. It's important to find your own implementation and decide what things you want to implement or not: next_action lists, empty inbox. You don't have to put every suggestion into practice. There are many things in the book that I decided I wouldn't need.&lt;br /&gt;&lt;br /&gt;It's really easy to start following the GTD philosophy (although becoming a black-belt requires a lot of work), and it's no problem if you unintentionally fall off the wagon after a while: it's straightforward to jump on it again. Just start using those lists again.&lt;br /&gt;&lt;br /&gt;Some really good tips on starting with GTD can be found in &lt;a href="http://gtd.marvelz.com/blog/2007/02/26/10-simple-tips-to-start-getting-things-done/"&gt;this&lt;/a&gt; blog entry.&lt;br /&gt;&lt;br /&gt;And now back to some real work...&lt;br /&gt;&lt;span style="font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;Links&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* The book: see Amazon "Getting Things Done: The Art of Stress-Free Productivity"&lt;br /&gt;* &lt;a href="http://members.optusnet.com.au/%7Echarles57/GTD/orgmode.html#sec-1"&gt;Introduction of concepts&lt;/a&gt;&lt;br /&gt;* &lt;a href="http://gtd.marvelz.com/blog/2007/02/26/10-simple-tips-to-start-getting-things-done/"&gt;How to start&lt;/a&gt;&lt;br /&gt;* Big list of GTD &lt;a href="http://www.atpm.com/13.02/next-actions.shtml"&gt;software&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/4867372421772813569-8746127397352645110?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/8746127397352645110/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/07/six-months-of-getting-things-done.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8746127397352645110'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/8746127397352645110'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/07/six-months-of-getting-things-done.html' title='Six months of Getting Things Done'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_t6Ob1J7aZ0A/RopKgbMW1EI/AAAAAAAAACo/DuBLEPjgidU/s72-c/next_actions.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4592546315351728412</id><published>2007-06-29T16:41:00.001+02:00</published><updated>2007-06-29T17:31:07.769+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='organization'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Naming conventions</title><content type='html'>Naming conventions. You bump into them every single minute of the day. 
Naming new directories in your project folder, naming new tables in your database, ... Recently, the issue of naming conventions came more to the foreground for me as I'm trying to write a ruby API to one of our databases (see later).&lt;br /&gt;&lt;br /&gt;Two of the most-often-encountered naming schemes are &lt;span style="font-weight: bold;"&gt;CamelCase &lt;/span&gt;(ThisIsACamelCaseString) and&lt;span style="font-weight: bold;"&gt; snake_case&lt;/span&gt; (this_is_a_snake_case_string). And in the case of CamelCase: do you make the very first letter a capital or not? If I'm not mistaken, variables in Java are often CamelCase, except for the first letter (thisCouldBeAJavaVariable).&lt;br /&gt;&lt;br /&gt;When thinking of names for directories and files (read also "&lt;a href="http://www.bioinformaticszen.com/2007/02/organising-yourself-as-a-dry-lab-scientist/"&gt;Organizing yourself as a dry-lab scientist&lt;/a&gt;" on BioinformaticsZen), i.e. when there's no set naming convention that you &lt;span style="font-style: italic;"&gt;have &lt;/span&gt;to follow (e.g. variable naming conventions), I tend to use different schemes for directories versus files for some reason. For naming &lt;span style="font-weight: bold;"&gt;directories&lt;/span&gt;, I use an underscore to separate different things in the same name: I name my folders by concatenating the date they were created with the RT Task Tracker ticket and a description (the latter being in CamelCase). For example &lt;span style="font-style: italic;"&gt;~/20070629_RT12345_ThisIsADirectory&lt;/span&gt;. For &lt;span style="font-weight: bold;"&gt;files&lt;/span&gt;, I normally use snake_case, except for scripts... Why? Maybe to distinguish those scripts from the data files. 
For example:&lt;pre&gt;&lt;br /&gt;+- Documents&lt;br /&gt;+- Projects&lt;br /&gt;     +- 20070629_RT12345_ThisIsASampleProject&lt;br /&gt;     |       +- input_file.txt&lt;br /&gt;     |       +- output_file.txt&lt;br /&gt;     |       +- log_file.txt&lt;br /&gt;     |       +- ParseInput.rb&lt;br /&gt;     +- 20070629_RT23446_AnotherProject&lt;br /&gt;            +- input_file.txt&lt;br /&gt;            +- output_file.txt&lt;br /&gt;            +- log_file.txt&lt;br /&gt;            +- ParseInput.rb&lt;/pre&gt;&lt;br /&gt;It would probably not be a bad idea to start to use a default &lt;span style="font-style: italic;"&gt;data &lt;/span&gt;folder or something to keep all the input, output and other files, and a &lt;span style="font-style: italic;"&gt;script &lt;/span&gt;folder for the scripts... I should take a look again at the BioinformaticsZen blog.&lt;br /&gt;&lt;br /&gt;Of course things are completely different when you're &lt;span style="font-weight: bold;"&gt;coding &lt;/span&gt;or setting up &lt;span style="font-weight: bold;"&gt;databases&lt;/span&gt;. In these cases, your preferred programming language will have its own conventions. In ruby, for example, classes are in CamelCase, while variables should be snake_case. All well and good, until they start to bite you in the you-know. I'm trying to create a ruby API to an existing database that would require &lt;a href="http://wiki.rubyonrails.org/rails/pages/UnderstandingPolymorphicAssociations"&gt;polymorphic associations&lt;/a&gt;. This requires that the values in the something_something_type column be class names, which are CamelCase and singular. But of course, the database has everything in snake_case. I found a workaround to get the thing up and running with snake_case, except that it requires the value to be plural, which it of course isn't. 
End result: I'll have to get the actual data values in the database changed to get this thing working.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4592546315351728412?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/4592546315351728412/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/06/naming-conventions_29.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4592546315351728412'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4592546315351728412'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/06/naming-conventions_29.html' title='Naming conventions'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-624292394334902776</id><published>2007-06-26T12:04:00.001+02:00</published><updated>2008-12-09T19:02:29.327+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='annotation'/><title type='text'>Manual genome annotation tools</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/RoD_dh9tgLI/AAAAAAAAACg/KUGFQkaPWSs/s1600-h/argo.jpg"&gt;&lt;img style="cursor: pointer;" 
src="http://2.bp.blogspot.com/_t6Ob1J7aZ0A/RoD_dh9tgLI/AAAAAAAAACg/KUGFQkaPWSs/s320/argo.jpg" alt="" id="BLOGGER_PHOTO_ID_5080341262472413362" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;An important part of genomics and genetics research is to know where your genes of interest lie on the genome and what the gene model looks like. In other words: to know where the exons start and stop, what the UTR boundaries are, and whether there are any polymorphisms in those genes. That's called &lt;span style="font-weight: bold;"&gt;genome annotation&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;With the sequencing of any new genome, the annotation of its genes is the logical next step. This often consists of 2 main phases. The first phase is the &lt;span style="font-weight: bold;"&gt;automated annotation&lt;/span&gt; by BLASTing against known sequences or even &lt;span style="font-style: italic;"&gt;de novo&lt;/span&gt; gene annotation. The second phase is the &lt;span style="font-weight: bold;"&gt;manual curation&lt;/span&gt; of those automated annotations: biologists or bioinformaticians with a biology background looking at those gene models and correcting things like "there should be an additional exon here", "this exon-intron boundary is 2 bp off" or "this actually is an alternative transcript".&lt;br /&gt;&lt;br /&gt;Over the last two years, I've tried out several software packages to deal with the manual curation or annotation of sequences. And I must say: it hasn't been great. Let's walk through them:&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://www.invitrogen.com/content.cfm?pageid=10373"&gt;VectorNTI&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;:&lt;/span&gt;&lt;br /&gt;This is a commercial package from Invitrogen. I didn't try this tool recently, partly because my experience with it at my last job was less than impressive. 
From what I remember, it becomes unusable when you have to handle larger sequences or when you've got a high density of features. Things might have changed in the meantime, but remarks at the water cooler from colleagues do not support that hope.&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://www.sanger.ac.uk/Software/Artemis/"&gt;Artemis&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;:&lt;/span&gt;&lt;br /&gt;The Artemis tool from the Sanger Institute in itself is a great tool with a lot of functionality. You can launch the thing using Java Webstart without having to install it on your own computer. However, I found it to be lacking in intuitiveness and usability. In addition, it looked like it used non-standard ways to store my annotations. I found that any annotations that I made were stored as GenBank annotations at the top of the original FASTA-file that I loaded. Result: the FASTA-file itself became invalid, and it wasn't a GenBank file either. Still, this tool can be very useful for very small projects.&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://www.fruitfly.org/annot/apollo/"&gt;Apollo&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;:&lt;/span&gt;&lt;br /&gt;This is the biggy, developed by the FlyBase people in collaboration with the Sanger Institute in the UK. It's the tool that is referred to the most in the community. But not the tool that I'll use anymore. First of all, it is recommended that you have at least 2 Gigs of RAM when you try to run it. That's right: 2,000 MB of the stuff. Not your average desktop PC, then... In addition, many people have reported that it crashed on them at random, leaving their unsaved work, well, unsaved. The tool also has a lot of features. Too many, actually. 
Finding out how you can do something can take really long because you've got a haystack of things to go through.&lt;br /&gt;&lt;br /&gt;The biggest drawback for me was that I was not able to import my own externally generated results into the tool. You can only do that by creating a GameXML file, which is &lt;span style="font-style: italic;"&gt;way &lt;/span&gt;too cumbersome to do.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Otterlace:&lt;/span&gt;&lt;br /&gt;The &lt;a href="http://www.sanger.ac.uk/HGP/havana/"&gt;HAVANA annotation group&lt;/a&gt; uses a tool based on AceDB for annotation. I can just be brief about this: I think it's a really good tool, but as it's not available to annotators outside the HAVANA group, it has to be discarded as an option.&lt;br /&gt;&lt;br /&gt;&lt;a style="font-weight: bold;" href="http://www.broad.mit.edu/annotation/argo/"&gt;Argo&lt;/a&gt;&lt;span style="font-weight: bold;"&gt;:&lt;/span&gt;&lt;br /&gt;And then I found Argo (from the Broad Institute). This finally looks like a tool that does what it should do: it has an intuitive interface and overview of your genomic region, and it has import and export filters for GFF 1, 2 and 3 as well as GTF. It also allows you to quickly check for non-standard intron-exon boundaries and translations of your gene models, and has a whole lot of other features. The same as Apollo does, but without making it confusing and complicated. 
I would suggest this tool in combination with &lt;a href="http://www.cgb.ki.se/cgb/groups/sonnhammer/Blixem.html"&gt;blixem&lt;/a&gt; (to check BLAST results in detail).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-624292394334902776?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/624292394334902776/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/06/manual-genome-annotation-tools.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/624292394334902776'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/624292394334902776'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/06/manual-genome-annotation-tools.html' title='Manual genome annotation tools'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_t6Ob1J7aZ0A/RoD_dh9tgLI/AAAAAAAAACg/KUGFQkaPWSs/s72-c/argo.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-6845799994295338808</id><published>2007-06-21T11:23:00.000+02:00</published><updated>2007-11-06T14:40:53.040+01:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='ActiveRecord'/><category 
scheme='http://www.blogger.com/atom/ns#' term='GTD'/><title type='text'>Databases and ruby (without rails)</title><content type='html'>Just bumped into a really nice O'Reilly blog &lt;a href="http://www.oreillynet.com/pub/a/ruby/2007/06/21/how-to-build-simple-console-apps-with-ruby-and-activerecord.html"&gt;article&lt;/a&gt; that combines two of the things I like to work with: &lt;a href="http://en.wikipedia.org/wiki/Getting_Things_Done"&gt;GTD&lt;/a&gt; and &lt;a href="http://www.ruby-lang.org/"&gt;ruby&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;As a bioinformatician, I often have to handle quite a lot of data, which I tend to put into databases. Ruby has a fabulous framework in Rails to access and manipulate data, but after having created rails applications for most of those, it became more and more clear that it would be preferable to just have the power of &lt;span style="font-weight: bold;"&gt;ActiveRecord &lt;/span&gt;without having to deploy the whole rails-thing.&lt;br /&gt;&lt;br /&gt;Similar to what is discussed in the blog article by Gregory Brown, I created a &lt;span style="font-weight: bold;"&gt;directory template&lt;/span&gt; including the migration code, a connection configuration and a Rakefile. The directory structure looks like this:&lt;br /&gt;&lt;pre&gt;+- config&lt;br /&gt;|   +- project_config.yaml&lt;br /&gt;|   +- load_config.rb&lt;br /&gt;+- db&lt;br /&gt;|   +- migrate&lt;br /&gt;|   |    +- 001_initial_schema.rb&lt;br /&gt;|   +- import&lt;br /&gt;+- lib&lt;br /&gt;|   +- models.rb&lt;br /&gt;+- Rakefile&lt;br /&gt;+- README&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Let's walk through this:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;The &lt;span style="font-style: italic;"&gt;project_config.yaml&lt;/span&gt; file contains the project name and connection settings to get to the database. 
For example:&lt;/li&gt;&lt;br /&gt;&lt;pre&gt;project:&lt;br /&gt;  name: MyFunkyProject&lt;br /&gt;database:&lt;br /&gt;  adapter: sqlite3&lt;br /&gt;  name: db/my_funky_project.s3db&lt;/pre&gt;&lt;br /&gt;&lt;li&gt;The &lt;span style="font-style: italic;"&gt;load_config.rb&lt;/span&gt; file uses that information to connect to the database.&lt;/li&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'yaml'&lt;br /&gt;require 'active_record' # 'require_gem' is no longer available in RubyGems&lt;br /&gt;&lt;br /&gt;# Simple container for the project settings&lt;br /&gt;class ProjectConfig&lt;br /&gt;  attr_accessor :project_root&lt;br /&gt;  attr_accessor :project_name&lt;br /&gt;  attr_accessor :db_adapter&lt;br /&gt;  attr_accessor :db_name&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;$config = ProjectConfig.new&lt;br /&gt;$config.project_root = File.dirname(__FILE__) + '/..'&lt;br /&gt;&lt;br /&gt;# Read the project name and database settings from the YAML file&lt;br /&gt;settings = YAML.load_file($config.project_root + '/config/project_config.yaml')&lt;br /&gt;$config.project_name = settings['project']['name']&lt;br /&gt;$config.db_adapter = settings['database']['adapter']&lt;br /&gt;$config.db_name = settings['database']['name']&lt;br /&gt;&lt;br /&gt;$connection_settings = Hash.new&lt;br /&gt;$connection_settings[:adapter] = $config.db_adapter&lt;br /&gt;if $config.db_adapter == 'sqlite3'&lt;br /&gt;  # SQLite databases are files: resolve the path relative to the project root&lt;br /&gt;  # (:database is used here instead of the older :dbfile key)&lt;br /&gt;  $connection_settings[:database] = $config.project_root + '/' + $config.db_name&lt;br /&gt;else&lt;br /&gt;  $connection_settings[:database] = $config.db_name&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;ActiveRecord::Base.establish_connection($connection_settings)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;The &lt;span style="font-style: italic;"&gt;lib/models.rb&lt;/span&gt; file contains the... models, obviously. 
It should 'require' the load_config.rb file to get the connection.&lt;/li&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require File.dirname(__FILE__) + '/../config/load_config.rb'&lt;br /&gt;&lt;br /&gt;class Task &lt; ActiveRecord::Base&lt;br /&gt;  belongs_to :project&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;class Project &lt; ActiveRecord::Base&lt;br /&gt;  has_many :tasks&lt;br /&gt;end&lt;/pre&gt;&lt;br /&gt;&lt;li&gt;The &lt;span style="font-style: italic;"&gt;db/migrate/001_initial_schema.rb&lt;/span&gt; is used to create the database.&lt;/li&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require File.dirname(__FILE__) + '/../../config/load_config.rb'&lt;br /&gt;&lt;br /&gt;class InitialSchema &lt; ActiveRecord::Migration&lt;br /&gt;  # Create the tables&lt;br /&gt;  def self.up&lt;br /&gt;    create_table :tasks do |t|&lt;br /&gt;      t.column :description, :string&lt;br /&gt;      t.column :project_id, :integer&lt;br /&gt;    end&lt;br /&gt;    create_table :projects do |t|&lt;br /&gt;      t.column :description, :string&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  # Undo the migration&lt;br /&gt;  def self.down&lt;br /&gt;    drop_table :tasks&lt;br /&gt;    drop_table :projects&lt;br /&gt;  end&lt;br /&gt;end&lt;/pre&gt;&lt;br /&gt;&lt;li&gt;The &lt;span style="font-style: italic;"&gt;import&lt;/span&gt; directory will then hold a group of loading scripts that 'require' the &lt;span style="font-style: italic;"&gt;lib/models.rb&lt;/span&gt; file. They look something like this:&lt;/li&gt;&lt;br /&gt;&lt;pre name="code" class="ruby"&gt;&lt;br /&gt;require File.dirname(__FILE__) + '/../../lib/models.rb'&lt;br /&gt;&lt;br /&gt;# File.foreach closes the file once all lines have been read&lt;br /&gt;File.foreach('my_file.txt') do |line|&lt;br /&gt;  line.chomp!&lt;br /&gt;  # do something useful with the line&lt;br /&gt;end&lt;/pre&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;As you can see, this directory template - useful as it is for me at the moment - can be rationalized a bit, and I should add tests and stuff. 
Gregory Brown's article might just give me the right ideas to do that.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-6845799994295338808?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/6845799994295338808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/06/databases-and-ruby-without-rails.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6845799994295338808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/6845799994295338808'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/06/databases-and-ruby-without-rails.html' title='Databases and ruby (without rails)'/><author><name>Jan Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-4867372421772813569.post-4959894903340763343</id><published>2007-06-19T17:38:00.001+02:00</published><updated>2007-06-21T11:00:46.377+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='literature'/><category scheme='http://www.blogger.com/atom/ns#' term='data integration'/><title type='text'>Anatomy of data integration</title><content type='html'>The paper "Anatomy of data integration" by Brazhnik &amp; Jones (J Biomed Inf 40:252-269 (2007)) gives a clear high-level overview of what is involved in the process of acquiring data from different 
sources and how to integrate them. Apart from talking about information pipelines and conceptual data models, it delves deeper into the concept of types of &lt;span style="font-weight: bold;"&gt;data elements&lt;/span&gt; (DEs). It really all speaks for itself, but it's nice to be able to name things in a meaningful way. Some good things that you already know, but are still helpful to write down.&lt;br /&gt;&lt;br /&gt;Basically, in the context of a data source, you can distinguish between 2 types of DEs: "focal" and "peripheral". The &lt;span style="font-weight: bold;"&gt;focal DEs&lt;/span&gt; are usually mandatory and of high quality, often because they are the primary reason for building the database in the first place (e.g. the assay sequence in a dbSNP record). In contrast, the &lt;span style="font-weight: bold;"&gt;peripheral DEs&lt;/span&gt; are often optional and are much more prone to error (e.g. the number of chromosomes sampled in a dbSNP record). As a remark: making peripheral DEs mandatory is asking for trouble. If I don't know the number of chromosomes sampled for a SNP but I'm forced to fill in some number, then that number introduces wrong data, which is infinitely worse than having a nil value.&lt;br /&gt;&lt;br /&gt;Meaningful integration can only occur between data sources that have a shared pool of focal DEs. You obviously don't try to integrate two different datasets based on the number of chromosomes sampled... This also means that in order to study correlations between two distant domains, you'd need to build what they call a multi-step integration staircase.&lt;br /&gt;&lt;br /&gt;From the viewpoint of the &lt;span style="font-weight: bold;"&gt;integration itself&lt;/span&gt;, the focal DEs can be subdivided into two distinct groups: "integration keys" and "informative elements".&lt;br /&gt;Integration keys are the backbone of the integration: a combination of DEs that identifies exactly the same entity in two sources. 
They are chosen from the overlapping focal DEs. The informative elements represent the goal of the integration and contain the information that we actually want.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;What if there are different data sources for the same data set? To make things worse, it might even be that particular data elements are similar but not the same. For example: you can get SNP data from NCBI and from Ensembl, but people have noticed recently that the same SNP can be annotated on a different strand depending on what database you look at. In this case, a practice of keeping all redundant data along with the information about the source becomes important. The quality of the data might vary between the different sources, and having all information available makes it possible for the researcher to make decisions based on all available evidence. "My SNP is on the forward strand according to NCBI, but on the reverse strand according to Ensembl. I trust Ensembl more, so..."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/4867372421772813569-4959894903340763343?l=saaientist.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://saaientist.blogspot.com/feeds/4959894903340763343/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://saaientist.blogspot.com/2007/06/anatomy-of-data-integration.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4959894903340763343'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/4867372421772813569/posts/default/4959894903340763343'/><link rel='alternate' type='text/html' href='http://saaientist.blogspot.com/2007/06/anatomy-of-data-integration.html' title='Anatomy of data integration'/><author><name>Jan 
Aerts</name><uri>http://www.blogger.com/profile/06333918504426826153</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/-B3VSi_YUtlc/TgCih9bu4YI/AAAAAAAADXw/OrCRjCBhsXo/s220/1300111711084.jpg'/></author><thr:total>0</thr:total></entry></feed>
