1. Ryo Sakai reminded me a couple of weeks ago of Simon Sinek's excellent TED talk "Start With Why - How Great Leaders Inspire Action", which inspired this post... Why do I do what I do?

    The way data can be analysed has been automated more and more over the last few decades. Advances in machine learning and statistics make it possible to extract a lot of information from large datasets. But are we starting to rely too much on those algorithms? Several issues seem to pop up more and more. For one thing, research in algorithm design has enabled many more applications, but at the same time it has made these algorithms so complex that they start to operate as black boxes: not only for the end-user who provides the data, but even for the algorithm developer. Another issue with pre-defined algorithms is that having them around keeps us from identifying unexpected patterns: if the algorithm or statistical test is not specifically written to find a certain type of pattern, it will not find it. A third issue: (arbitrary) cutoffs. Many algorithms rely heavily on the user (or even worse: the developer) defining a set of cutoff values. This is as true in machine learning as it is in statistics. A statistical test returning a p-value of 4.99% is considered "statistically significant", but you'd throw away your data if that p-value were 5.01%. What is so intrinsic about 5% that it forces you to choose between "yes, this is good" and "let's throw our hypothesis out the window"? All in all, much of this comes back to the fragility of using computers (hat tip to Toni for the book by Nassim Taleb): you have to tell them what to do and what to expect. They're not resilient to changes in setting, data, prior knowledge, etc.; at least not as much as we are.

    So where does this bring us? It's my firm belief that we need to put the human back in the loop of data analysis. Yes, we need statistics. Yes, we need machine learning. But also: yes, we need a human individual to actually make sense of the data and drive the analysis. To make this possible, I focus on visual design, interaction design, and scalability. Visual design, because the way we represent data often needs improvement before it can cope with high-dimensional datasets; interaction design, because it's often by "playing" with the data that the user gains insights; and scalability, because it's not trivial to process big data fast enough to keep that interaction fluid.

  2. From: http://cartoonfestival.knokke-heist.be/pagina/iedereen-geniaal

    "I'll do Angelina Jolie". Never thought I'd say that phrase while talking to well-known Belgian cartoonists, and actually be taken serious.

    Backtrack about one year. We're at the table with the crème-de-la-crème of Belgium's cartoon world (Zaza, Erwin Vanmol, LECTRR, Eva Mouton, ...), in a hotel in Knokke near the coast. "We" is a gathering of researchers covering genetics, bioinformatics, ethics, and law. The setup: the Knokke-Heist International Cartoon Festival. This very successful festival centers each year around a particular topic. 2013 was "Love is..."; 2014 is about genetics. Hence our presence at the site. On the program for day 1: explaining genetics and everything that gets dragged into it (privacy, etc) to the cartoonists. Day 2: a discussion on which messages we should bring at the festival, and a quick pictionary to check if we had actually explained the concepts well. (As I was doing some doodling myself at the time, I briefly got to be a "cartoonist" as well and actually drew one of those :-)
    "So what's the thing with Angelina Jolie?", you ask? We figured that she be the topic of part of the cartoonfestival installation (talking about breast cancer, obviously), and I volunteered to help out setting up that section...



    Fast forward to this late summer. The cartoonfestival is in full swing, and I'm trying to explain the central dogma and the codon table to a bunch of 8-13 year-olds in the Children's University. I thought it'd be nice to let them muck about with strawberries to get the DNA out, and write their names in secret code (well: just considering each letter to be an amino acid...). I was really nervous in the days/weeks before the actual event; kids can be a much harsher audience than university students. Or so I thought; it was quite the opposite: the feedback from the children was marvellous and I really enjoyed their enthusiasm. To be repeated... :)

    I know this post is way overdue (especially given the fact that the cartoonfestival actually closed last weekend). But with it I hope to resurrect this blog from the comatose state it has been in since I started my current position 4 years ago...






  3. We could still use more applicants for this position, so I'm bumping this open position...

    Available: Research position Biological Data Visualization and Visual Analytics


    Keywords: biological data visualization; visual analytics; data integration; genomics; postdoc

    Are you well-versed in the language of Tufte? Do you believe that visualization plays a key role in understanding data? Do you like to work in close collaboration with domain experts using short iterations? And do you want to use your visualization skills to help us understand what makes a cancer a cancer, and what distinguishes a healthy embryo from one that is not?

    We're looking for a motivated data visualization specialist to help biological researchers understand variation within the human genome. Methodologies exist for analyzing this type of data, but they are still immature and return very different results depending on what assumptions are made. The data can also be used for a huge number of different research questions, which necessitates developing very exploratory tools to support hypothesis generation.

    Profile
    The ideal candidate is well-motivated, holds a PhD (or at least MSc) degree in computer science or bioinformatics, and has experience in data visualization (e.g. using tools like D3 [http://d3js.org] or Processing [http://processing.org]). Prior experience working with DNA sequencing data and genome-wide detection of genetic variation would be an advantage but is not crucial. Good communication skills are important for this role.

    You will collaborate closely with biologists and contribute to the reporting of the project. You will be able to work semi-independently under the supervision of a senior investigator, mentor PhD students, and contribute to the acquisition of new funding. A three-year commitment is expected. Start date is as soon as possible.

    Relevant publications

    • Medvedev P, Stanciu M & Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6(11):S13-S20 (2009)
    • Nielsen CB, Cantor M, Dubchak I, Gordon D & Wang T. Visualizing genomes: techniques and challenges. Nat Methods 7:S5-S15 (2010)
    • Bartlett C, Cheong S, Hou L, Paquette J, Lum P, Jager G, Battke F, Vehlow C, Heinrich J, Nieselt K, Sakai R, Aerts J & Ray W. An eQTL biological data visualization challenge and approaches from the visualization community. BMC Bioinformatics 13(8):S8 (2012)

    Application
    For more information and to apply, please contact Jan Aerts (jan.aerts@esat.kuleuven.be, @jandot, +Jan Aerts). If possible, also send screenshots and/or screencasts of previous work.
     
    URL: http://www.kuleuven.be/bioinformatics/

  4. http://www.ftmsglobal.edu.kh/wp-content/uploads/2012/04/Your-Career.jpg

    SymBioSys is a consortium of computational scientists and molecular biologists at the University of Leuven, Belgium, focusing on how individual genomic variation leads to disease through cascading effects across biological networks (in specific types of constitutional disorders and cancers). We develop innovative computational strategies for next-generation sequencing and biological network analysis, with demonstrated impact on actual biological breakthroughs.

    The candidate will be a key player in the SymBioSys workpackage that focuses on genomic variation detection based on next-generation sequencing data (454, Illumina, PacBio) using a visual analytics approach (i.e. combining machine learning with interactive data visualization). This includes applying and improving existing algorithms and tools for the detection of structural genomic variation (insertions, deletions, inversions and translocations), as well as developing interactive data visualizations in order to investigate the parameter space of these algorithms. These methods will be applied to specific genetic disorders in day-to-day collaboration with the human geneticists within the consortium.

    We offer a competitive package and a fun, dynamic environment with a top-notch consortium of young leading scientists in bioinformatics, human genetics and cancer. Our consortium offers a rare level of interdisciplinarity, from machine learning algorithms and data visualization to fundamental advances in molecular biology, to direct access to the clinic. The University of Leuven is one of Europe’s leading research universities, with English as the working language for research. Leuven lies just east of Brussels, at the heart of Europe.

    Profile
    The ideal candidate holds a PhD degree in bioinformatics/genomics and has good analytical, algorithmic and mathematical skills. Programming and data analysis experience is essential. Prior experience working with sequencing data, in particular alignment of next-generation data and genome-wide detection of genetic variation, would be a distinct advantage. Experience in data visualization - e.g. using tools like D3 (http://d3js.org) or Processing (http://processing.org) - would also be considered a big plus. Good communication skills are important for this role.

    The candidate will collaborate closely with researchers across the consortium and contribute to the reporting of the project. Qualified candidates will be offered the opportunity to work semi-independently under the supervision of a senior investigator, mentor PhD students, and contribute to the acquisition of new funding. A three-year commitment is expected from the candidate. Preferred start date is November/December 2012, so please let us know asap.


    Relevant publications
    • Conrad D, Pinto D, Redon R, Feuk L, Gokumen O, Zhang Y, Aerts J, Andrews D, Barnes C, Campbell P et al. Origins and functional impact of copy number variation in the human genome. Nature 464:704-712 (2010)
    • Medvedev P, Stanciu M & Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6(11):S13-S20 (2009)
    • Nielsen CB, Cantor M, Dubchak I, Gordon D & Wang T. Visualizing genomes: techniques and challenges. Nat Methods 7:S5-S15 (2010)


    Application
    Please send in PDF: (1) a CV including education (with Grade Point Average, class rank, honors, etc.), research experience, and bibliography, (2) a one-page research statement, and (3) two references (with phone and email) to Dr Jan Aerts (jan.aerts@esat.kuleuven.be), cc Dr Yves Moreau (yves.moreau@esat.kuleuven.be) and Ms Ida Tassens (ida.tassens@esat.kuleuven.be).
     
    URL: http://www.kuleuven.be/bioinformatics/

    To apply: http://phd.kuleuven.be/set/postdoc/voorstellen_departement?departement=50000516
  5. Since the publication of the human genome sequence about a decade ago, the popular press has reported on many occasions about genes allegedly found for things ranging from breast size, intelligence, popularity and homosexuality to fidgeting. The general population is constantly told that the revolution is just around the corner. But over the last year or so, articles have started to pop up in the popular press saying that genomics and genetics will not be able to deliver what they promised (or what people thought they promised) a couple of years ago. The technology of (next-generation) sequencing is clearly following the Gartner Hype Cycle, and we're probably nearing the top of the "peak of inflated expectations".


    Gartner Hype Cycle (taken from TechHui.com)


    As a researcher, I am not insensitive to this sentiment myself. Even though (or because) I have contributed to some large genome sequencing projects (in chicken and cow) and have worked closely with the 1000Genomes team while at the Wellcome Trust Sanger Institute, I feel a growing frustration with genetics and genomics. It's all very interesting to build genomes, find associations of genes with disease, etc, but what can I as an individual do with this information? Yes, our research helps us understand the core of biology, and we do help (parents of) people with rare diseases diagnose those diseases or find the gene that causes a particular congenital condition. But how can this information help the vast majority of the population in their day-to-day life?

    My frustration with what genetics can tell me

    Under the umbrella of GenomesUnzipped, I had about half a million SNPs genotyped by 23andme (for data: see here). Based on that data and the scientific literature, for example, they state that I have a higher chance of getting venous thromboembolism (VTE). Almost 18% of men of European descent with my genotype will develop VTE before they're 80, compared to 12.3% of the general European male population. The heritability of VTE, by the way, is about 55%. So what does this tell me? I should eat healthily, not smoke, and do more exercise. Still: my genotype cannot actually tell me when I will get this, not even just before the event.

    The issue is that our genomes are just the blueprints for who we are; they're not us. For that, we need to look at the other omes: our transcriptomes, but most of all our proteomes and metabolomes, and our environment. Whatever our genotype is, it has no effect whatsoever except through how it affects the constitution of the enzymes and other molecules in our cells: proteins may for example not work, or be present in non-optimal amounts.

    Meanwhile...

    When you go to a doctor hoping to get rid of frequent headaches, what does the doctor base his diagnosis on? The symptoms? More often than not: your memory. "I think I had those headaches last Friday and Monday, and if I recall correctly the one on Monday was a bit worse than the other one". Doctor: "is it always after eating something specific?". You: "I can't remember". Wouldn't it be great if in such cases your GP could diagnose your disease based on actual data?
    But then comes the next step: taking drugs. The dose of a particular drug (and actually the choice of drug itself) is based on population-wide averages of efficacy and the occurrence of side effects of that drug. It's not based on how you react to that particular drug.

    Quantified Health

    Enter personalized medicine, and more specifically: P4Medicine (predictive, preventive, personalized and participatory). I'd like to look at this from a little broader perspective, as quantified health: data-driven health.

    For quite a while now I've been following the Quantified Self movement (see e.g. quantifiedself.com). The aim of this movement is to improve self knowledge through self tracking: collecting data about yourself to identify trends (e.g. weight loss) and correlations (e.g. linking migraine episodes to triggers). This knowledge can then be used to change someone's behaviour, to predict the onset of disease or episodes thereof, or to prevent it altogether. What if your smartphone could send you a message on Friday afternoon telling you to get into the dark and drink more fluids, because otherwise you'd get a migraine episode that Saturday? The big thing here is that any decision would be personalized and appropriate for that individual, rather than for the majority of the population that that individual belongs to (e.g. male Caucasians).

    http://jaeselle.com/wp-content/uploads/2012/05/the-quantified-self.jpeg


    Conceptually, the things that we can track can be tracked either externally (e.g. using a fitbit, tracking apps on a smartphone, continuous ECG monitoring) or internally (e.g. using biosensors to follow molecular markers). Working in the omics field, I'm obviously very interested in the latter: what molecules can we easily track in the body that can predict disease? Even as a boy, I fantasized that we could track anything happening in our bodies and use that to stay healthy. D-dimers are a nice example here. These are protein fragments that are produced when a blood clot degrades. Detecting d-dimers has a high sensitivity and negative predictive value for thrombosis (remember that I have a higher genetic predisposition for VTE), but unfortunately a low specificity. This means that if you have a blood clot forming (one that could get dislodged) you will definitely show d-dimers, but having d-dimers in your blood does not necessarily mean that you're forming clots. Current practice, where a blood sample is taken only when the doctor orders one because there is a suspicion of blood clotting, however, has little value. With age, one starts seeing this molecule in an individual's blood anyway. But if we were to monitor this molecule longitudinally (hypothetically: every day), this background noise would become irrelevant. Hence: quantifying self at the molecular level.
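
    To make that last point concrete, here's a quick Clojure sketch (made-up numbers and an arbitrary cutoff, purely for illustration) of what quantifying against your own baseline could look like: a daily value only gets flagged when it deviates strongly from that individual's own history, so a slowly rising age-related background doesn't keep triggering alarms.

    (defn mean [xs] (/ (reduce + xs) (count xs)))
    
    (defn sd [xs]
      (let [m (mean xs)]
        (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
                      (count xs)))))
    
    ;; flag today's (hypothetical) marker value only if it sits far above this
    ;; person's own longitudinal baseline; "3 standard deviations" is an
    ;; arbitrary cutoff, chosen just for this sketch
    (defn flag? [history today]
      (> today (+ (mean history) (* 3 (sd history)))))
    
    (flag? [0.20 0.22 0.21 0.23 0.22] 0.95) ;; => true
    (flag? [0.20 0.22 0.21 0.23 0.22] 0.24) ;; => false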

    Nice examples

    If you're interested, you should definitely check out the research on the Integrated Personal Omics Profile (iPOP, or "Snyderome") by Chen et al (doi: 10.1016/j.cell.2012.02.009), where Michael Snyder's transcriptome, proteome and metabolome were sampled 20 times during a 14-month period. Another astonishing story is that of Eric Alm's year-long gut microbiome tracking (check the video!). There are some interesting conclusions based on his (and one of his students') data! For the student, traveling only changed his gut microbiome transiently. Getting salmonella, however, resulted in a permanent change in Alm's microbiome constitution. Finally, check out these articles on Larry Smarr: "Is health tracking the next big thing?", and "The Measured Man".

    I can also definitely recommend the book Experimental Man by David Ewing Duncan. It shows you what's possible, and most of all what's not (yet) possible.

    I also created a "Quantified Health" paper.li a while ago, which helps me pick up interesting news.

    Challenges

    To fully implement quantified health, there are still several challenges. As these are basically n-of-one studies, the statistics and data mining will be different. Even more important: how do we return results and trends to the end user? It will be very important to display the data and results within their context rather than just reporting p-values and "above threshold" flags. On the molecular side, we also need some good use cases. These should

    1. have clear "events" that we can track (something happened or didn't happen),
    2. have useful (combinations of) biomarkers (so that we don't need to do a full-blown discovery phase),
    3. be in a population that is happy to provide the samples (e.g. daily blood sample),
    4. have assays to follow those biomarkers
    From what I'm finding at the moment, it's the first two conditions that are the hardest to meet. Assays can often be developed using antibodies and/or aptamers (really cool technology, BTW), and for a proof-of-principle it should be possible to start with diabetes patients who have to sample their blood periodically anyway. The use of saliva or urine samples would be nice, but unfortunately most biomarkers will be in the blood...

    We're on the brink of truly personalizing medicine and health.

  6. Bit of a technical post for my own reference, about visualization and scripting in clojure.

    Clojure and visualization

    Since I'm interested in Clojure, a tweet by Francesco Strozzi (@fstrozzi) caught my attention last week: "A D3 like #dataviz project for #clojure. Codename C2 and looks promising. http://keminglabs.com/c2/. They need contribs so spread the word!" I tried to do some things in D3 a while ago, but the JavaScript got in the way so I gave up after a while. Still, I was pulled towards something HTML5+CSS rather than the Java applets created by processing.org.

    Although still in a very early stage, C2 is already very powerful. Rapid development of visualizations is aided by the visual repl: the webpage localhost:8987 automatically loads the last modified file in the samples directory.


    (ns rectangles-hover
      (:use [c2.core :only [unify]]))
    
    (def css "
    body { background-color:white;}
    rect { -webkit-transition: all 1s ease-in-out;
           fill:rgb(255,0,0);}
    rect:hover { -webkit-transform: translate(3em,0);
                 fill:rgb(0,255,0);}
    circle { opacity: 1;
             -webkit-transition: opacity 1s linear;
             fill:rgb(0,0,255);}
    circle:hover { opacity: 0; }
    ")
    
    [:svg
     [:style {:type "text/css"} (str "<![CDATA[" css "]]>")]
     (unify {"A" 50 "B" 120} (fn [[label val]]
      [:rect {:x val :y val :height 50 :width 60}]))
     (unify {"C" 180 "D" 240} (fn [[label val]]
      [:circle {:cx val :cy val :r 15}]))]
    


    This bit of code draws 2 red rectangles and 2 blue circles. Hovering the mouse over any of the rectangles will move it to the right and change its colour to green; hovering over a circle will make that circle transparent. Some more scripts that I've used to build up simple things and learn C2 are on github.

    Although interactions are not covered in C2 itself, simple transitions can be handled in the CSS part (see the example above). Brushing, linking and other types of interaction would be interesting to have available as well, though. But the developer Kevin Lynagh is very responsive.

    I haven't looked yet into how to run C2 without the visual repl; still on my to-do list. (UPDATE: see end of post)

    Clojure and scripting

    And today, I saw this. Leiningen 2 will allow you to easily execute little Clojure scripts without the whole setup of a project. That makes them amenable to pipelining, just like you would do with little Perl/Ruby/Python scripts. The completely-useless-but-good-enough-as-proof-of-principle little example below attaches some dashes to the front and stars to the back of anything you throw at it from STDIN.

    #!/bin/bash lein exec
    (doseq [line (line-seq (java.io.BufferedReader. *in*))]
     (println (str "----" line "****")))

    Pipe anything into this:
    ls ~ | ./proof-of-principle.clj

    Dependencies are now stored in ~/.m2 rather than in the project directory, and you can load libraries such as Incanter like this:

    #!/bin/bash lein exec
    (use '[leiningen.exec :only (deps)])
    (deps '[[incanter "1.3.0"]])
    
    (use '(incanter core charts stats datasets))
    (save (histogram (sample-normal 1000)) "plot.png")

    This also works in the interactive repl ("lein repl").

    Bringing the two together

    It's really easy to combine these two (after a pointer from C2's Kevin; thanks!). You need an additional dependency on hiccup to convert the output to HTML, but that's it.

    Here's a script that, when executed with "lein exec this-script.clj", will generate an HTML file with the interactive picture shown above.

    #!/bin/bash lein exec
    (use '[leiningen.exec :only (deps)])
    (deps '[[com.keminglabs/c2 "0.1.1"] [hiccup "1.0.0"]])
    
    (use '[c2.core :only (unify)])
    (use 'hiccup.core)
    
    (def css "
    body { background-color:white;}
    rect { -webkit-transition: all 1s ease-in-out;
           fill:rgb(255,0,0);}
    rect:hover { -webkit-transform: translate(3em,0);
                 fill:rgb(0,255,0);}
    circle { opacity: 1;
             -webkit-transition: opacity 1s linear;
             fill:rgb(0,0,255);}
    circle:hover { opacity: 0; }
    ")
    
    (def svg [:svg
     [:style {:type "text/css"} (str "<![CDATA[" css "]]>")]
     (unify {"A" 50 "B" 120} (fn [[label val]]
      [:rect {:x val :y val :height 50 :width 60}]))
     (unify {"C" 180 "D" 240} (fn [[label val]]
      [:circle {:cx val :cy val :r 15}]))])
    
    (spit "test.html" (html svg))

  7. Finally time to write something about the biovis/visweek conference I attended about a week ago in Providence (RI)... And I must say: they'll see me again next year. (Hopefully @infosthetics will be able to join me then). Meanwhile, several blog posts are popping up discussing it (see here and here, for example).

    This was the first time that biovis (aka the IEEE Symposium on Biological Data Visualization) was organized. It's similar to the 2-year old vizbi, but has an agenda that is more focused on research in visualization rather than application of visualization in the biological sciences. Really interesting talks, posters and people.

    The biovis contest
    This first installment of biovis included a data visualization contest, focusing on "specific biological problem domains, and based on realistic domain data and domain questions". The topic this year was eQTL (expression quantitative trait loci) data, and I'm really happy that Ryo Sakai - now a PhD student in my lab - won the "biologists' favourite" award!! The biologist jury was impressed with the ease with which his visualizations of the eQTL data highlighted and confirmed prior knowledge, and how they suggested directions for further experiments. It was interesting to see the huge variation in the submissions, going from just showing the raw data in an interesting way (which Sakai-san did) to advanced statistical and algorithmic munging of the input data and visualizing the end result (which the winner of the "dataviz professionals' favourite" award did). See how this relates to my previous post on humanizing bioinformatics?


    Interesting talks - amazing (good & bad) talks
    As this was the first time I attended visweek, I was really looking forward to the high-quality presentations/papers and posters. Overall, I got what I wanted. But there were some papers and posters that I have major doubts about (taking into account that I have to be humble here, talking about people who have been working in the field for far longer than I have).
    One example that seemed pretty counterintuitive was a presentation by Basak Alper from Microsoft about a new set visualization technique that they baptized LineSets. The main issue that they want to solve is the visualization of intersections of >3 sets (up to 3 you'd just use Venn diagrams). Their approach is to connect the different elements from a set by a line; hence: linesets. However, I (and many others with me) felt that this approach has some very serious drawbacks. Most of all, it suggests that there is an implicit ordering of the elements, which there isn't. In the image below, for example, line sets were used to connect Italian restaurants (in orange) and Mexican restaurants (in purple). That's the only thing this visualization wants to do: tell me which of the restaurants are Italian and which are Mexican. But give this picture to 10 people, and every single one of them will think that the lines are actually paths or routes between these restaurants. Which they're not... The example below shows data that has specific positions on a map, but they demonstrate this approach on social networks as well.
    LineSets
    Another example comes from the biovis conference: TIALA or Time Series Alignment Analysis. Suppose you have the time-dependent gene expression values for a single gene, which you'd plot using a line plot. Now what would you do if you have that type of data for 100 genes? Would you put those plots into 3D? I know I wouldn't... And better still: would you then connect these plots so you end up with some sort of 3D-landscape? That's like connecting the tops of a barchart displaying categorical data with a line...

    TIALA - Time Series Alignment Analysis


    But of course there were plenty of really good talks as well. Some of the talks I really enjoyed were those about HiTSEE (by Bertini et al) on the analysis of high-throughput screening experiments, EVEVis (Miller et al) on multi-scale visualization for dense evolutionary data, arc length-based aspect ratio selection (Talbot et al), which is an alternative to banking to 45 degrees, drawing road networks with focus regions (Haunert et al), and especially DICON, which showed an amazing application of visual analysis of multidimensional clusters using healthcare data.

    HiTSEE

    EVEVis
    Road networks with focus regions
    DICON - interactive visual analysis of multidimensional clusters

    Meeting interesting people
    But of course this was very much about meeting interesting people as well. It was really nice to exchange ideas again with the biovis crowd (Nils Gehlenborg, Cydney, Tamara, Will Ray, ...), and I finally had the chance to have a chat with @filwd Enrico. All those discussions with Thorri from Icelandic DataMarket were both useful and fun (as was our day hanging out in town, chatting to the Occupy Providence woman (forgot her name, I'm afraid) and trying to find a good hat).
    At the airport on my way back, as I was trying to find out how to get to Brussels (our flights were cancelled due to the weather), a chap came up to me and introduced himself as someone from Belgium. From Leuven. From our very own faculty. So together with @infosthetics Andrew, that now makes three of us :-)

    Anyway: I'll definitely be back next year (have to play some more official role anyway) and already looking forward to it.


  8. I was invited last week to give a talk at this year's meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day with great talks by some bright PhD students and postdocs. There were 2 keynotes (one by Prof Bert Poolman from Groningen (NL) and one by myself), and a panel discussion on what the future holds for people nearing the end of their PhDs.

    My talk was titled "Humanizing Bioinformatics" and was received quite well (at least some people still laughed at my jokes (if you can call them that), even at the end). I put the slides up on slideshare, but I thought I'd explain things here as well, because those slides probably don't convey the complete story.

    Let's ruin the plot by mentioning it here: we need data visualization to counteract the alienation that's happening between bioinformaticians and bright data miners on the one hand, and the user/clinician/biologist on the other. We need to make bioinformatics human again.

    The very interesting book "The Fourth Paradigm - Data-intensive Scientific Discovery", built around Jim Gray's vision at Microsoft Research, describes how the practice of research has changed over the centuries. Get it. Read it. In the First Paradigm, science was very much about describing things; the Second Paradigm (the last couple of centuries) saw a more theoretical approach, with people like Kepler and Newton defining "laws" that described the universe around them. The last few decades saw the advent of computation in research, which allowed us to take a closer look at reality by simulating it (the Third Paradigm). But just recently - so Jim Gray says - we have moved into yet another fundamental way of doing science: an age where so much data is generated that we don't know what to do with it. This Fourth Paradigm is that of data exploration. As I see it (but that's just one way of looking at it, and it doesn't say anything about which is "better"), this might be a definition for the difference between computational biology and bioinformatics: computational biology fits within the Third Paradigm, while bioinformatics fits in the Fourth.

    Being able to automatically generate these huge amounts of data (e.g. in genome sequencing) does mean that biologists have to work with ever bigger datasets, using ever more advanced algorithms that use ever more complicated data structures. This is not about just some summary statistics anymore; it's support vector machine recursive feature elimination, manifold learning, adaptive cascade sharing trees and stuff. Result: the biologist is at a loss. Remember Dr McCoy in Star Trek saying "Dammit Jim, I'm a doctor, not an electrician/cook/nuclear physicist" whenever the captain asked him to do stuff that was - well - not doctorly? (Great analogy found by Christophe Lambert.) It's exactly the same for a clinician nowadays. In order to do a (his job: e.g. decide on a treatment plan for a cancer patient), he first has to do b (set up hardware that can handle the 100s of Gb of data) and c (devise some nifty data mining trickery to get his results). Neither of which he has the time or training for. "Dammit Jim, I'm a doctor, not a bioinformatician". Result: we're alienating the user. Data mining has become so complicated and advanced that the clinician is at a complete loss. Heck, I work at a bioinformatics department and don't understand half of what they're talking about. So what can the clinician do? His only option is to trust some bioinformatician to come up with some results. But this is blind trust: he has no way of assessing the results he gets back. This trust is even more blind than the trust you put in the guy who repairs your car.

    As I see it, there are (at least) four issues.

    What's the question?
    Data generation used to be really geared towards proving or disproving a specific hypothesis. The researcher would have a question, formulate some hypothesis around it, and then generate data. Although that same data could already be used to answer other, unanticipated questions as well, this really became an issue with easy, automated data generation; DNA sequencing being a prime example. You might ask yourself "does this or that gene have a mutation that leads to this disease?", but the data you generate to answer this question (in this case exome sequences) can be used to answer hundreds of other questions as well. You just don't know what questions yet...
    Statistical analysis and data mining are indispensable for (dis)proving hypotheses, but what if we don't know the hypothesis? Like many others in the field, I believe that data visualization can give us some clues about what to investigate further.



    Let's for example look at this example hive plot by Martin Krzywinski (for what B means: see the explanation at the hive plot website). Suppose you're given a list of genes in E.coli (or a list of functions in the linux operating system) and the network between those genes (or functions). Using clever visualization, we can define some interesting questions that we can then look into using statistics or data mining. For example: why do we see so many workhorse genes in E.coli? Does this reflect reality, and what would that mean? Or does it mean that our input network is biased? What is so special about that very small number of workhorse functions in linux that have such high connectivity? These are the kinds of questions that need to be presented to us.

    What parameters should I use?
    Second issue: the outcome of most data mining/filtering algorithms depends tremendously on choosing the right parameters. But it can be very difficult to actually find out what those parameters should be. Does a "right" set of parameters for this or that algorithm even exist? Also, tweaking some arguments just a little bit can have vast effects on the results, while you can change other parameters as much as you want without affecting the outcome whatsoever.

    Turnbull et al. Nature Genetics 2010
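
    To make that sensitivity concrete: a tiny, purely hypothetical Clojure sketch (made-up quality scores, made-up cutoffs) of how much the number of surviving candidate SNPs can depend on a single threshold.

    ;; hypothetical per-SNP quality scores
    (def quality-scores [5 12 18 19 20 21 22 35 40 90])
    
    (defn surviving [min-quality]
      (count (filter #(>= % min-quality) quality-scores)))
    
    ;; sweep the cutoff and see how the number of surviving SNPs changes
    (map (juxt identity surviving) [10 15 20 25 30])
    ;; => ([10 9] [15 8] [20 6] [25 3] [30 3])

    Tightening the cutoff from 20 to 25 here removes half of the calls, while going from 25 to 30 changes nothing; exactly the kind of behaviour that is hard to anticipate without exploring the parameter space.
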
    Can I trust this output?
    Issue number 3: if I am a clinician/biologist and a bioinformatician hands me some results, how do I know if I can trust those results? Heck, being a bioinformatician myself and writing some program to filter putative SNPs, how do I know that my results are correct? Suppose there are 3 filters that I can apply consecutively, with different combinations of settings.



    Looking at exome data, the main information we can use for assessing the results of SNP filtering is the fact that you should end up with 20k-25k SNPs and a transition/transversion ratio of about 2.1 (if I remember correctly). But there are many different combinations of filters that can give these summary statistics. The state of the art (believe it or not) is to just run many different algorithms and filters independently, and then take the intersection of the results...
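
    In code, that last step is embarrassingly simple; a minimal Clojure sketch (hypothetical SNP positions, one set per filter):

    (require '[clojure.set :as set])
    
    ;; hypothetical candidate SNP positions reported by three independent filters
    (def filter-a #{1001 1005 1010 2020})
    (def filter-b #{1001 1010 2020 3030})
    (def filter-c #{1001 1010 2020 4040})
    
    ;; keep only the calls that all three filters agree on
    (set/intersection filter-a filter-b filter-c)
    ;; => the positions 1001, 1010 and 2020

    Easy to compute, but it tells you nothing about why the tools disagree on the remaining calls, which is exactly where visualization could help.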

    I can't wrap my head around this...
    And finally, there's the issue of too much information. Not just the sheer amount, but the number of different data sources. It's actually not really too much information per se, but too much to keep in one head. Someone trying to decide on a treatment plan for a cancer patient, for example, will have to combine data from heterogeneous datasets, multiple abstraction levels and multiple sources. He'll have to look into patient and clinical data, family/population data, MR/CT/X-ray scans, tissue samples, gene expression data and pathways. That's just too much. His cognitive capacities are fully engaged in trying to integrate all that information, rather than in answering the initial question.

    Visualization... part of the solution
    I'm not saying anything new here when I suggest that data visualization might be part of the solution to these problems. Where current technologies and analysis methods have alienated the end-user from his own results, visualization can reach across and help close that gap. The rest of the presentation is basically about some basic principles in data visualization, which I won't go further into here.

    All in all, I think the presentation went quite well.









  9. Last Friday I received my long-anticipated copy of "Visualize This" by Nathan Yau. On its website it is described as a "practical guide on visualization and how to approach real-world data". You can guess what my weekend looked like :-)

    Overall, I believe this book is a very good choice for people interested in getting started in data visualization. Not only does it provide the context in which to create visualizations (chapters 1, 2 and 9), it also covers different tools for creating them: R, Protovis, Flash, ... Apart from chapter 3, which is dedicated entirely to that topic, different examples in the book were created using different tools, which gives people a good feel for what's possible in each and how "hard" or "easy" the coding itself is for the different options. Different chapters discuss different types of data that you could encounter: patterns over time, proportions, relationships, ...

    There were some minor points in the book that I'd mention if they asked me to review it (but that's just my view, and I don't want to pretend to be an expert). First of all, it would have been nice if Nathan had gone a little bit deeper into the theories behind what is seen as good visualization. In the first chapter ("Telling Stories with Data") he does mention Cleveland & McGill in a side note, but I think that information (along with the Gestalt laws, etc) definitely deserves one or two full paragraphs, if not half a chapter. I also don't completely agree with the use of a stacked barchart (around page 109). From my experience, they're worth less than the time it takes to create them. After all, it's impossible to compare any groups other than the one at the bottom (which therefore has a common "zero" line). For example: look at the first picture below. This shows the number of "stupid things done" by women and men, stratified over different groups (A-F). Although it is easy to compare total stupidity per group (group C is doing particularly badly), as well as that for men, we can't see which of the groups A, D or F scores worst for women. And that's because they don't have a common origin. We could of course put the women next to the men, but then we'd lose the totals.


    In the second plot, however, it is possible to compare women, men and totals. The bars for women are put next to those for men, but I've added a larger, shaded bar at the back that shows the sum of the two. This plot was originally created in R using ggplot2, but I'm afraid I can't find the reference that explained how to do this anymore... Let me know if you can find it.
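
    I can't reproduce the original ggplot2 recipe, but the idea itself is simple enough to sketch. Below is a rough, hypothetical version in Clojure (plain SVG generated with hiccup, made-up numbers, not the original code): a wide, light bar at the back for each group's total, and the women/men bars in front sharing the same zero line.

    (use 'hiccup.core)  ;; assumes hiccup is on the classpath
    
    ;; made-up counts of "stupid things done" per group
    (def data {"A" {:women 4 :men 6}
               "B" {:women 3 :men 2}
               "C" {:women 5 :men 7}})
    
    (def chart
      [:svg {:width 300 :height 160}
       (map-indexed
        (fn [i [label {:keys [women men]}]]
          (let [x (* i 90) scale 10 total (+ women men)]
            [:g
             ;; wide, light bar at the back: the group total
             [:rect {:x x :y (- 150 (* scale total)) :width 70
                     :height (* scale total) :fill "#ddd"}]
             ;; narrower bars in front, both starting from the same zero line
             [:rect {:x (+ x 5) :y (- 150 (* scale women)) :width 30
                     :height (* scale women) :fill "#d95f02"}]
             [:rect {:x (+ x 40) :y (- 150 (* scale men)) :width 30
                     :height (* scale men) :fill "#7570b3"}]
             ;; group label under the bars
             [:text {:x (+ x 30) :y 160} label]]))
        data)])
    
    (spit "grouped-with-totals.html" (html chart))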



    The contents of the book are of course not world-shattering. But that's not the point of the book. For people new to the field it's a great addition to their library (and I learned a thing or two myself as well). If you're interested in data visualization, go out and get it.

  10. UPDATE: I encountered a blog post by Martin Theus describing a very similar approach for looking at this same data (see here).

    Disclaimer 1: This is a (very!) quick hack. No effort was put into it whatsoever regarding aesthetics, interactivity, scaling (e.g. in the barcharts), ... I just wanted to get a very broad view of what happened during the Tour de France (= the biggest cycling event of the year).
    Disclaimer 2: I don't know anything about cycling. It was actually my wife who had to point out to me which riders could be interesting to highlight in the visualization. But that also meant that this could be an interesting way for me to learn something about the Tour.




    Data was copied from the Tour de France website (e.g. for the 1st stage). The visualization was created in Processing.

    The parallel coordinate plot shows the standings of all riders over all 21 stages. No data was available for stage 2, because that was a team time-trial (so discard that one). At the top is the rider who came first, at the bottom the rider who came last. Below the coordinate plot are little barcharts displaying the distribution of arrival times (in "number of seconds later than the winner") for all riders in that stage.

    The highlighted riders are: Cavendish (red), Evans (orange), Gilbert (yellow), Andy Schleck (light blue) and Frank Schleck (dark blue).

    So what was I able to learn from this?

    • Based on the barcharts you can guess which stages were in the mountains, and which weren't. You'd expect the riders to become much more separated in the mountains than on the flat. In the very last stage in Paris, for example, everyone seems to have arrived in one big group, whereas for stages 12-14 the riders were much more spread out. So my guess (and that's confirmed by checking this on the TourDeFrance website :-) is that those were mountain stages.
    • You can see clear groups of riders who behave the same. There is for example a clear group of riders who performed quite badly in stage 19 but much better in stage 20 (and badly in 21 again).
    • As the parallel coordinate plots were scaled according to the initial number of riders, we can clearly see how people left the Tour: the "bottom" of the later stages stays empty.
    • We see that Cavendish (red) has a very erratic performance, and it seems to coincide with stages where the arrival times are spread out (= mountain stages?). This could mean that Cavendish is good on the flats, but bad in the mountains. Question to those who know something about cycling: is that true?
    • Philippe Gilbert started well (both on the flats and in the mountains), but became more erratic halfway through the Tour.
Welcome
Hi there, and welcome to SaaienTist, a blog by me, for me and you. It started out long ago as a personal notebook to help me remember how to do things, but it evolved to cover more opinionated posts as well. After a hiatus of 3 to 4 years (basically since I started my current position in Belgium), I'm resurrecting it to help me organize my thoughts. It might or might not be useful to you.

Why "Saaien tist"? Because it's pronounced as 'scientist', and means 'boring bloke' in Flemish.