I was invited last week to give a talk at this year's meeting of the Graduate School Structure and Function of Biological Macromolecules, Bioinformatics and Modeling (SFMBBM). It ended up being a day of great talks by some bright PhD students and postdocs. There were two keynotes (one by Prof Bert Poolman from Groningen (NL) and one by myself), and a panel discussion on what the future holds for people nearing the end of their PhDs.

My talk was titled "Humanizing Bioinformatics" and was received quite well (at least some people still laughed at my jokes, if you can call them that, even at the end). I put the slides up on slideshare, but I thought I'd explain things here as well, because the slides on their own probably won't convey the complete story.

Let's ruin the plot by mentioning it here: we need data visualization to counteract the alienation that's happening between bioinformaticians and bright data miners on the one hand, and the user/clinician/biologist on the other. We need to make bioinformatics human again.

Microsoft Research published a very interesting book built around Jim Gray's vision, "The Fourth Paradigm - Data-Intensive Scientific Discovery". Get it. Read it. It describes how the practice of research has changed over the centuries. In the First Paradigm, science was very much about describing things; the Second Paradigm (the last couple of centuries) saw a more theoretical approach, with people like Kepler and Newton defining "laws" that described the universe around them. The last few decades saw the advent of computation in research, which allowed us to take a closer look at reality by simulating it (the Third Paradigm). But just recently - so Jim Gray says - we are moving into yet another fundamental way of doing science: an age in which so much data is generated that we don't know what to do with it. This Fourth Paradigm is that of data exploration. As I see it (but that's just one way of looking at it, and it says nothing about which is "better"), this might be a definition for the difference between computational biology and bioinformatics: computational biology fits within the Third Paradigm, while bioinformatics fits in the Fourth.

Being able to generate these huge amounts of data automatically (e.g. in genome sequencing) means that biologists have to work with ever bigger datasets, using ever more advanced algorithms built on ever more complicated data structures. This is not about just some summary statistics anymore; it's support vector machine recursive feature elimination, manifold learning, adaptive cascade sharing trees and the like. Result: the biologist is at a loss. Remember Dr McCoy in Star Trek saying "Dammit Jim, I'm a doctor, not an electrician/cook/nuclear physicist" whenever the captain made him do things that were - well - not doctorly? (A great analogy found by Christophe Lambert.) It's exactly the same for a clinician nowadays. In order to do a (his job: e.g. decide on a treatment plan for a cancer patient), he first has to do b (set up hardware that can handle hundreds of gigabytes of data) and c (devise some nifty data mining trickery to get his results) - neither of which he has the time or the training for. "Dammit Jim, I'm a doctor, not a bioinformatician." Result: we're alienating the user. Data mining has become so complicated and advanced that the clinician is at a complete loss. Heck, I work at a bioinformatics department and don't understand half of what they're talking about. So what can the clinician do? His only option is to trust some bioinformatician to come up with results. But this is blind trust: he has no way of assessing the results he gets back. This trust is even more blind than the one you give the guy who repairs your car.

As I see it, there are (at least) four issues.

What's the question?
Data generation used to be geared towards proving or disproving a specific hypothesis. The researcher would have a question, formulate a hypothesis around it, and then generate data. Although that same data could often be used to answer other, unanticipated questions as well, this only really became an issue with easy, automated data generation; DNA sequencing being a prime example. You might ask yourself "does this or that gene have a mutation that leads to this disease?", but the data you generate to answer this question (in this case, exome sequences) can be used to answer hundreds of other questions as well. You just don't know which questions yet...
Statistical analysis and data mining are indispensable for (dis)proving hypotheses, but what if we don't know the hypothesis? Like many others in the field, I believe that data visualization can give us clues about what to investigate further.



Let's look, for example, at this hive plot by Martin Krzywinski (for what B means: see the explanation at the hive plot website). Suppose you're given a list of genes in E. coli (or a list of functions in the Linux operating system) and the network between those genes (or functions). Using clever visualization, we can define interesting questions that we can then look into using statistics or data mining. For example: why do we see so many workhorse genes in E. coli? Does this reflect reality, and what would that mean? Or does it mean that our input network is biased? What is so special about the very small number of workhorse functions in Linux that have such high connectivity? These are the questions that need to be presented to us.
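To make that concrete, here is a minimal sketch (in Python, using networkx) of the kind of connectivity ranking that sits behind such a plot. The input file name and the 1% cut-off are purely hypothetical, and this is of course not Krzywinski's actual pipeline; it just shows how quickly a list of "workhorse" candidates falls out of a network once you look at node degree.

    # A minimal sketch, assuming a hypothetical whitespace-separated edge list
    # ("gene_a gene_b" per line). Rank genes by connectivity to spot the
    # "workhorse" candidates that a hive plot makes visible at a glance.
    import networkx as nx

    graph = nx.read_edgelist("ecoli_network.txt")  # hypothetical input file

    ranked = sorted(graph.degree(), key=lambda pair: pair[1], reverse=True)

    # Crude cut-off: call the top 1% most connected genes "workhorses".
    cutoff = max(1, len(ranked) // 100)
    workhorses = [gene for gene, degree in ranked[:cutoff]]
    print(len(workhorses), "highly connected genes, e.g.", workhorses[:10])

The ranking itself answers nothing, of course; it is the picture built on top of it that suggests which of those questions are worth chasing.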

What parameters should I use?
Second issue: the outcome of most data mining/filtering algorithms depends tremendously on choosing the right parameters, but it can be very difficult to find out what those parameters should be. Does a "right" set of parameters for this or that algorithm even exist? Also, tweaking some parameters just a little bit can have vast effects on the results, while you can change other parameters as much as you want without affecting the outcome whatsoever.
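A small sketch of what I mean, on made-up data: a hypothetical variant filter with two parameters, swept over a handful of settings. None of the thresholds or field names are real; the only point is that looking at the outcome for every combination is the bare minimum before trusting any single setting.

    # A minimal parameter sweep over a hypothetical quality/depth filter,
    # run on randomly generated "variant calls". Illustration only.
    import random

    random.seed(42)
    calls = [{"qual": random.gauss(30, 10), "depth": random.randint(1, 60)}
             for _ in range(10_000)]

    def apply_filter(calls, min_qual, min_depth):
        return [c for c in calls if c["qual"] >= min_qual and c["depth"] >= min_depth]

    for min_qual in (10, 20, 30, 40):
        for min_depth in (5, 10):
            kept = len(apply_filter(calls, min_qual, min_depth))
            print(f"min_qual={min_qual:>2} min_depth={min_depth:>2} -> {kept} calls kept")

Even on fake data you can see that one threshold barely matters while the other one changes everything, and that is exactly the kind of behaviour the end-user never gets to see.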

(Figure from Turnbull et al., Nature Genetics, 2010)
Can I trust this output?
Issue number 3: if I am a clinician/biologist and a bioinformatician hands me some results, how do I know whether I can trust them? Heck, being a bioinformatician myself and writing a program to filter putative SNPs, how do I know that my own results are correct? Suppose there are three filters that I can apply consecutively, with different combinations of settings.



Looking at exome data, the main information we can use to assess the results of SNP filtering is that you should end up with 20k-25k SNPs and a transition/transversion ratio of about 2.1 (if I remember correctly). But many different combinations of filters can give those summary statistics. The state of the art (believe it or not) is to just run many different algorithms and filters independently, and then take the intersection of the results...
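For illustration, here is a minimal sketch of those two sanity checks plus the "intersect everything" approach. The call sets, pipeline names and thresholds are all hypothetical; real pipelines work on VCF files and are considerably more involved.

    # Sanity-check a SNP call set (count and Ts/Tv ratio) and intersect the
    # calls from hypothetical independent pipelines. Illustration only.
    PURINES = {"A", "G"}

    def ts_tv_ratio(calls):
        # Transition: both alleles purines, or both pyrimidines.
        ts = sum(1 for _, _, ref, alt in calls if (ref in PURINES) == (alt in PURINES))
        tv = len(calls) - ts
        return ts / tv if tv else float("inf")

    def sanity_check(calls):
        n, ratio = len(calls), ts_tv_ratio(calls)
        ok = 20_000 <= n <= 25_000 and 2.0 <= ratio <= 2.2
        print(f"{n} SNPs, Ts/Tv = {ratio:.2f} -> {'plausible' if ok else 'suspicious'}")

    # Made-up call sets of (chrom, pos, ref, alt) tuples from two "pipelines":
    pipeline_a = {("chr1", 12345, "A", "G"), ("chr1", 22222, "C", "T"), ("chr2", 555, "A", "C")}
    pipeline_b = {("chr1", 12345, "A", "G"), ("chr2", 555, "A", "C"), ("chr3", 999, "G", "T")}
    sanity_check(pipeline_a & pipeline_b)  # far too few SNPs for an exome -> "suspicious"

Note that the check only tells you whether the final numbers look plausible; it says nothing about whether the individual calls are actually correct, which is exactly the problem.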

I can't wrap my head around this...
And finally, there's the issue of too much information. Not just the sheer amount, but the number of different data sources. It's actually not really too much information per se, but too much to keep in one head. Someone trying to decide on a treatment plan for a cancer patient, for example, will have to combine data from heterogeneous datasets, at multiple abstraction levels and from multiple sources. He'll have to look into patient and clinical data, family/population data, MR/CT/X-ray scans, tissue samples, gene expression data and pathways. That's just too much. His cognitive capacities are fully engaged in trying to integrate all that information, rather than in answering the initial question.

Visualization... part of the solution
I'm not saying anything new here when I suggest that data visualization might be part of the solution to these problems. Where current technologies and analysis methods have alienated the end-user from his own results, visualization can reach over and bridge that gap. The rest of the presentation is basically about some basic principles of data visualization, which I won't go into further here.

All in all, I think the presentation went quite well.







Comments (4)

  1. Very interesting post. I would additionally suggest that it's not just a case of ensuring that the outputs of data exploration are digestible to users, but of involving users in the exploration process.

    We also need a continuum of people with overlapping areas of knowledge/expertise, rather than distinct camps trying to find best ways of communicating with each other. I'm an epidemiologist/clinical outcomes researcher so I figure I sit somewhere in the middle of the spectrum. While my epi training helps me get to grips with what the clinicians want/need to know, it hasn't equipped me with sufficient statistical/informatics tools to handle the 'big data' that I now have access to. I don't (yet) have the toolkit to match my imagination regarding what I could do with this data. I'm working to change that, but I think we need more people like yourself whose expertise cuts across different knowledge domains/tribes.

    Thanks for recommending that book - looks like useful further reading. :)

  2. I completely agree with your post, especially with regard to the visualization. Something I noticed is the number of answers that can emerge from an appropriate visualization, even to questions we didn't think about when analyzing the data. I previously worked on gene-set analysis of microarray data, and one thing I want to point out, in agreement with your post, is that there are several ways to "group" genes with regard to specific questions (same transcription factor, signature genes for a pathology, pathway involved, ...), and the criteria used to group genes into sets have consequences for the mathematical structure of the data subset analyzed. As several algorithms exist to analyze gene sets, relying on several models, it is very informative to compare the assumptions of each algorithm, as some are especially appropriate for certain types of mathematical structure. For example, coordinated changes with correlated values (such as signature genes previously found in clustering studies) are best analyzed with a multivariate procedure focusing on the sample categories, whereas genes belonging to a common pathway may be regulated differently, such that in a multivariate analysis the sample * gene interaction is the best way to detect it...

  3. Spot on. Great post.

    As an employee of a bioinformatics firm (we develop software and provide services for the health industry, academia and private research centers), this is a problem we encounter on a daily basis. It's common to have sequencing services and customers who have just compiled big datasets and are at a complete loss as to what to do next.

    From my experience, the solution most of the time comes down to taking the time to explain the different computational and visualization approaches that one can apply (e.g. kernel methods, graph analysis, language-theory approaches with training stages, etc.) and what sort of information you are able to extract with each of them. Right now, as has been commented previously, there is a big need for multidisciplinary teams that share, at least, a common lingo. There has been much talk about biologists, physicians or geneticists needing to develop their computer/programming skills, but computer scientists and statisticians, at the other end, also need to get a deeper understanding of what it is they are modeling. Cold, raw data only takes you so far.

  4. I am working in a lab that produces large-scale (interaction) data, so these sorts of questions come up a lot with collaborators. Many people expect a "magical button" that, when pressed, will present them with the insight(TM) to be gained from their dataset. There is a huge disconnect between the data gathering methods, the questions and the analysis tools. A lot of frustrated expectations.


Ryo Sakai reminded me a couple of weeks ago about Simon Sinek's excellent TED talk "Start With Why - How Great Leaders Inspire Action"; which inspired this post... Why do I do what I do?

The way data can be analysed has been automated more and more over the last few decades. Advances in machine learning and statistics make it possible to extract a lot of information from large datasets. But are we starting to rely too much on those algorithms? Several issues seem to pop up more and more. For one thing, research in algorithm design has enabled many more applications, but at the same time has made these algorithms so complex that they start to operate as black boxes - not only to the end-user who provides the data, but even to the algorithm developer.

"I'll do Angelina Jolie". Never thought I'd say that phrase while talking to well-known Belgian cartoonists, and actually be taken serious.

Backtrack about one year. We're at the table with the crème de la crème of Belgium's cartoon world (Zaza, Erwin Vanmol, LECTRR, Eva Mouton, ...), in a hotel in Knokke near the coast. "We" is a gathering of researchers covering genetics, bioinformatics, ethics and law. The setup: the Knokke-Heist International Cartoon Festival.

We could still use more applicants for this position, so bumping the open position...

SymBioSys is a consortium of computational scientists and molecular biologists at the University of Leuven, Belgium focusing on how individual genomic variation leads to disease through cascading effects across biological networks (in specific types of constitutional disorders and cancers). We develop innovative computational strategies for next-generation sequencing and biological network analysis, with demonstrated impact on actual biological breakthroughs.

Since the publication of the human genome sequence about a decade ago, the popular press has reported on many occasions about genes allegedly found for things ranging from breast size, intelligence, popularity and homosexuality to fidgeting. The general population is constantly told that the revolution is just around the corner.

Bit of a technical post for my own reference, about visualization and scripting in clojure.

Clojure and visualization

Being interested in clojure, I was intrigued last week by a tweet from Francesco Strozzi (@fstrozzi): "A D3 like #dataviz project for #clojure. Codename C2 and looks promising. http://keminglabs.com/c2/. They need contribs so spread the word!" I tried a while ago to do some stuff in D3, but the javascript got in the way so I gave up after a while.

Finally time to write something about the biovis/visweek conference I attended about a week ago in Providence (RI)... And I must say: they'll see me again next year. (Hopefully @infosthetics will be able to join me then). Meanwhile, several blog posts are popping up discussing it (see here and here, for example).

This was the first time that biovis (aka the IEEE Symposium on Biological Data Visualization) was organized.


Last Friday I received my long-anticipated copy of "Visualize This" by Nathan Yau. On its website it is described as a "practical guide on visualization and how to approach real-world data". You can guess what my weekend looked like :-)

Overall, I believe this book is a very good choice for people interested in getting started in data visualization.

UPDATE: I encountered a blog post by Martin Theus describing a very similar approach for looking at this same data (see here).

Disclaimer 1: This is a (very!) quick hack. No effort whatsoever was put into aesthetics, interactivity, scaling (e.g. in the barcharts), ... I just wanted to get a very broad view of what happened during the Tour de France (= the biggest cycling event of the year).

Disclaimer 2: I don't know anything about cycling.
Welcome
Hi there, and welcome to SaaienTist, a blog by me, for me and you. It started out long ago as a personal notebook to help me remember how to do things, but evolved to cover more opinionated posts as well. After a hiatus of 3 to 4 years (basically since I started my current position in Belgium), I'm resurrecting it to help me organize my thoughts. It might or might not be useful to you.

Why "Saaien tist"? Because it's pronounced as 'scientist', and means 'boring bloke' in Flemish.