Ryo Sakai reminded me a couple of weeks ago of Simon Sinek's excellent TED talk "Start With Why - How Great Leaders Inspire Action", which inspired this post... Why do I do what I do?
Data analysis has become more and more automated over the last few decades. Advances in machine learning and statistics make it possible to gain a lot of information from large datasets. But are we starting to rely too much on those algorithms? Different issues seem to pop up more and more. For one thing, research in algorithm design has enabled many more applications, but at the same time it makes these algorithms so complex that they start to operate as black boxes, not only to the end user who provides the data but even to the algorithm developer. Another issue with pre-defined algorithms is that having them around precludes us from identifying unexpected patterns. If the algorithm or statistical test is not specifically written to find a certain type of pattern, it will not find it. Third issue: (arbitrary) cutoffs. Many algorithms rely heavily on the user (or even worse: the developer) defining a set of cutoff values. This is true in machine learning as well as statistics. A statistical test returning a p-value of 4.99% is considered "statistically significant", but you'd throw away your data if that p-value were 5.01%. What's so intrinsic about 5% that it makes you choose between "yes, this is good" and "let's throw our hypothesis out the window"? All in all, much of this comes back to the fragility of using computers (hat tip to Toni for the book by Nassim Taleb): you have to tell them what to do and what to expect. They're not resilient to changes in setting, data, prior knowledge, etc.; at least not as much as we are.
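To make the cutoff problem concrete, here is a minimal Python sketch. It assumes a plain two-sided z-test with the conventional 5% threshold; the two test statistics and the helper names `two_sided_p` and `verdict` are made up for illustration and don't come from any real analysis.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def verdict(p: float, alpha: float = 0.05) -> str:
    """The binary call that a fixed cutoff forces on us."""
    return "significant" if p < alpha else "not significant"


# Two hypothetical test statistics that differ by only 0.002.
for z in (1.961, 1.959):
    p = two_sided_p(z)
    print(f"z = {z:.3f}  p = {p:.4f}  ->  {verdict(p)}")
```

The two p-values land just under and just over 5%, so the evidence is practically identical, yet the cutoff returns opposite answers. A human looking at both numbers side by side would not treat them that differently.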
So where does this bring us? It's my firm belief that we need to put the human back in the loop of data analysis. Yes, we need statistics. Yes, we need machine learning. But also: yes, we need a human individual to actually make sense of the data and drive the analysis. To make this possible, I focus on visual design, interaction design, and scalability. Visual design because the representation of data in many cases needs improvement to be able to cope with high-dimensional data; interaction design because it's often by "playing" with the data that the user can gain insights; and scalability because it's not trivial to process big data fast enough that we can get interactivity.