Tuesday, 20 October 2015


Quite a while ago, I received an email from Samantha R. from Udemy pointing me towards this article, discussing the "difference between data science and statistics" (I have to confess that I don't really know Udemy, apart from having looked at that article and, despite having quickly searched for her, I wasn't able to find any link or additional information). She asked me to comment on the article, which I am doing now, over a month late $-$ apologies Samantha, if you're reading this!

So: I have to say that, while I don't think it's fair or wise to just discard the whole of "data science" as a re-branding of statistics, I don't agree 100% with some of the points raised in the article. For example, I am not sure I buy the distinction between statistics as a discipline of the old world and data science (DS) as one for the modern world. Certainly, if a fundamental connotation of DS is computing, then obviously it will be relevant to the modern world, where cheap(er) and (very) powerful computers are available. But I certainly think that the same applies to statistics too.

I am not sure about the distinction between "dealing with" and "analysing" data, either. In my day-to-day job as a (proud!) statistician, I do have to do lots of dealing with data $-$ one of the most obvious examples is our work on administrative databases (e.g. THIN for our work on the Regression Discontinuity Design in epidemiology); eventually, the dataset becomes very rich, with lots of potential for interesting analysis. But how we get there is an equally long (if not longer!) process in which we do lots of dealing with the "dirt".

The third point I'm really not convinced by is when Samantha says that "Statistics, on the other hand, has not changed significantly in response to new technology. The field continues to emphasize theory, and introductory statistics courses focus more on hypothesis testing than statistical computing." It seems to me that this is far from true: in our introductory courses we actually place a lot more emphasis on computing than we did 10-15 years ago. And computation is playing a more and more central role in the development of statistics $-$ with amazing results, of which the couple I'm most familiar with are Stan and INLA. I would definitely see these developments as statistics $-$ definitely not DS.

In general, I think that the main premise of DS (as I understand it), that data should be queried to tell you things, takes away basically all the fun of my job, which is about modelling: making assumptions which you need to carefully justify so that other people are persuaded that they are reasonable.

Still, I think there's plenty of data for statisticians and data scientists to co-inhabit this world and I most certainly don't take a "Daily Mail-esque" view that data scientists are coming to take our jobs and steal our women. I think I am allowed to say this, as somebody who has actually come to a different country to steal their statistical jobs $-$ at least I had the decency of bringing my own woman with me (well, actually Marta did bring her man with her, as when we moved to London she was the one with a job. But that's another story...).
