
2001: A Data Culture War

Dan Lee
Jan 6, 2022


When it comes to the world of data science, I would still consider myself a baby. It’s been about nine months since I delved into this sphere, so at least I could be considered a full-term baby. One of the best parts of learning a new discipline is discovering the classics. Leo Breiman’s paper ‘Statistical Modeling: The Two Cultures’ (2001) is one of them. Breiman, an academic probabilist, left the university for full-time freelance consulting, and after 13 years returned to academia to find a world of difference in the methods used to solve data problems.

Old School vs. New School

Breiman takes aim at the methods his field treats as scripture. The emphasis on data modeling has been so heavy, he argues, that the pendulum has swung too far. He makes it quite clear that there are appropriate scenarios for linear regression or logistic regression. However, the culture has yoked itself to data modeling so tightly that the approach has become a crutch. He quotes the old saying, “If all a man has is a hammer, then every problem looks like a nail.” By leaning too heavily on these methods, scientists end up using the wrong tool for the problem at hand.

Enter more tools for the toolbox. Breiman introduces algorithmic models such as random forests and support vector machines, giving brief yet detailed descriptions of their methodology. The emphasis, however, is not on methodology but on performance. He gives a number of real-life examples where datasets previously analyzed with data models were missing the mark entirely. In two scenarios, the models misidentified which variables carried predictive importance, whereas the comparatively simple technique of bootstrap resampling identified the correct ones.
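To make the resampling idea concrete, here is a minimal sketch, entirely my own rather than anything from the paper: it uses scikit-learn and a synthetic dataset I invented in which only the first two of five variables actually drive the response. A random forest, whose trees are each grown on bootstrap samples of the data, recovers that fact through its importance scores.

```python
# Hedged sketch: five candidate variables, but only x0 and x1 drive y.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # five candidate predictors
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

# Each tree in the forest is fit on a bootstrap sample; averaging the
# impurity-based importances across trees ranks the variables.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for i, imp in enumerate(forest.feature_importances_):
    print(f"x{i}: importance = {imp:.3f}")
```

Running this prints importances that are large for x0 and x1 and near zero for the pure noise variables, exactly the kind of signal Breiman argues a misspecified data model can miss.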

Handling the Black Box

Breiman frames the split as two philosophies for dealing with the same black box: nature’s unknown operations. In data analysis, nature acts upon the input variables and the results are measured. Data modeling attempts to describe how nature acted upon those variables by fitting its operations to a stochastic model. Breiman’s critique of this philosophy is that nature is far too mysterious and complex to always be described by such a model (especially one where simplicity is favored).

Algorithmic modeling, by contrast, leaves nature as an unknown and worries only about finding a function that accurately predicts the response from the input variables. It bypasses the black box entirely.
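As a toy illustration of the two philosophies (again my own sketch, not Breiman’s; the nonlinear ground truth and the scikit-learn models are assumptions for demonstration), the data modeler posits an explicit form for nature’s mechanism, here a linear one, while the algorithmic modeler only asks which function predicts best on held-out data:

```python
# Toy contrast of the two cultures on data whose mechanism is unknown to both.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.2, size=1000)  # the black box

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Data modeling culture: assume nature is, say, linear, and fit that model.
linear = LinearRegression().fit(X_train, y_train)

# Algorithmic modeling culture: leave the mechanism unknown, optimize prediction.
forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)

print("linear R^2:", linear.score(X_test, y_test))  # poor: the assumed form is wrong
print("forest R^2:", forest.score(X_test, y_test))  # strong: no form was assumed
```

The point is not that linear models are bad, only that when the assumed model does not match nature, validated predictive accuracy exposes the mismatch.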

Breiman believes that many scientists will balk at this methodology because they value interpretability: it doesn’t matter how accurate the predictions are if you can’t explain why they work. Breiman contends instead that a scientist’s chief duty is to get accuracy right first and worry about interpreting the results later. His years in consulting no doubt influenced this view, because his clients paid him to be right.

Final Thoughts

Breiman’s paper is an excellent read for anyone starting to get a handle on data science methods and terminology. It is fascinating to read the arguments of an old-school statistician who embraced the new school of algorithmic models, and to recognize the power those models have wielded in creating and transforming the field of data science.
