[Paper Exploration] Statistical Modeling: The Two Cultures

Abstract

There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Author: Leo Breiman

Published in 2001

Leo Breiman

Leo Breiman was an influential American statistician and professor, best known for his contributions to statistics and machine learning.
His work spans several areas, including classification and regression trees, ensemble learning methods, and random forests.
One of Breiman’s most notable contributions is the development of the Random Forest algorithm, introduced in his seminal paper “Random Forests” published in 2001.
His ideas continue to be studied, extended, and applied in various domains, contributing to the advancement of data science and predictive modeling.

The Two Cultures

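Statistics starts with data.
Think of the data as generated by a black box: input (predictor) variables go in one side and response variables come out the other. Inside the box, nature functions to associate the predictor variables with the response variables.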
There are two goals in analyzing the data:
- Prediction: To be able to predict what the responses are going to be to future input variables
- Information: To extract some information about how nature is associating the response variables to the input variables.

The Data Modeling Culture

The analysis in this culture starts by assuming a stochastic data model for the inside of the black box.
Response variables = f(predictor variables, random noise, parameters)
The values of the parameters are estimated from the data, and the model is then used for information and/or prediction.
Model validation. Yes–no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.
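
As a rough illustration of the data modeling workflow described above, here is a minimal sketch in plain Python. It assumes a simple linear data model, y = intercept + slope * x + Gaussian noise, which is my choice for illustration and not a model taken from the paper. The function names (`fit_simple_linear_model`, `residuals`, `r_squared`) and the simulated data are hypothetical. The parameters are estimated by least squares, and the fit is then checked through the residuals and R², in the spirit of goodness-of-fit validation.

```python
import random

def fit_simple_linear_model(x, y):
    """Least-squares estimates for the assumed model y = intercept + slope * x + noise."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed response minus the response the fitted model predicts."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def r_squared(y, resid):
    """Share of the variance in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum(r ** 2 for r in resid)
    return 1 - ss_resid / ss_total

# Simulated data (an assumption for this sketch): the "truth" really is linear.
random.seed(0)
x = [random.uniform(0, 10) for _ in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5) for xi in x]

intercept, slope = fit_simple_linear_model(x, y)
resid = residuals(x, y, intercept, slope)
fit_quality = r_squared(y, resid)

print(f"intercept={intercept:.3f}, slope={slope:.3f}, R^2={fit_quality:.3f}")
```

Note that the validation here looks only at how well the assumed model fits the data it was estimated from. Breiman's criticism is that a good fit of this kind does not show that the assumed model matches nature's mechanism.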

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown.
Their approach is to find a function f(x): an algorithm that operates on x to predict the responses y.
Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.
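
To make the contrast concrete, here is a minimal sketch of the algorithmic approach in plain Python. It uses k-nearest-neighbors regression as f(x), which is my choice for illustration (the paper discusses other algorithms, such as trees and neural nets). The function names, the simulated sine-shaped data, and the value k=5 are all assumptions. No form is assumed for the mechanism inside the box, and the model is judged only by its error on held-out test data.

```python
import math
import random

def knn_predict(train_x, train_y, query, k=5):
    """Predict y at `query` as the mean response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted responses."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Simulated data (an assumption for this sketch): a nonlinear mechanism plus noise.
random.seed(1)
xs = [random.uniform(0, 2 * math.pi) for _ in range(300)]
ys = [math.sin(x) + random.gauss(0, 0.1) for x in xs]

# Hold out the last 100 points; they are never used to build the predictor.
train_x, train_y = xs[:200], ys[:200]
test_x, test_y = xs[200:], ys[200:]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
test_mse = mean_squared_error(test_y, predictions)

# Naive baseline for comparison: always predict the training mean.
train_mean = sum(train_y) / len(train_y)
baseline_mse = mean_squared_error(test_y, [train_mean] * len(test_y))

print(f"kNN test MSE={test_mse:.4f}, baseline test MSE={baseline_mse:.4f}")
```

The predictor has no interpretable parameters to report. Its only claim to validity is that it predicts unseen responses better than the baseline, which is the validation criterion this culture uses.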

Motivation (Why review this paper?)

I graduated with a Bachelor’s in Mathematics when the algorithmic modeling culture was not commonplace.
I graduated with a Master’s in Machine Learning at a time when the algorithmic modeling culture was everywhere.
I work at a company that delivers Deep Learning solutions but relies on the data modeling culture for its own solution work (experimentation, distribution, sampling, etc.).
I work with students in Nepal who want to implement Large Language Models, sophisticated Deep Learning Models, but do not want to learn about foundational statistics/linear algebra/calculus/optimization.
In many ways, I am currently trying to ask students in Nepal to do the exact opposite of what Breiman had to do with this paper. The 98%-2% might have flipped on its head. I do not like it.

For the last point, I am not implying that prediction accuracy is unimportant or should not be pursued. I am implying that it is not the only thing to pursue. With the advancement of ML packages, that pursuit now has several shortcuts. This not only makes the data model a black box, but also makes the machine learning implementation a black box.

Breiman’s call to statisticians to join the 2%

Breiman argues that the focus in the statistical community on data models has:
- Led to irrelevant theory and questionable scientific conclusions
- Kept statisticians from using more suitable algorithmic models
- Prevented statisticians from working on exciting new problems

Breiman as a consultant

Breiman’s experiences as a consultant formed his views about algorithmic modeling.
Breiman’s perceptions of statistical analysis:
- Focus on finding a good solution—that’s what consultants get paid for.
- Live with the data before you plunge into modeling.
- Search for a model that gives a good solution, either algorithmic or data.
- Predictive accuracy on test sets is the criterion for how good the model is.
- Computers are an indispensable partner

Breiman after returning to University

Ayush Subedi
