It Can’t All Be Data Snooping?

Steve LeCompte | December 7, 2018 | Posted in: Big Ideas

Key Insight

Chen estimates that data snooping alone would require about 100 trillion unpublished factors for each published one to explain the 300+ published return predictors, implying that 10,000 researchers, each generating a factor per minute, would need 15 million years (2018 calibration; pure luck model).

Is it possible that all of the 300+ published factors that predict stock returns (such as size, value, profitability, investment, momentum…) derive from data snooping? In his October 2018 paper entitled “The Limits of Data Mining: A Thought Experiment”, Andrew Chen estimates how much data snooping would be required to “discover” all these factors by pure luck. Specifically, he calibrates a pure luck model built on the assumption that the probability of publishing a factor discovery increases with how convincing the discovery is, as measured by its t-statistic. Using this model, he estimates the number of unpublished factor studies required for the published set to be attributable to pure luck. He considers two sets of factor t-statistics: (1) 156 from factor replications via equal-weighted long-short extreme fifths (quintiles) of factor stock sorts; and, (2) a hand-collected set of 316 from published factor studies. Using the specified approach and these two sets of t-statistics, he finds that:

The pure luck model indicates that the ratio of unpublished to published factors is about 100 trillion to 1.
Attributing the published set of factors to pure luck would therefore take 10,000 researchers, each generating a factor per minute, 15 million years.
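
The arithmetic behind these two findings can be checked with a short sketch. It is only a back-of-the-envelope check, not a calculation taken from the paper: the 365-day year, the assumption of non-stop work and all names in the code are illustrative assumptions.

```python
# Back-of-the-envelope check of the researcher-years arithmetic.
# Assumptions (not from the paper): a 365-day year and researchers
# who generate factors around the clock without pause.
MINUTES_PER_YEAR = 365 * 24 * 60


def factors_generated(researchers: int, years: float, per_minute: float = 1.0) -> float:
    """Total factors produced by a pool of researchers over a span of years."""
    return researchers * per_minute * MINUTES_PER_YEAR * years


# Figures quoted in the summary above.
total_factors = factors_generated(researchers=10_000, years=15_000_000)
unpublished_per_published = 100e12  # "about 100 trillion to 1"

# Number of published factors this volume of snooping would support.
implied_published = total_factors / unpublished_per_published

print(f"Total factors generated: {total_factors:.3e}")
print(f"Implied published factors: {implied_published:.0f}")
```

Under these assumptions, the researcher pool generates roughly 7.9 × 10^16 factors. At about 100 trillion unpublished factors per published one, that supports several hundred published factors, the same order of magnitude as the 300+ in the literature.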

In summary, sanity checks indicate that data snooping alone cannot explain the large zoo of stock return anomalies.
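
For intuition on why the required amount of snooping grows so quickly, the following sketch computes how many pure-luck attempts are needed, on average, to produce one t-statistic of a given size. It is a deliberately simplified illustration and not Chen's calibrated model: it assumes that t-statistics of pure-luck factors are independent draws from a standard normal distribution, and it ignores the publication-probability component described above. The function names are hypothetical.

```python
import math


def two_sided_tail(t: float) -> float:
    """P(|Z| > t) for a standard normal Z (assumed pure-luck distribution)."""
    return math.erfc(t / math.sqrt(2.0))


def expected_trials(t: float) -> float:
    """Average number of independent pure-luck draws needed to get |t-stat| > t."""
    return 1.0 / two_sided_tail(t)


# Illustrative thresholds only; they are not values taken from the paper.
for threshold in (2.0, 4.0, 6.0, 8.0):
    print(f"|t| > {threshold:.0f}: about {expected_trials(threshold):.2e} attempts")
```

The number of attempts rises far faster than the threshold itself, so even a modest number of very large published t-statistics implies an enormous pool of unpublished failures under a pure-luck explanation.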

Cautions regarding findings include:

The analysis assumes that discovered factors are independent, which may not be correct. In other words, the number of known independent factors may be much smaller than the 156 and 316 used in the paper.
The analysis ignores systematic limitations in academic factor backtests related to trading frictions and shorting costs/constraints. Most studies ignore these costs. Moreover, the broad use of equal-weight portfolios to generate t-statistics (magnifying the importance of microcaps) exacerbates these limitations. In other words, many published factors may be discoveries not of exploitable return anomalies but of limits to arbitrage on gross stock return differences.
The community of factor researchers is global, not limited to the U.S. as implied in the paper.
