Investors typically employ backtests to estimate future performance of investment strategies. Two approaches to assess in-sample optimization bias in such backtests are:
- Reserve (hold out) some of the historical data for out-of-sample testing. However, surreptitious use of hold-out data, whether direct or indirect (as when strategy construction draws on the work of others), may contaminate its independence. Moreover, small samples result in even smaller in-sample and hold-out subsamples.
- Randomize the data for Monte Carlo testing. However, randomization assumptions may distort the data and destroy real patterns in them, and the process is time-consuming.
Is there a better way to assess data snooping bias? In their September 2013 paper entitled “The Probability of Backtest Overfitting”, David Bailey, Jonathan Borwein, Marcos Lopez de Prado and Qiji Zhu derive an approach for assessing the probability of backtest overfitting that explicitly accounts for the number of trials (strategy alternatives) employed to select an apparently optimal strategy. They use Sharpe ratio to measure strategy attractiveness. They define an optimized strategy as overfitted if its out-of-sample Sharpe ratio is less than the median out-of-sample Sharpe ratio of all strategy alternatives considered. By this definition, overfitted backtests are harmful. Their process is very general, specifying multiple (in-sample) training and (out-of-sample) testing subsamples of equal size and reusing all training sets as testing sets and vice versa. Based on interpretation of mathematical derivations, they conclude that:
- The proposed approach is superior to conventional hold-out and randomization testing in that it:
- Explicitly takes into account the number of strategy alternatives considered (data snooping bias).
- Considers multiple start and stop dates for both in-sample and out-of-sample tests rather than a single arbitrary (and potentially lucky or unlucky) break point.
- Puts in-sample and out-of-sample testing on equal footing based on subsample length and role reversal.
- Preserves any serial correlation and seasonality actually present in the data.
- Empirical tests indicate that the proposed approach accurately detects overfitting.
- In applying the proposed approach, researchers should consider as many plausible strategy alternatives as possible, but must collect the out-of-sample performances of all alternatives in order to estimate the probability of overfitting.
In summary, mathematical derivation indicates that investors can estimate the probability that a strategy selected as optimal based on iterative backtests is harmfully overfitted by rigorously comparing its out-of-sample performance to the median out-of-sample performance of all strategy alternatives considered.
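For concreteness, the following is a minimal sketch (not the authors' reference implementation) of the paper's combinatorially symmetric cross-validation logic, assuming strategy trial returns are arranged as a T x N matrix (T periods, N strategy alternatives) and using annualized Sharpe ratio with an assumed zero risk-free rate; function and variable names are illustrative.

```python
# A minimal sketch of combinatorially symmetric cross-validation (CSCV);
# names and defaults are illustrative assumptions, not the authors' code.
import numpy as np
from itertools import combinations

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of each column of a returns matrix (zero risk-free rate assumed)."""
    mu, sigma = returns.mean(axis=0), returns.std(axis=0, ddof=1)
    return np.sqrt(periods_per_year) * mu / sigma

def probability_of_backtest_overfitting(returns, n_blocks=10):
    """Estimate the probability of backtest overfitting (PBO).

    returns : T x N array of per-period returns for N strategy alternatives.
    n_blocks: even number of (approximately) equal-size blocks the sample is split into;
              the number of splits grows combinatorially with n_blocks.
    """
    T, N = returns.shape
    blocks = np.array_split(np.arange(T), n_blocks)
    logits = []
    # Every choice of half the blocks serves as the in-sample set once;
    # its complement is the matching out-of-sample set, and vice versa.
    for in_idx in combinations(range(n_blocks), n_blocks // 2):
        out_idx = [b for b in range(n_blocks) if b not in in_idx]
        is_rows = np.concatenate([blocks[b] for b in in_idx])
        oos_rows = np.concatenate([blocks[b] for b in out_idx])
        is_perf = sharpe(returns[is_rows])
        oos_perf = sharpe(returns[oos_rows])
        best = np.argmax(is_perf)  # strategy selected as optimal in-sample
        # Relative rank of the selected strategy's out-of-sample Sharpe ratio.
        omega = np.sum(oos_perf <= oos_perf[best]) / (N + 1.0)
        logits.append(np.log(omega / (1.0 - omega)))
    # PBO: share of splits in which the in-sample winner falls at or below
    # the median out-of-sample Sharpe ratio of all alternatives.
    return np.mean(np.array(logits) <= 0.0)
```

Feeding such a function a matrix of, say, daily returns for all parameter variations actually tried would return the fraction of symmetric in-sample/out-of-sample splits in which the in-sample winner performs no better than the out-of-sample median, the paper's estimate of the probability of backtest overfitting.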
Cautions regarding conclusions include:
- Tracking the performance of all strategy alternatives considered requires considerable discipline and effort. It is not obvious that strategy offerors have sufficient incentive to do it.
- It may not be possible to estimate derived overfitting, whereby research done by others (who do not document the strategy alternatives they considered) stimulates selection of a strategy for further testing. In other words, imprecise communication among researchers makes the number of strategy alternatives tested against a given data set impossible to know.
- Assumptions (including use of Sharpe ratio) and derivations generally presume tame (well-behaved) variable distributions. To the extent that these distributions are wild, interpretation of statistics breaks down.
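As a rough illustration of this last caution, the following sketch (with hypothetical parameters) compares a Gaussian return stream to a fat-tailed Student's t stream rescaled to the same mean and volatility; the two typically show similar Sharpe ratios but very different worst-period losses, so a Sharpe-only comparison can understate differences in risk.

```python
# A minimal sketch (hypothetical parameters) of how fat tails weaken
# Sharpe-ratio-based comparisons: similar Sharpe ratios, different tail risk.
import numpy as np

rng = np.random.default_rng(0)
n_days, mu, vol = 2520, 0.0004, 0.01          # roughly ten years of daily returns

normal = rng.normal(mu, vol, n_days)
t_raw = rng.standard_t(df=3, size=n_days)
fat = mu + vol * t_raw / t_raw.std(ddof=1)    # rescale to matching volatility

for label, r in [("normal", normal), ("fat-tailed", fat)]:
    sr = np.sqrt(252) * r.mean() / r.std(ddof=1)
    print(f"{label:>10}: Sharpe {sr:+.2f}, worst day {r.min():+.3%}")
```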
See also “Stock Return Model Snooping” and “Taming the Factor Zoo?”.