How can researchers account for the snooping bias that arises from testing multiple strategy alternatives on the same set of data? In the July 2014 version of their paper entitled “Evaluating Trading Strategies”, Campbell Harvey and Yan Liu describe tools that adjust strategy evaluation for multiple testing. They note that conventional thresholds for statistical significance assume a single, independent test. Applying these same thresholds to multiple testing scenarios induces many false discoveries of “good” trading strategies. Evaluation of multiple tests therefore requires more stringent significance thresholds. In effect, such adjustments mean demanding higher Sharpe ratios or, alternatively, applying “haircuts” to computed strategy Sharpe ratios according to the number of strategies tried. They consider two approaches: one that aggressively excludes false discoveries (controlling the family-wise error rate), and another that scales avoidance of false discoveries with the number of strategy alternatives tested (controlling the false discovery rate). Using mathematical derivations and examples, they conclude that:
- Because of ubiquitous multiple testing, a two-sigma requirement (95% confidence level) is no longer a strict enough threshold for evaluating trading strategies.
- Tractable methods of adjusting Sharpe ratios to account for multiple testing are available. A methodology that scales the required confidence level with the number of strategy alternatives tried is most appropriate for evaluating trading strategies.
- Because researchers generally ignore implications of multiple testing:
  - Most empirical findings in finance, whether published in academic journals or actively pursued by investment managers, are likely false.
  - Half the claims of outperformance among financial products currently offered to investors are likely false.
In summary, financial researchers and investors should recognize that testing of multiple strategies on the same data elevates the probability of “discovering” luck, requiring more stringent statistical tests than conventionally applied.
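To make the haircut idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes an annualized Sharpe ratio maps to a t-statistic as Sharpe times the square root of the number of years, uses a normal approximation for two-sided p-values, and applies three standard p-value adjustments: Bonferroni and Holm (which control the family-wise error rate) and Benjamini-Hochberg-Yekutieli (which controls the false discovery rate). All function names and sample Sharpe ratios are hypothetical, and the sketch assumes positive Sharpe ratios.

```python
from math import sqrt
from statistics import NormalDist

_NORMAL = NormalDist()  # standard normal, used as an approximation to the t-distribution


def sharpe_to_p(sharpe_annual, years):
    """Two-sided p-value for an annualized Sharpe ratio (assumption: t = Sharpe * sqrt(years))."""
    t = abs(sharpe_annual) * sqrt(years)
    return 2.0 * _NORMAL.cdf(-t)


def p_to_sharpe(p, years):
    """Invert sharpe_to_p: the annualized Sharpe ratio implied by a two-sided p-value."""
    p = min(max(p, 1e-300), 1.0)  # keep the argument of inv_cdf strictly positive
    t = abs(_NORMAL.inv_cdf(p / 2.0))
    return t / sqrt(years)


def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment: the k-th smallest p-value is scaled by (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted


def bhy(pvals):
    """Benjamini-Hochberg-Yekutieli step-up adjustment (valid under arbitrary dependence)."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        i = order[rank]
        running_min = min(running_min, m * c_m / (rank + 1) * pvals[i])
        adjusted[i] = running_min
    return adjusted


def haircut_sharpes(sharpes, years, adjust):
    """Convert Sharpe ratios to p-values, adjust them jointly, and convert back."""
    pvals = [sharpe_to_p(s, years) for s in sharpes]
    return [p_to_sharpe(p, years) for p in adjust(pvals)]


if __name__ == "__main__":
    tried = [1.0, 0.6, 0.4, 0.3, 0.2]  # hypothetical annualized Sharpe ratios of all strategies tried
    for name, method in [("Bonferroni", bonferroni), ("Holm", holm), ("BHY", bhy)]:
        print(name, [round(s, 2) for s in haircut_sharpes(tried, 10.0, method)])
```

The haircut depends on every strategy tried, not just the one reported, so the list passed in should include discarded alternatives.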
Cautions regarding findings include:
- As noted in the paper, to the degree the return distribution for a strategy is non-normal (exhibits skewness or kurtosis), the Sharpe ratio is not an appropriate evaluation metric.
- As noted in the paper, in actual practice, an investor with a portfolio of strategies needs to examine not just how a new strategy performs on its own, but also how it interacts with the current portfolio of strategies.
- As noted in the paper, correlations of returns within a set of alternative strategies being tried affect the level of snooping bias. The higher the pairwise correlations (the less distinct the alternatives), the lower the snooping bias.
- The methodologies covered in the paper do not account for secondary snooping bias, which derives from starting with findings from other research for which the level of snooping is unknown.
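The correlation caution above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a method from the paper. It assumes every strategy has zero true expected return, that monthly returns are normal, and that a single common factor induces the same pairwise correlation `rho` between all strategies. The function names, sample sizes and seed are hypothetical. The quantity of interest is the best annualized Sharpe ratio found purely by luck, averaged over many simulated research efforts.

```python
import random
from math import sqrt
from statistics import mean, stdev


def max_sharpe_under_null(n_strategies, n_months, rho, rng):
    """Best annualized Sharpe ratio among zero-mean strategies with pairwise correlation rho."""
    common = [rng.gauss(0.0, 1.0) for _ in range(n_months)]  # shared factor
    best = float("-inf")
    for _ in range(n_strategies):
        # One-factor construction: unit variance, correlation rho with every other strategy.
        returns = [sqrt(rho) * f + sqrt(1.0 - rho) * rng.gauss(0.0, 1.0) for f in common]
        sharpe = mean(returns) / stdev(returns) * sqrt(12.0)  # annualize monthly Sharpe
        best = max(best, sharpe)
    return best


def average_max_sharpe(n_strategies, n_months, rho, n_trials, seed=0):
    """Average the lucky-best Sharpe ratio over n_trials independent simulated searches."""
    rng = random.Random(seed)
    results = [max_sharpe_under_null(n_strategies, n_months, rho, rng) for _ in range(n_trials)]
    return mean(results)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, round(average_max_sharpe(20, 60, rho, 100), 2))
```

Under these assumptions, the average lucky-best Sharpe ratio should fall as `rho` rises, because highly correlated alternatives amount to fewer effectively independent tries.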
For closely related papers, see “Taming the Factor Zoo?” and “Navigating the Data Snooping Icebergs”. See also “Chapter 3: Avoiding or Mitigating Snooping Bias”.