Suppose ten stock market timing strategies out of 10,000 beat the market for ten years running. Are they true outperformers, or just lucky? Multiple hypothesis testing methods address that question by controlling for luck. What are these methods, and how should researchers use them? In their November 2019 paper entitled “An Evaluation of Alternative Multiple Testing Methods for Finance Applications”, Campbell Harvey, Yan Liu and Alessio Saretto:
- Address the scope of the multiple testing problem in empirical financial economics.
- Summarize multiple testing methods based on conventional (frequentist) hypothesis testing.
- Simulate performance of different methods across a variety of testing environments.
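The opening question can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration and not as a figure from the paper, that a strategy with no skill has an independent 50% chance of beating the market in any given year; the function name `expected_lucky_streaks` is hypothetical.

```python
def expected_lucky_streaks(n_strategies, n_years, p_beat=0.5):
    """Expected number of zero-skill strategies that beat the market every year.

    Assumes (for illustration only) that each strategy beats the market in any
    year with probability p_beat, independently across years and strategies.
    """
    return n_strategies * p_beat ** n_years


expected = expected_lucky_streaks(10_000, 10)
print(round(expected, 2))  # 9.77
```

Under these assumptions, luck alone would produce about ten ten-year winners out of 10,000 strategies, so observing ten such winners is by itself no evidence of skill.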
The authors' goal is to provide a menu of choices to help researchers improve inference in financial economics. Based on theory and simulations, they conclude that:
- Many published findings in financial economics are likely false due to lack of control for multiple testing (luck). Failure to record tests that produced no interesting results exacerbates the problem.
- The two streams of research most affected by multiple testing are fund evaluation and factor/anomaly testing. For example, with thousands of funds to be tested, some will outperform purely by luck.
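- Conventional methods to control for multiple testing generally raise the statistical significance threshold to suppress false positives, at the cost of more false negatives, by controlling the:
  - Probability of rejecting at least one true null hypothesis (family-wise error rate).
  - Realized proportion of false discoveries (false discovery proportion), typically by bounding the probability that it exceeds a chosen level.
  - Expected proportion of false discoveries (false discovery rate).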
- Bayesian methods address multiple testing via posterior probabilities.
- Relative performance of these methods depends on the data sample, which precludes general guidance.
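To illustrate how raising the significance threshold works in practice, here is a minimal sketch of three widely used textbook corrections: Bonferroni and Holm (family-wise error rate) and Benjamini-Hochberg (false discovery rate). These are offered as familiar examples of the categories above, not necessarily the exact procedures the paper evaluates, and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject test i if p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: controls the family-wise error rate, rejecting at
    least as many hypotheses as Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: controls the false discovery rate
    (guaranteed for independent tests)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest 1-based rank k with p_(k) <= k * alpha / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values from eight strategy tests (not data from the paper).
p_values = [0.001, 0.007, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
naive = [p <= 0.05 for p in p_values]
print(sum(naive), sum(bonferroni(p_values)), sum(holm(p_values)),
      sum(benjamini_hochberg(p_values)))  # 5 1 2 3
```

With these inputs, the unadjusted 5% threshold flags five strategies, while Bonferroni, Holm and Benjamini-Hochberg flag one, two and three, respectively. The stricter the error notion being controlled, the fewer discoveries survive.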
In summary, multiple hypothesis testing is endemic in financial economics, and the research community does a poor job of controlling for it. The best approach for such control is sample-specific.
Some related considerations are:
- Multiple testing control methods are unfamiliar to many investors, who therefore create and rely on countless misleading tests.
- Personal computing and public networks have greatly amplified multiple testing in recent decades.
- As noted in the paper, it is extremely difficult to document the number of tests, recorded and unrecorded, actually performed on a dataset.
For additional detail on methods, see “Methods for Mitigating Data Snooping Bias”.