How does a large sample of stock return anomalies fare in recent replication testing? In their October 2018 paper entitled “Replicating Anomalies”, Kewei Hou, Chen Xue and Lu Zhang attempt to replicate 452 published U.S. stock return anomalies, comprising 57 momentum, 69 value-growth, 38 investment, 79 profitability, 103 intangibles and 106 trading frictions (trading volume, liquidity, market microstructure) anomalies. Compared to the original papers, they use the same sample populations, both original (as early as January 1967) and extended (through 2016) sample periods, and similar methods/variable definitions. They test the effect of limiting the influence of microcaps (stocks in the lowest 20% of market capitalizations) by using NYSE (rather than NYSE-Amex-NASDAQ) size breakpoints and value-weighted returns. They consider an anomaly replication successful if the average high-minus-low tenth (decile) return is significant at the 5% level, translating to a t-statistic of at least 1.96 for pure standalone tests and at least 2.78 under multiple testing (accounting for aggregate data snooping bias). Using required anomaly data and monthly returns for U.S. non-financial stocks during January 1967 through December 2016, they find that:
- With extended sample periods and influence of microcaps limited by using NYSE size breakpoints and value-weighted returns:
  - For the standalone testing threshold (t-statistic at least 1.96):
    - 65% of all anomalies fail replication, including debt-to-market ratio, five-year sales growth, long-term analyst forecasts, dispersion in analyst forecasts, O-score and Z-score, Piotroski’s fundamental score, corporate governance index, earnings persistence, earnings conservatism, accruals quality, total accruals, failure probability and operating profitability.
    - In the trading frictions category, 102 out of 106 anomalies (96%) fail standalone replication tests, including short-term reversal, share turnover, variation in dollar trading volume, absolute return-to-volume, probability of informed trading, liquidity betas, idiosyncratic volatility, total volatility, systematic volatility, number of zero daily trading volume days, maximum daily return, high-low bid-ask spread and tail risk.
  - For the multiple testing threshold (t-statistic at least 2.78), overall anomaly replication failure rate is 82%.
- With extended sample periods and limitations on influence of microcaps removed (NYSE-Amex-NASDAQ size breakpoints and equal-weighted returns), for the standalone testing threshold:
  - Overall anomaly replication failure rate is 52%.
  - Trading frictions anomaly replication failure rate is 74%.
- Average high-minus-low decile returns are much smaller than originally reported for successfully replicated anomalies, including price momentum, cash flow-to-price, operating accruals, earnings momentum, abnormal returns around earnings announcements, analyst forecast revisions and asset growth.
- Repeating tests on the shorter samples of original studies:
  - 65% of all anomalies fail standalone replication tests when limiting influence of microcaps by using NYSE size breakpoints and value-weighted returns.
  - 43% of all anomalies fail standalone replication tests with limitations on influence of microcaps removed (NYSE-Amex-NASDAQ size breakpoints and equal-weighted returns). Failure rate rises to 56% with the multiple testing threshold.
- Value, momentum, investment and profitability anomalies tend to replicate well.
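
To make the replication criterion behind these failure rates concrete, the following is a minimal sketch of the pass/fail check, not the authors' code. It assumes a plain (non-robust) standard error and compares the absolute t-statistic to the two thresholds; the paper may apply adjustments not described in this summary. The function names `high_minus_low_tstat` and `replication_status` are hypothetical.

```python
import math
import statistics

STANDALONE_T = 1.96  # 5% level for a pure standalone test (from the summary)
MULTIPLE_T = 2.78    # 5% level under multiple testing (from the summary)


def high_minus_low_tstat(high_returns, low_returns):
    """t-statistic of the average high-minus-low decile return.

    Assumption: simple standard error of the mean, with no correction
    for autocorrelation or heteroscedasticity.
    """
    spreads = [h - l for h, l in zip(high_returns, low_returns)]
    mean_spread = statistics.fmean(spreads)
    std_err = statistics.stdev(spreads) / math.sqrt(len(spreads))
    return mean_spread / std_err


def replication_status(t_stat):
    """Report whether an anomaly clears each significance threshold.

    Assumption: the absolute value is used so that the sign convention
    of the high-minus-low portfolio does not matter.
    """
    abs_t = abs(t_stat)
    return {
        "standalone": abs_t >= STANDALONE_T,
        "multiple_testing": abs_t >= MULTIPLE_T,
    }
```

Under this sketch, an anomaly with a t-statistic of 2.0 counts as replicated on a standalone basis but fails once the threshold is raised to 2.78, which is the gap between the 65% and 82% failure rates above.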
In summary, evidence indicates that most published U.S. stock market anomalies are not replicable after reasonably demoting microcaps to a very minor role, and especially after raising the threshold for significance to account for data snooping.
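
As an illustration of how microcaps are demoted, the sketch below sorts all stocks into deciles using cutoffs computed from NYSE-listed stocks only, then value-weights returns within each decile. It is a simplified sketch under stated assumptions, not the authors' procedure: the record fields (`signal`, `exchange`, `market_cap`, `ret`) and function names are hypothetical, and `signal` stands in for whatever variable is being sorted on.

```python
import bisect
import statistics
from collections import defaultdict


def nyse_decile_cutoffs(stocks):
    """Nine decile cutoffs computed from NYSE-listed stocks only."""
    nyse_values = [s["signal"] for s in stocks if s["exchange"] == "NYSE"]
    return statistics.quantiles(nyse_values, n=10)


def value_weighted_decile_returns(stocks):
    """Value-weighted return of each decile (1 = lowest, 10 = highest).

    All stocks are assigned to deciles with the NYSE-only cutoffs, so
    numerous small non-NYSE stocks cannot shift the cutoffs, and market
    cap weights keep them from dominating portfolio returns.
    """
    cutoffs = nyse_decile_cutoffs(stocks)
    weighted_sum = defaultdict(float)
    total_cap = defaultdict(float)
    for s in stocks:
        decile = bisect.bisect_right(cutoffs, s["signal"]) + 1
        weighted_sum[decile] += s["market_cap"] * s["ret"]
        total_cap[decile] += s["market_cap"]
    return {d: weighted_sum[d] / total_cap[d] for d in sorted(total_cap)}
```

With equal weights, a single tiny stock with an extreme return can move a decile's average substantially; with value weights its contribution shrinks in proportion to its market capitalization.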
The appendix offers an extensive catalog of published stock return anomalies.
Cautions regarding findings include:
- Reported anomaly returns are gross, not net. Accounting for portfolio reformation costs, shorting costs and shorting constraints would reduce these returns.
- The study focuses on statistical, not economic, significance.
For other relevant perspectives, see: