How should investors interpret findings of statistical significance in academic studies of financial markets? In the March 2014 draft of their paper entitled “Significance Testing in Empirical Finance: A Critical Review and Assessment”, Jae Kim and Philip Ji review significance testing in recent research on financial markets. They focus on the interplay of two error probabilities: (1) the probability of a Type I error (rejecting a true null hypothesis), with the significance threshold usually set at 1%, 5% (most used) or 10%; and (2) the probability of a Type II error (accepting a false null hypothesis). They consider the losses associated with the choice of significance threshold, and they assess the Bayesian method of inference as an alternative to the more widely used frequentist method that underlies conventional significance testing. Based on a review of past criticisms of conventional significance testing and of 161 studies applying linear regression recently published in four top-tier finance journals, they conclude that:
- Use of massive samples, without attention to the balance between Type I and Type II errors, tends to overstate statistical significance (see the first sketch following this list).
- When the sample is massive (small), researchers should tighten (relax) the threshold for statistical significance. For example, thresholds of 0.5% or 0.1% suit massive samples better than the conventional 1%, 5% and 10% thresholds. One way to optimize the threshold is to balance explicitly the probabilities of Type I and Type II errors (see the second sketch following this list). Another is to minimize the expected economic loss from decision errors at each candidate threshold.
- Alternatively, researchers could use the Bayesian method of inference, which effectively tightens the significance threshold as the sample grows (see the third sketch following this list). Only 32% of the published studies (all frequentist) offer strong evidence of an effect/anomaly under Bayesian analysis; among the subset of these studies with sample size under 1,000, 61% survive Bayesian analysis.
- Leading financial journals appear to be biased in favor of studies that demonstrate conventional levels of statistical significance. Such biased feedback may drive researchers toward massive samples that make finding conventional significance easier.
- Replication of published studies appears problematic. Of email requests to 50 authors: four responded with data and enough elaboration to facilitate replication; 12 declined to share their data (for reasons such as copyright and confidentiality); and 34 did not respond.
- Simple visual representations of the data sometimes suggest predictions that outperform those based on the R-squared statistic or the t-statistic.
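To illustrate the first conclusion above, here is a minimal Python sketch showing how a fixed, economically trivial effect becomes ever more “significant” as the sample grows. The effect size of 0.01 and unit volatility are illustrative choices, not values from the paper:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
effect = 0.01  # an economically trivial true mean effect

# As n grows, the same tiny effect drives the p-value toward zero.
for n in [1_000, 100_000, 10_000_000]:
    sample = rng.normal(loc=effect, scale=1.0, size=n)
    t, p = ttest_1samp(sample, popmean=0.0)
    print(f"n = {n:>10,}: t = {t:6.2f}, p = {p:.2e}")
```

Holding the effect fixed, the t-statistic grows roughly with the square root of the sample size, so any nonzero effect, however small, eventually clears any fixed significance threshold.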
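The threshold-balancing idea in the third bullet can be sketched the same way. The function below is a simplified stand-in for whatever procedure the authors use: it assumes a one-sided z-test with known variance, and `optimal_alpha`, its grid search and its default equal loss weights are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def optimal_alpha(effect, sigma, n, loss_type1=1.0, loss_type2=1.0):
    """Significance level minimizing the weighted sum of Type I and
    Type II error probabilities for a one-sided z-test of H0: mu = 0
    versus H1: mu = effect > 0."""
    alphas = np.geomspace(1e-10, 0.5, 5000)       # candidate thresholds
    z_crit = norm.ppf(1 - alphas)                 # critical value per threshold
    beta = norm.cdf(z_crit - effect * np.sqrt(n) / sigma)  # Type II error prob.
    expected_loss = loss_type1 * alphas + loss_type2 * beta
    return alphas[np.argmin(expected_loss)]

# The loss-minimizing threshold relaxes for small samples and tightens
# sharply for massive ones (for huge n it is effectively zero, floored
# here by the bottom of the search grid), as the paper recommends:
for n in [100, 1_000, 10_000, 1_000_000]:
    print(f"n = {n:>9,}: alpha* = {optimal_alpha(0.05, 1.0, n):.2e}")
```

With equal losses, the optimum sits where the two error densities cross; changing the loss weights shifts the threshold accordingly, which is the hook for the economic-loss version of the idea.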
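For the Bayesian alternative in the fourth bullet, the sketch below uses a common large-sample (BIC-based) approximation to the Bayes factor (see Kass and Raftery, 1995) rather than the exact posterior-odds calculation in the paper; `posterior_prob_h0` and the even prior odds are illustrative assumptions:

```python
import numpy as np

def posterior_prob_h0(t_stat, n, prior_h0=0.5):
    """Posterior probability that a regression coefficient is zero,
    via the BIC approximation to the Bayes factor,
    BF01 ~ sqrt(n) * exp(-t^2 / 2)."""
    bf01 = np.sqrt(n) * np.exp(-t_stat**2 / 2)   # evidence for H0 over H1
    posterior_odds = bf01 * prior_h0 / (1 - prior_h0)
    return posterior_odds / (1 + posterior_odds)

# A t-statistic of 2.0 clears the 5% frequentist bar at any sample size,
# but its Bayesian evidence against 'no effect' weakens as n grows:
for n in [100, 10_000, 1_000_000]:
    print(f"n = {n:>9,}: P(H0 | data) = {posterior_prob_h0(2.0, n):.3f}")
```

Holding the t-statistic at a conventionally significant 2.0, this approximation puts the posterior probability of “no effect” at about 0.58 for n = 100 but about 0.99 for n = 1,000,000, which is why Bayesian inference effectively tightens the significance threshold in massive samples.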
In summary, significance testing as practiced in formal (published) finance research is conducive to spurious findings of statistical significance and to economically incorrect decisions.
Cautions regarding conclusions include:
- The paper does not address data snooping bias and its implications for significance testing.
- The paper does not address the failure of many formal studies to account for implementation frictions, which are essential for assessing economic significance.