Are the human choices in studies that apply machine learning models to forecast stock returns critical to findings? In other words, is there a confounding zoo of machine learning design choices? In their November 2024 paper entitled “Design Choices, Machine Learning, and the Cross-section of Stock Returns”, Minghui Chen, Matthias Hanauer and Tobias Kalsbach analyze effects of varying seven key machine learning design choices: (1) machine learning model used, (2) target variable/evaluation metric, (3) target variable transformation (continuous or discrete dummy), (4) whether or not to use anomaly inputs from pre-publication subperiods, (5) whether to compress correlated features, (6) whether to use a rolling or expanding training window and (7) whether to include micro stocks in the training sample. They examine all possible combinations of these choices, resulting in 1,056 machine learning models (see the enumeration sketch after the list below). For each machine learning model each month, they:
- Rank stocks on each of 207 potential return predictors and map rankings into the interval [-1, 1], setting the ranking value for missing inputs to 0 (see the rank-mapping sketch below).
- Apply rankings to predict a next-month target variable (return in excess of the risk-free rate, market-adjusted return or 1-factor model risk-adjusted return) for each stock with market capitalization above the 20th percentile of NYSE market capitalizations during January 1987 through December 2021.
- Re-form a hedge portfolio that is long (short) the value-weighted tenth, or decile, of stocks with the highest (lowest) predicted target variable and compute its next-month return (see the portfolio sketch below).
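
To make the combinatorics concrete, here is a minimal sketch of enumerating such a design grid with Python's itertools.product. The option menus below are illustrative placeholders, not the paper's exact choices (the authors' actual grid yields 1,056 variants):

```python
from itertools import product

# Illustrative option menus for the seven design choices; the paper's
# actual menus differ (their full grid produces 1,056 model variants).
design_choices = {
    "model": ["ols", "elastic_net", "random_forest", "gbrt", "neural_net"],
    "target": ["excess_return", "market_adjusted", "capm_adjusted"],
    "target_transform": ["continuous", "dummy"],
    "use_pre_publication_anomalies": [True, False],
    "compress_correlated_features": [True, False],
    "training_window": ["rolling", "expanding"],
    "include_microcaps": [True, False],
}

# Enumerate every combination of the seven choices into one model spec each.
grid = [dict(zip(design_choices, combo)) for combo in product(*design_choices.values())]
print(len(grid))  # number of model variants under these illustrative menus
```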
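For the ranking step, a minimal sketch of the cross-sectional rank mapping, assuming a pandas DataFrame of raw predictor values with one row per stock (column names and data are hypothetical):

```python
import numpy as np
import pandas as pd

def rank_to_unit_interval(signals: pd.DataFrame) -> pd.DataFrame:
    """Map each predictor's cross-sectional ranks into [-1, 1]; missing -> 0."""
    r = signals.rank()                            # ranks 1..n per column; NaN stays NaN
    n = r.count()                                 # non-missing count per column
    scaled = (r - 1.0) / (n - 1.0) * 2.0 - 1.0    # lowest -> -1, highest -> +1
    return scaled.fillna(0.0)                     # missing inputs map to the midpoint 0

# Example: three stocks, two predictors, one missing value.
raw = pd.DataFrame(
    {"momentum": [0.12, -0.05, 0.30], "size": [9.1, np.nan, 7.4]},
    index=["AAA", "BBB", "CCC"],
)
print(rank_to_unit_interval(raw))
```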
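For the portfolio step, a minimal sketch of the value-weighted decile hedge return, assuming pandas Series of predicted targets, realized next-month returns and market capitalizations aligned on the same stock index (all names and data are hypothetical):

```python
import numpy as np
import pandas as pd

def decile_hedge_return(pred: pd.Series, next_ret: pd.Series, mcap: pd.Series) -> float:
    """Return of long top-decile / short bottom-decile, value-weighted within deciles."""
    decile = pd.qcut(pred, 10, labels=False)      # 0 = lowest predicted target, 9 = highest

    def value_weighted(mask: pd.Series) -> float:
        w = mcap[mask] / mcap[mask].sum()         # market-cap weights within the decile
        return float((w * next_ret[mask]).sum())

    return value_weighted(decile == 9) - value_weighted(decile == 0)

# Example with random data for 100 hypothetical stocks.
rng = np.random.default_rng(0)
idx = pd.Index([f"S{i:03d}" for i in range(100)])
pred = pd.Series(rng.normal(size=100), index=idx)
next_ret = pd.Series(rng.normal(0.01, 0.05, size=100), index=idx)
mcap = pd.Series(rng.lognormal(mean=8, sigma=1, size=100), index=idx)
print(decile_hedge_return(pred, next_ret, mcap))
```

In the paper's setting, this computation repeats each month over the universe of stocks above the NYSE size threshold.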
Using monthly data as available for all listed U.S. common stocks during January 1957 through December 2021, they find that: