Lookahead Bias in Large Language Model Training Data
April 26, 2024 - Investing Expertise
Can large language models (LLMs) inject lookahead bias into backtests when the generation of LLM training samples lacks rigor? In their preliminary and incomplete March 2024 paper entitled “Lookahead Bias in Pretrained Language Models”, Suproteem Sarkar and Keyon Vafa examine the potential for lookahead bias in backtests that use the Llama-2 LLM to identify future firm risks based on the content of earnings calls. They consider two cases: (1) the backtest falls within the LLM training sample, but the researcher tells the LLM to consider only information from before the test period; and (2) the researcher specifies a training sample that ends before the backtest, but generates that sample long after its end date. Using Llama-2 to interpret transcripts of selected firm earnings calls from 2018, they find that: