I’ve never seen a bad backtest

Dimitris Melas

Whether you are a seasoned trader or a novice, backtesting should be at the heart of your strategy research. Having said that, the validity of a backtest depends not only on the assumptions used but also on the practices followed by the researcher. These practices can introduce mistakes, some of which are subtle enough to pass unnoticed.

One of these errors is overfitting, and in the following article, I will describe what it is, how to identify it, and what you can do to avoid introducing it in the first place.

Although quoting a dictionary is a highly worn-out resource for writers, I am an algorithmic developer and not a novelist. As such, I can indulge in quoting as many dictionaries as I please, and the following definition is one I liked a lot:

“[Overfitting is] the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict observations reliably”

Oxford English Dictionary

In other words, overfitting consists of tuning a set of parameters to replicate a given dataset as closely as possible, with complete disregard for how the model will perform on new data. In trading, overfitting consists of identifying a false pattern in a dataset by constantly tweaking the parameters of a model, thus introducing hindsight bias.

An Example of Overfitting

Overfitting applies not only to financial markets but to any type of statistical estimation. Consider the following example of the weights and heights of a small sample of individuals:

As common sense suggests, taller individuals weigh more on average. Other variables also have an impact on weight, and adding them as explanatory variables could increase the precision of our model. For simplicity, we are going to exclude them.

The green line represents a linear regression, which estimates the average increase in weight, in kilograms, for each additional centimeter of height. The red curve, on the other hand, is a polynomial regression of order 4, whose interpretation is not as straightforward.

As you can see, the red curve is closer than the green line to each data point, but that increased precision would not carry over to new data. If we had to estimate the weight of an individual with a height of 172 cm, the green line would most probably outperform the red curve. The polynomial regression would forecast a weight of roughly 115 kilograms, which is greater than the estimate it would return for an even taller person. This is, at the very least, counterintuitive.

If I had to guess, I would bet that you’ve encountered a few humans in your life, and your pattern-generating brain correctly knows that the weight of humans does not decrease (on average) between 173 cm and 183 cm. You probably also know that it does not increase in a perfectly linear way, but it definitely increases.
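The same effect can be reproduced with a few lines of NumPy. The sketch below is only an illustration: the heights and weights are made-up numbers (not the data behind the chart above), and the leave-one-out check is a technique I am adding here to stand in for "new information". It fits a straight line and an order-4 polynomial, then compares the error on the points used for fitting with the error on points that were held out.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical sample, NOT the data behind the article's chart.
heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 190.0])
weights_kg = np.array([58.0, 66.0, 63.0, 79.0, 74.0, 88.0, 85.0])


def in_sample_sse(degree):
    """Sum of squared errors on the same points used for fitting."""
    model = Polynomial.fit(heights_cm, weights_kg, deg=degree)
    return float(np.sum((model(heights_cm) - weights_kg) ** 2))


def leave_one_out_sse(degree):
    """Refit without each point in turn, then predict the point left out."""
    errors = []
    for i in range(len(heights_cm)):
        keep = np.arange(len(heights_cm)) != i
        model = Polynomial.fit(heights_cm[keep], weights_kg[keep], deg=degree)
        errors.append(model(heights_cm[i]) - weights_kg[i])
    return float(np.sum(np.square(errors)))


for degree in (1, 4):
    print(
        f"degree {degree}: "
        f"in-sample SSE = {in_sample_sse(degree):.1f}, "
        f"leave-one-out SSE = {leave_one_out_sse(degree):.1f}"
    )
```

The order-4 fit can never have a larger in-sample error than the straight line, because the line is a special case of it. What matters is how much the error grows once each point has to be predicted without having been seen.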

Example of an Overfitted Trading Strategy

Identifying whether a strategy has been overfitted is oftentimes very straightforward. Consider the following results for different parameters of a moving average crossover strategy:

As you can see, the performance of this strategy is highly dependent on the parameters chosen. This is oftentimes not only acceptable but also impossible to avoid. It would be unreasonable to expect the strategy to remain profitable regardless of the weights assigned to each variable or the values of each parameter.

Having said that, there is an apparent problem with this strategy, and that is its sensitivity to small changes in the parameters. Changing N1 from 50 to 49 or 51 (and N2 from 75 to 74 or 76) turns a seemingly outstanding strategy into a strikingly bad one. It is highly undesirable for a strategy's performance to change significantly in response to very small changes in its parameters. This is often a symptom of overfitting.

It should be stated that these results are hypothetical and not based on actual calculations. Regardless, the intuition behind the example still transfers to a real-life situation.

As this scenario shows, it is desirable to have a strategy with stable parameters. In other words, a strategy whose performance barely changes under marginal changes in the parameters can be seen as more robust and reliable.

Now that we’ve gone through the essential characteristics of overfitting, we can also relate it to the statistical concept of p-hacking.

P-hacking is the incorrect practice of performing lots of statistical tests, keeping the statistically significant results, and discarding the rest as if you’ve never tested them. The problem with doing so is that most statistical significance metrics, including the p-value, have the underlying assumption that only one test was performed.

You might have heard about alpha, the significance level of a statistical test. It is the probability of getting a “significant” result by sheer randomness or luck when there is no real effect to find. A value of 0.05 is frequently chosen, which means that we accept a 5% chance of a false positive on each test. This is fine if you do the test only once, but if you perform 100 tests on strategies with no real edge, you will get, on average, 5 results that pass the significance test (by definition).
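The arithmetic is easy to check. The sketch below assumes that the tests are independent and that none of the tested strategies has a real edge, which are simplifications of my own. The Bonferroni adjustment on the last lines is one common correction and is not something this article relies on elsewhere.

```python
def multiple_testing_summary(alpha, n_tests):
    """False-positive arithmetic for n independent tests with no real effect."""
    expected_false_positives = alpha * n_tests
    # Probability that at least one test passes purely by luck.
    p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
    # Bonferroni: per-test threshold that keeps the overall error rate near alpha.
    bonferroni_alpha = alpha / n_tests
    return expected_false_positives, p_at_least_one, bonferroni_alpha


expected, p_any, adjusted = multiple_testing_summary(alpha=0.05, n_tests=100)
print(f"expected false positives: {expected:.1f}")
print(f"probability of at least one false positive: {p_any:.3f}")
print(f"Bonferroni-adjusted alpha: {adjusted:.4f}")
```

With 100 tests, finding at least one "significant" result by luck alone is close to certain under these assumptions.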

P-hacking is not only relevant for scientists in lab coats but also for retail traders. If you keep trying out new strategies, it is only a matter of time before you come up with a set of technical indicators that performs outstandingly in backtests. Most traders forget (or ignore) that they discarded a few dozen strategies before finding the seemingly perfect one, and they do not account for the high probability that this strategy performed well during the backtesting period purely by luck.
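A small simulation makes the point concrete. In this sketch every "strategy" is pure noise: daily returns are drawn from a normal distribution with a mean of zero, so none of them has any edge. The number of strategies, the volatility, and the 252-day year are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_strategies, n_days = 50, 252


def annualized_sharpe(daily_returns):
    """Annualized Sharpe ratio per row, assuming 252 trading days per year."""
    mean = daily_returns.mean(axis=1)
    std = daily_returns.std(axis=1, ddof=1)
    return np.sqrt(252) * mean / std


# Backtest period: 50 strategies whose true expected return is zero.
backtest_returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
backtest_sharpes = annualized_sharpe(backtest_returns)
best = int(np.argmax(backtest_sharpes))

# The following year: fresh noise for the very same strategies.
forward_returns = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
forward_sharpes = annualized_sharpe(forward_returns)

print(f"best backtest Sharpe: {backtest_sharpes[best]:.2f}")
print(f"median backtest Sharpe: {np.median(backtest_sharpes):.2f}")
print(f"same strategy, following year: {forward_sharpes[best]:.2f}")
```

The winner of the backtest looks impressive only because it was selected after the fact. Its result in the following year is just another random draw.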

Overfitting is also a problem in academia because the assumption that researchers follow the rules of classical statistics (randomization and unbiased reporting) is at odds with the individual incentives caused by publication bias.

Publication bias is the tendency to accept and publish only results that are significant. In economics and finance, this is especially ironic since both fields study incentives in depth but fail to account for the impact they have on their own academic literature. For example, Hou et al. (2018) tested 452 reported anomalies in the financial markets and were only able to replicate 15% of them.

If this is a problem in academia, imagine how rampant it is on YouTube, TikTok, Instagram, or Reddit. On social media, there is a clear monetary incentive to exaggerate trading results. Additionally, the reach of a trading influencer is based on the number of followers and not on their track record. Consequently, the content featured on each platform is not the most accurate but the most entertaining and engaging.

How to avoid overfitting

If we want to avoid overfitting, and we really should want to avoid it, we need to do two things: test the strategy with out-of-sample data and choose a set of parameters that is stable.

Out-of-sample testing:

Out-of-sample testing consists of testing the strategy with a different dataset than the one used for “training” (tuning) its parameters. If we were to set the parameters of a moving average in such a way that maximizes the Sharpe Ratio during 2022, we should use these parameters and calculate its performance during 2023. If the performance metrics are similar in both years, we can increase our confidence in the strategy. Conversely, if the performance does not replicate, it might be an indicator of overfitting.
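As a minimal sketch of this procedure, the code below tunes a moving average crossover on one year of data and then evaluates the chosen parameters on the next. Everything in it is an assumption made for illustration: the prices are a synthetic random walk, the strategy is long when the fast average is above the slow one and flat otherwise, and the parameter grid and helper names are hypothetical. The first 252 days serve only as warm-up history for the moving averages.

```python
import numpy as np


def moving_average(prices, window):
    """Trailing simple moving average; the first window - 1 values are NaN."""
    out = np.full(prices.shape, np.nan)
    csum = np.cumsum(np.insert(prices, 0, 0.0))
    out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out


def crossover_returns(prices, fast, slow):
    """Daily returns of a long/flat crossover (long when fast MA > slow MA)."""
    signal = (moving_average(prices, fast) > moving_average(prices, slow)).astype(float)
    daily = np.diff(prices) / prices[:-1]
    # The signal from day t earns the return from t to t + 1 (no peeking ahead).
    return signal[:-1] * daily


def sharpe(returns):
    std = returns.std(ddof=1)
    return 0.0 if std == 0 else float(np.sqrt(252) * returns.mean() / std)


# Synthetic prices: one warm-up year, one training year, one test year.
rng = np.random.default_rng(seed=0)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, size=756)))
train_start, train_end, test_end = 252, 504, 755

param_grid = [(f, s) for f in range(5, 55, 5) for s in range(20, 120, 20) if f < s]


def evaluate(params, start, end):
    return sharpe(crossover_returns(prices, *params)[start:end])


# Tune on the training year only, then score the SAME parameters on the test year.
best_params = max(param_grid, key=lambda p: evaluate(p, train_start, train_end))
in_sample = evaluate(best_params, train_start, train_end)
out_of_sample = evaluate(best_params, train_end, test_end)

print(f"best (fast, slow): {best_params}")
print(f"in-sample Sharpe: {in_sample:.2f}, out-of-sample Sharpe: {out_of_sample:.2f}")
```

Because the prices are random, any edge the optimizer finds in the training year is spurious, and a large gap between the two Sharpe Ratios is exactly the warning sign described above.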

While doing this, you will sooner or later be tempted to go back and forth between the training and test sets until you arrive at a set of parameters that performs well in the test set. If you carefully consider this approach, you’ll realize that it is similar to overfitting but with extra steps.

If you try to “cheat” these safety measures, you’ll eventually face reality during live trading. Needless to say, discarding bad strategies during the backtesting phase is, to put it mildly, more cost-efficient.

Another problem with out-of-sample testing is that we only have one test, which is not enough evidence to conclusively discard a trading strategy or ship it to production. This is why practitioners use rolling windows, which can be represented graphically as follows:

In this hypothetical strategy, we use the 6 previous years to train the model and the following year to test it. We then iterate over the entire period by shifting the dataset forward. This exercise can be performed 7 times if we choose a yearly shift between each backtest.

This method is commonly known as walk-forward optimization and is an extremely useful extension of out-of-sample testing. Whenever possible, you should use this approach.
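The window bookkeeping can be sketched in a few lines. The function name and the 2010 to 2022 date range below are hypothetical, chosen only because 13 years of data produce the 7 backtests mentioned above when training on 6 years, testing on 1, and shifting by 1 year.

```python
def walk_forward_windows(first_year, last_year, train_years=6, test_years=1, step=1):
    """Return (train, test) lists of years for each walk-forward iteration."""
    windows = []
    start = first_year
    # Stop once the test block would run past the last available year.
    while start + train_years + test_years - 1 <= last_year:
        train = list(range(start, start + train_years))
        test = list(range(start + train_years, start + train_years + test_years))
        windows.append((train, test))
        start += step
    return windows


windows = walk_forward_windows(first_year=2010, last_year=2022)
for train, test in windows:
    print(f"train {train[0]}-{train[-1]} -> test {test[0]}")
```

In each iteration, the parameters would be tuned on the training years only and then scored on the test year. The out-of-sample results from all iterations are then judged together.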

Test for Parameter Stability

Another symptom of parameter overfitting is when the performance of a strategy is drastically affected by a marginal (slight) change in the parameter values. Continuing with the simple moving average crossover strategy used before, consider the following heatmap:

This chart shows the Sharpe Ratio achieved by the strategy under different lookback periods for the moving averages.

If we were to ask two traders which set of parameters we should choose, they might give the following recommendations:

1. Set the parameters at (60; 80)
2. Set the parameters at (10; 100)

The first trader seeks to globally maximize the Sharpe Ratio of the backtesting period, expecting the strategy to replicate that performance in the future. They ignore that slight changes in the parameters sharply reduce the result, and that some similar parameters even yield negative results.

The second trader, on the other hand, being skilled in the art of detecting overfitting, chooses a set of parameters that is more stable to marginal changes. Additionally, since the local maximum is achieved at the maximum available value for N1, it might be worth exploring further increases in the parameter.
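One way to put the second trader's reasoning into numbers is to score each cell of the heatmap by the average Sharpe Ratio of its neighborhood rather than by its own value. The sketch below is an assumption of mine, not a standard recipe: the grid is made up (it is not the heatmap above), the lookback values are hypothetical, and cells on the border are skipped because they lack a full set of neighbors. That last choice means an edge optimum like the second trader's would need a wider grid before it could be scored this way.

```python
import numpy as np

# Hypothetical lookbacks and Sharpe Ratios, NOT the article's heatmap.
fast_values = [10, 20, 30, 40, 50]    # rows
slow_values = [60, 70, 80, 90, 100]   # columns
sharpe_grid = np.array([
    [0.8,  1.0,  0.8,  0.2,  0.1],
    [1.0,  1.2,  1.0,  0.0, -0.3],
    [0.8,  1.0,  0.8, -0.4,  0.1],
    [0.2,  0.1, -0.2,  2.0, -0.5],
    [0.0, -0.1,  0.1, -0.3,  0.2],
])


def neighborhood_mean(grid, radius=1):
    """Mean of each cell and its neighbors; border cells are left as NaN."""
    scores = np.full(grid.shape, np.nan)
    rows, cols = grid.shape
    for i in range(radius, rows - radius):
        for j in range(radius, cols - radius):
            block = grid[i - radius:i + radius + 1, j - radius:j + radius + 1]
            scores[i, j] = block.mean()
    return scores


def argmax_2d(values):
    """Row and column of the largest non-NaN entry, as plain ints."""
    i, j = np.unravel_index(np.nanargmax(values), values.shape)
    return int(i), int(j)


scores = neighborhood_mean(sharpe_grid)
peak = argmax_2d(sharpe_grid)   # what the first trader would pick
stable = argmax_2d(scores)      # what the second trader would pick

print(f"highest single Sharpe: fast={fast_values[peak[0]]}, slow={slow_values[peak[1]]}")
print(f"most stable region:    fast={fast_values[stable[0]]}, slow={slow_values[stable[1]]}")
```

In this made-up grid, the isolated peak of 2.0 is surrounded by poor results, so its neighborhood average is low. The plateau around 1.2 wins once the neighbors are taken into account.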

Why you should always avoid in-sample backtests

In short, in-sample backtesting introduces look-ahead bias. This common mistake consists of using data that would not yet have been available at the time of the decision in a real-life scenario.

We should always try to do backtests in a manner that is as close as possible to a real live environment. In-sample backtests do not follow this principle since we simulate an optimally chosen strategy based on future data. The strategy’s performance over the entire period is only known at the end of the period, but we are using it for tuning the parameters at the very start. In live trading, this information is, by definition, not available.

To drive this point further home, imagine a finite number of alternative realities in which everything is the same except for the parameters we chose for our strategy. If we know all possible outcomes, we will choose the optimal parameters with complete certainty, whereas, in reality, we will only have a 1/N probability of having chosen that specific set of values.

I once read an opinionated post that stated that a faulty backtest is better than no backtest at all. Although there is some merit to that statement, and I understand what the author meant, it is dangerous if we take it at face value without considering its consequences.

Novice traders most probably cannot differentiate a good backtest from a bad one and will probably run the latter. Such a test does not add any statistically valid information about the profitability of a strategy but can lead to overconfidence in it.

Following this argument, backtesting should always be done, but only after understanding the underlying assumptions made. As you can see in this other article I wrote on the topic, the considerations are not complex in nature but extremely important.
