When it comes to machine learning and statistics in general, or trading in particular, accuracy is one of the most widely used and valuable metrics. That said, relying on it in isolation is not always enough, since it can be misleading depending on the phenomenon being measured.
In this article, I will detail cases in which using accuracy as the only metric should be discouraged.
Although the definition of accuracy is the same in statistics in general and in trading in particular, I will treat the two separately in order to make this resource useful both for people interested in data science and for folks getting started in the financial markets.
What is accuracy?
Accuracy is defined as the number of correct classifications made by a classifier divided by the total number of classifications made. The formula looks as follows:
Accuracy = #Correct_Classifications / #Total_Classifications
This metric is most useful when dealing with balanced datasets. In other words, accuracy serves as a good metric when we have roughly the same number of samples for each class.
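As a minimal sketch of the formula above, the following Python function counts matching labels and divides by the total. The function name and the sample labels are illustrative assumptions, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for a balanced two-class problem (3 samples per class).
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# 4 of the 6 predictions match, so the accuracy is 4/6.
print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
```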
Why is accuracy not a good measure in statistics and machine learning?
Before moving forward, I have to state that accuracy is, in fact, an excellent measure. Having said that, like any other measure, it condenses information in order to analyze some specific aspect or dimension at the expense of losing potentially valuable information.
To see why, consider a classification problem where we want to guess the IQ of individuals based on a set of answers or attributes. As you may or may not know, IQ is measured and calibrated in such a way that the median person (50th percentile) scores exactly 100 on the test. Additionally, scores follow a normal distribution with a standard deviation of 15 points. This means that, for example, 97.7% of the population has an IQ below 130, which sits two standard deviations above the mean.
For whatever reason, we might be interested in finding people with a high IQ score, which we will define as 130 or above. In order to do so, we are given a dataset of people, some attributes, and their IQ scores with the objective of training a simple classifier.
You might notice that we are dealing with a highly imbalanced dataset:
- 97.7% of our dataset has an IQ lower than 130
- 2.3% of our dataset has an IQ equal to or greater than 130
Given this scenario, a completely useless but highly accurate model would classify every single person as not having a high IQ and would thus achieve an impressive accuracy of 97.7%. Although accurate, the model would be of no use in detecting high IQ individuals, since none of them would ever be classified as such.
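The always "False" model can be reproduced in a few lines. This is only a sketch: the population of 1,000 people is a hypothetical sample built to match the proportions above, not real data.

```python
# Hypothetical population of 1,000 people matching the proportions above:
# 977 with an IQ below 130 (label False) and 23 at or above 130 (label True).
y_true = [False] * 977 + [True] * 23

# The "useless" model: it never predicts a high IQ.
y_pred = [False] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many high IQ individuals did the model actually find?
found_high_iq = sum(t and p for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.1%}")
print(f"high IQ individuals found: {found_high_iq}")
```

The accuracy comes out at 97.7%, yet the count of high IQ individuals found is zero.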
The previous example outlined how a deterministic (always “False”) model can achieve high accuracy values, but the same can be achieved by creating a random classifier. If we want to classify people into the same two categories as previously, we can create a classifier that randomly guesses as follows:
Classification[Sample] = Guess[(IQ>130 with p = 0.05; IQ<130 with p = 0.95)]
Although it randomly guesses the IQ of a given person, the classifier takes the population probabilities into account and skews the guesses accordingly (although not exactly, since it guesses a high IQ 5% of the time rather than 2.3%). The accuracy of this model would be the following:
Accuracy = 0.95 * 0.977 + 0.05 * 0.023 = 92.93%
This second classifier has the apparent advantage of sometimes spotting a high IQ individual, whereas the first model, by definition, is not able to do so. However, the second model deserves no credit for those correct identifications, since it only achieves them by pure luck.
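The expected accuracy of the random guesser can be checked with a short calculation. The sketch below assumes the guesses are independent of the sample being classified; the function and variable names are illustrative.

```python
def expected_accuracy(p_high, q_high):
    """Expected accuracy of a guesser that answers "high" with probability
    q_high, independently of the sample, when a fraction p_high of the
    population is truly "high"."""
    return q_high * p_high + (1 - q_high) * (1 - p_high)


p_high = 0.023  # share of the population with a high IQ
q_high = 0.05   # probability that the classifier guesses "high"

acc = expected_accuracy(p_high, q_high)
print(f"expected accuracy: {acc:.2%}")  # expected accuracy: 92.93%

# Because the guesses ignore the input, the classifier is expected to find
# only q_high of the truly high IQ individuals, and only p_high of the people
# it flags are expected to really have a high IQ.
print(f"expected share of high IQ individuals found: {q_high:.1%}")
print(f"expected share of flagged people who are high IQ: {p_high:.1%}")
```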
Why is accuracy not a good measure in trading?
It is common to read or hear about trading strategies and their respective accuracies, and although it is a relevant metric for analyzing backtests and trading logs, it is often a very misleading piece of information.
As I’ll show in the following paragraphs, it is extremely simple to build a highly accurate trading strategy, although probably not a very profitable one. As a rule of thumb, you should always be wary of people who talk only about the accuracy of their trading signals.
Say that we buy a given asset for $100 with a take profit (TP) at $101 and a stop loss (SL) at $91. You might notice that the distance between the current price and each trigger is not equal, which makes it much more probable that we exit the position via the take profit. Without being very precise, let’s assume that the probability of the price reaching the TP before the SL is roughly 90%.
If we were to measure the accuracy of a strategy by whether the PnL of each trade is positive or negative, which is the usual convention, this would translate to an accuracy of 90%.
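The 90% figure can be sanity-checked under one simplifying assumption that is mine, not a market fact: the price follows a driftless random walk that moves up or down by $1 with equal probability. In that case the chance of touching the take profit first equals the distance to the stop loss divided by the total distance between the two triggers. The function names and the simulation parameters below are illustrative.

```python
import random


def prob_take_profit_first(entry, take_profit, stop_loss):
    """Probability of hitting the take profit before the stop loss for a
    driftless symmetric random walk (the classic gambler's ruin result)."""
    return (entry - stop_loss) / (take_profit - stop_loss)


def simulate(entry, take_profit, stop_loss, trials=5000, step=1.0, seed=0):
    """Monte Carlo estimate of the same probability with +/- `step` moves."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        price = float(entry)
        # Walk until one of the two triggers is touched.
        while stop_loss < price < take_profit:
            price += step if rng.random() < 0.5 else -step
        if price >= take_profit:
            wins += 1
    return wins / trials


analytic = prob_take_profit_first(100, 101, 91)
estimate = simulate(100, 101, 91)
print(f"analytic:  {analytic:.1%}")  # analytic:  90.0%
print(f"simulated: {estimate:.1%}")
```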
Of course, traders should care more about profitability and less about accuracy, so let’s go ahead and calculate the expected return of each trade:
E[Return] = 10% * ($91 - $100) + 90% * ($101 - $100)
E[Return] = -$0.9 + $0.9 = $0
As you can see, despite the outstanding accuracy of this strategy, it does not provide, on average, positive returns. The previous exercise assumes that there is an equal probability of the stock going either up or down by some amount.
Not only does the highly accurate strategy have zero expected return, but if we incorporate trading fees and slippage into our calculations, which we should always do, the expected return would be rendered negative.
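The same calculation, including a cost term, can be sketched as follows. The $0.05 round-trip cost is a hypothetical figure chosen only to illustrate the effect of fees and slippage.

```python
def expected_pnl(entry, take_profit, stop_loss, p_win, cost_per_trade=0.0):
    """Expected profit per trade, in dollars, for a trade that exits either
    at the take profit (probability p_win) or at the stop loss."""
    win = take_profit - entry   # +$1 in the example
    loss = stop_loss - entry    # -$9 in the example
    return p_win * win + (1 - p_win) * loss - cost_per_trade


no_costs = expected_pnl(100, 101, 91, p_win=0.9)
with_costs = expected_pnl(100, 101, 91, p_win=0.9, cost_per_trade=0.05)

# Zero (up to floating-point rounding) without costs, negative with them.
print(f"expected PnL without costs: {abs(round(no_costs, 6)):.2f}")
print(f"expected PnL with costs:    {with_costs:.2f}")
```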
These types of strategies are very appealing to newcomers due to their skewed return profile. Since most trades close with a profit, albeit a small one, they give the misleading impression of being profitable overall. Of course, these small profits are wiped out by the less probable but much larger losses incurred when the price reaches the stop loss.
Beginner traders have the erroneous belief that these unfavorable outcomes are due to bad luck and that the strategy overall is profitable.
This example is quite similar to the one given in the previous section regarding people and their IQ points. The two are analogous in that one outcome is far more likely than the other: a random person very likely has an IQ below 130 points, and the trade very likely closes at the take profit rather than at the very distant stop loss.
In both examples, we also assumed that the probability distribution is symmetrical with respect to the mean. In the first case, we assumed a normal distribution. Although this is also a very popular assumption for asset returns, in the trading example we only assumed symmetry in order to make the calculation as simple as possible.
Other metrics used alongside Accuracy
As can be seen in the previous sections, the definition of accuracy is the same in the statistics and machine learning world as it is in the trading world. Separating the explanation into two is an artificial distinction I’ve made to make the article easier to browse for both types of professionals.
In the following sections, I’ll go through the common metrics that should be used alongside accuracy in order to have a better understanding of a model or trading strategy.
Other Trading Metrics
There is a plethora of metrics that should be used in addition to accuracy. I’ll only cover a few of them and explain what extra information each one provides.
Average return per trade
Useful for measuring how robust or sensitive the strategy is to small changes in trading costs and slippage assumptions. A strategy that places few but highly profitable trades will be able to withstand trading fees better than a strategy that places lots of trades with a lower average return per trade. You can also divide this metric into Average Return of Winners and Average Loss of Losers.
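A minimal sketch of these figures, computed from a list of per-trade PnL values, could look like this. The function name, the dictionary keys, and the sample trades (nine $1 winners and one $9 loser, mirroring the take profit example) are illustrative assumptions.

```python
def trade_summary(pnls):
    """Win rate plus average PnL overall, for winners, and for losers."""
    if not pnls:
        raise ValueError("need at least one trade")

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    winners = [p for p in pnls if p > 0]
    losers = [p for p in pnls if p < 0]
    return {
        "win_rate": len(winners) / len(pnls),
        "avg_return": mean(pnls),
        "avg_winner": mean(winners),
        "avg_loser": mean(losers),
    }


# Hypothetical trade log: nine small winners and one large loser.
pnls = [1.0] * 9 + [-9.0]
summary = trade_summary(pnls)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the win rate is 0.90 while the average return per trade is 0.00, which is exactly the gap that accuracy alone hides.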
The Sharpe Ratio is a measure of performance after adjusting for risk. Two strategies can have the same return over a given period, but the one that achieves it with less exposure to risk is going to be superior to the other one.
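One common way to compute it, shown here only as a sketch, is the mean excess return divided by the standard deviation of those excess returns. The zero risk-free rate, the optional annualization factor, and the two return series are assumptions made for the example.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=1):
    """Mean excess return per unit of volatility.

    risk_free_rate is per period; periods_per_year > 1 annualizes the ratio
    (for example 252 for daily returns, by a common convention).
    """
    excess = [r - risk_free_rate for r in returns]
    volatility = statistics.stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


# Two hypothetical strategies with the same average return of 1% per period.
steady = [0.01, 0.02, 0.00, 0.01]
volatile = [0.05, -0.03, 0.04, -0.02]

print(f"steady:   {sharpe_ratio(steady):.2f}")
print(f"volatile: {sharpe_ratio(volatile):.2f}")
```

Both series earn the same on average, but the steadier one gets the higher ratio because it takes less risk to get there.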
The Maximum Drawdown is the largest peak-to-trough decline in the value of the account during a period, that is, the worst cumulative loss measured from a previous high. The results of a trading strategy might be promising in a backtesting scenario, but it is important to analyze whether the maximum drawdown would have led us to stop the strategy had it been trading live.
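A common way to measure drawdown is as the largest percentage drop of the equity curve from a previous peak. The sketch below follows that convention; the equity values are hypothetical and assumed to be strictly positive.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction of
    the peak. Assumes all equity values are strictly positive."""
    if not equity:
        raise ValueError("equity curve is empty")
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # drop from that peak
    return worst


# Hypothetical account value after each trade.
equity = [100, 110, 105, 120, 90, 95, 130]

# The worst decline is from the peak of 120 down to 90, i.e. 25%.
print(f"max drawdown: {max_drawdown(equity):.0%}")  # max drawdown: 25%
```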
Other Statistics and Machine Learning Metrics
Accuracy per class
Instead of calculating Accuracy for the entire dataset, we can do it for every individual class. This can be done as follows:
Accuracy_class = Correct_class / (Correct_class + Incorrect_class)
Although simple in nature, calculating Accuracy per Class would have identified the issue of the IQ classifier from the previous section. It would have resulted in a 100% accuracy for spotting individuals with an IQ<130 and a 0% accuracy for spotting the ones with an IQ>130.
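Applied to the always "False" IQ classifier, the per-class calculation can be sketched as follows. The labels "low" and "high" and the 1,000-person sample are hypothetical and only mirror the proportions used earlier.

```python
from collections import Counter


def accuracy_per_class(y_true, y_pred):
    """Accuracy computed separately over the samples of each true class."""
    correct = Counter()
    total = Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical sample: 977 people below 130 and 23 at or above it.
y_true = ["low"] * 977 + ["high"] * 23
# The model that never predicts a high IQ.
y_pred = ["low"] * len(y_true)

per_class = accuracy_per_class(y_true, y_pred)
print(per_class)  # {'low': 1.0, 'high': 0.0}
```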
Precision and Recall
Precision: If we consider the “True” instances of our classification as the relevant elements, precision is defined as the ratio of true positives to the total number of elements classified as relevant (both true positives and false positives). In the context of the IQ example, precision measures how many of the individuals that we identified as having a high IQ really have one.
The formula for Precision is as follows:
Precision = #True_Positives / (#True_Positives + #False_Positives)
Recall: recall tells us the percentage of “True” instances that we were able to identify. In the context of the IQ example, it tells us how many of the total high IQ individuals we were able to identify.
The formula for Recall is as follows:
Recall = #True_Positives / (#True_Positives + #False_Negatives)
Taking both metrics into account, we can clearly conclude that Accuracy in isolation is by no means a sufficient indicator.
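To make this concrete, here is a small sketch that computes both metrics from boolean labels. The classifier output is invented for illustration: it flags 40 people out of 1,000, of whom 15 truly have a high IQ.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for boolean labels.

    Either value is None when its denominator is zero, for example precision
    when the model never predicts True.
    """
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical sample: 23 high IQ individuals followed by 977 others.
y_true = [True] * 23 + [False] * 977
# Hypothetical classifier: finds 15 of the 23 and wrongly flags 25 others.
y_pred = [True] * 15 + [False] * 8 + [True] * 25 + [False] * 952

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy:  {accuracy:.1%}")   # accuracy:  96.7%
print(f"precision: {precision:.1%}")  # precision: 37.5%
print(f"recall:    {recall:.1%}")     # recall:    65.2%
```

This hypothetical classifier scores a lower accuracy than the 97.7% of the model that never predicts a high IQ, yet it is the only one of the two that finds any high IQ individuals.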
Additionally, it can also be seen that the importance of each metric depends on the specifics of each problem. In the context of the IQ example, should the classifier minimize the probability of not spotting a high IQ individual (maximize Recall), or should it minimize the number of individuals erroneously classified as having high IQs (maximize Precision)? It can be seen that this problem has no clear answer and that minimizing one error oftentimes comes at the expense of increasing the other one.
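The trade-off can be illustrated with a classifier that outputs a score and a decision threshold that we are free to move. The scores and labels below are made up for the example; raising the threshold makes the classifier more selective.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted True.
    Precision is None if nothing is predicted True."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and (not l) for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn)  # assumes at least one True label
    return precision, recall


# Hypothetical model scores (higher means "more likely True") and true labels.
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, True, False, True, False, False, True, False, False]

for threshold in (0.25, 0.50, 0.80):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold {threshold:.2f}: "
          f"precision {precision:.2f}, recall {recall:.2f}")
```

On this toy data, precision rises and recall falls as the threshold increases.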
In the context of trading, we might be interested only in forecasting strong positive movements, so that we only take trades that would, at least in principle, cover trading fees and slippage.
In such a scenario, we might be interested in maximizing the classifier’s precision. This would, in theory, improve our average return per trade, but at the expense of missing lots of potentially profitable trades. This is common practice in proprietary firms, and the usual way of dealing with it is to scale the algorithm to hundreds of assets so that it still generates enough signals.