Channel: Why is accuracy not the best measure for assessing classification models? - Cross Validated

Answer by Dikran Marsupial for Why is accuracy not the best measure for assessing classification models?


To address the original question more directly,

'Accuracy, the proportion of correct classifications among all classifications, is a very simple and very "intuitive" measure'

This is indeed true, and sometimes it is better to have a metric that the "client" (if you have one) understands than a better metric that they don't. It also has the advantage that sometimes it is the metric of primary interest for the practical application (as mentioned in my previous answer).

"yet it may be a poor measure for imbalanced data. Why does our intuition misguide us here and are there any other problems with this measure?"

This is also true, but there is an easy fix for our intuition failure. Rather than look at accuracy, consider the gain in accuracy that we can achieve by using the input data. For a binary classification problem, we could use something like:

$$score = \frac{Accuracy - \pi}{1 - \pi}$$

where $\pi$ is the prior probability (class frequency) of the majority class. Our intuition works much better with this score: a perfect classifier gives a score of 1; a classifier that assigns all patterns to the majority class ("just guessing") gets a score of 0. Any classifier that does worse than guessing the majority class gets a negative score, which is nice and easy to understand.
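As a quick sketch of the formula above (the helper name `skill_score` is my own label, not part of the answer):

```python
def skill_score(accuracy, majority_prior):
    """Accuracy rescaled against the majority-class baseline.

    Implements score = (accuracy - pi) / (1 - pi), where pi is the
    class frequency of the majority class.
    """
    return (accuracy - majority_prior) / (1.0 - majority_prior)

# Suppose 90% of examples belong to the majority class:
pi = 0.9
print(skill_score(1.00, pi))  # perfect classifier: score 1
print(skill_score(0.90, pi))  # always predict the majority class: score 0
print(skill_score(0.85, pi))  # worse than guessing: negative score
```

Note that the rescaling stretches the narrow range $[\pi, 1]$ of "useful" accuracies out to $[0, 1]$, which is why small-looking accuracy differences on imbalanced data become easy to see.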

This has been in use since at least the early 1990s (I first saw it on a training course I did as a student), and it is a special case of Cohen's kappa statistic.

The important thing here is that this is just an affine transformation of accuracy, so it tells us exactly the same thing; it just does so on a scale that is less prone to misinterpretation by our intuitive biases. It is a bit like the difference between Centigrade, Fahrenheit and Kelvin: they are all temperature scales that tell us the same thing.

I think this completely solves the issue relating to imbalanced learning tasks.

Accuracy still has problems as a model selection criterion that are not solved by this rescaling (it is discontinuous in the model parameters and can be brittle), but we should distinguish between model selection and model evaluation, and we need not use the same criterion for both (I often use the Brier score for model selection even where accuracy is a key performance metric).
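To illustrate why a smooth criterion can be preferable for model selection, here is a minimal sketch of the binary Brier score (mean squared error between predicted probabilities and 0/1 labels); unlike accuracy, it changes continuously as the model's probability estimates change:

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Binary Brier score: mean squared error between predicted
    probabilities and the 0/1 labels. Lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))

# Two models with identical accuracy (all four examples classified
# correctly at threshold 0.5) can still differ in Brier score:
y = [1, 1, 0, 0]
print(brier_score(y, [0.9, 0.8, 0.2, 0.1]))  # confident and well calibrated
print(brier_score(y, [0.6, 0.6, 0.4, 0.4]))  # same accuracy, weaker probabilities
```

Accuracy cannot separate these two models, while the Brier score prefers the better-calibrated one; that is the sense in which accuracy is brittle as a selection criterion.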

