
Answer by Stephan Kolassa for Why is accuracy not the best measure for assessing classification models?

Most of the other answers focus on the example of unbalanced classes. Yes, this is important. However, I argue that accuracy is problematic even with balanced classes.

Frank Harrell has written about this on his blog: "Classification vs. Prediction" and "Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules".

Essentially, his argument is that the statistical component of your exercise ends when you output a probability for each class of your new sample. Mapping these predicted probabilities $(\hat{p}, 1-\hat{p})$ to a 0-1 classification, by choosing a threshold beyond which you classify a new observation as 1 vs. 0, is not part of the statistics any more. It is part of the decision component. And here, you need the probabilistic output of your model - but also considerations like:

  • What are the consequences of deciding to treat a new observation as class 1 vs. 0? Do I then send out a cheap marketing mail to all 1s? Or do I apply an invasive cancer treatment with big side effects?
  • What are the consequences of treating a "true" 0 as 1, and vice versa? Will I tick off a customer? Subject someone to unnecessary medical treatment?
  • Are my "classes" truly discrete? Or is there actually a continuum (e.g., blood pressure), where clinical thresholds are in reality just cognitive shortcuts? If so, how far beyond a threshold is the case I'm "classifying" right now?
  • Or does a low-but-positive probability to be class 1 actually mean "get more data", "run another test"?

Depending on the consequences of your decision, you will use a different threshold to make the decision. If the action is invasive surgery, you will require a much higher probability for your classification of the patient as suffering from something than if the action is to recommend two aspirin. Or you might even have three different decisions although there are only two classes (sick vs. healthy): "go home and don't worry" vs. "run another test because the one we have is inconclusive" vs. "operate immediately".
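To make the dependence of the threshold on consequences concrete, here is a minimal sketch. It assumes a simple setup that is not in the text above: correct decisions cost nothing, acting on a true 0 costs `cost_fp`, and not acting on a true 1 costs `cost_fn`. Under that assumption, acting has the lower expected cost exactly when $\hat{p} > \text{cost\_fp}/(\text{cost\_fp}+\text{cost\_fn})$. All function names, cost values, and the `low`/`high` cutoffs are hypothetical and chosen only for illustration.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which acting has lower expected cost than not acting.

    Assumption: correct decisions cost 0. Acting on a true 0 costs cost_fp,
    not acting on a true 1 costs cost_fn. Acting is cheaper in expectation
    when (1 - p) * cost_fp < p * cost_fn, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(p_hat: float, cost_fp: float, cost_fn: float) -> int:
    """Return 1 (act) or 0 (do not act) for a predicted probability p_hat."""
    return 1 if p_hat > cost_optimal_threshold(cost_fp, cost_fn) else 0


def triage(p_hat: float, low: float, high: float) -> str:
    """Three decisions from two classes, using two hypothetical cutoffs."""
    if p_hat < low:
        return "go home and don't worry"
    if p_hat < high:
        return "run another test"
    return "operate immediately"


p_hat = 0.3  # the same predicted probability of being sick in both scenarios

# Invasive surgery: unnecessary treatment is very costly (made-up numbers).
surgery = decide(p_hat, cost_fp=50.0, cost_fn=10.0)

# Two aspirin: unnecessary treatment is cheap (made-up numbers).
aspirin = decide(p_hat, cost_fp=1.0, cost_fn=10.0)

print(surgery, aspirin, triage(p_hat, low=0.1, high=0.8))
```

The same $\hat{p}$ leads to different actions in the two scenarios, which is why the threshold belongs to the decision component and not to the model.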

The correct way of assessing predicted probabilities $(\hat{p}, 1-\hat{p})$ is not to compare them to a threshold, map them to $(0,1)$ based on the threshold and then assess the transformed $(0,1)$ classification. Instead, one should use proper scoring rules. These are loss functions that map predicted probabilities and corresponding observed outcomes to loss values, which are minimized in expectation by the true probabilities $(p,1-p)$. The idea is that we take the average over the scoring rule evaluated on multiple (ideally many) observed outcomes and the corresponding predicted class membership probabilities, as an estimate of the expectation of the scoring rule.
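As an illustration of averaging a scoring rule over observed outcomes, the sketch below implements two common strictly proper scoring rules for binary outcomes, the Brier score and the log loss, in plain Python. The function names, the clipping constant `eps`, and the toy data are assumptions made for this example; neither rule is singled out in the text above.

```python
import math


def brier_score(p_hat, y):
    """Mean squared difference between predicted P(class 1) and the 0/1 outcome."""
    return sum((p - obs) ** 2 for p, obs in zip(p_hat, y)) / len(y)


def log_loss(p_hat, y, eps=1e-15):
    """Mean negative log-likelihood of the 0/1 outcomes under the predictions.

    Predictions are clipped to [eps, 1 - eps] so the result stays finite;
    the clipping is a numerical convenience, not part of the definition.
    """
    total = 0.0
    for p, obs in zip(p_hat, y):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if obs == 1 else math.log(1.0 - p)
    return total / len(y)


# Toy data: predicted probabilities of class 1 and the observed classes.
p_hat = [0.9, 0.2, 0.6]
y = [1, 0, 1]

print(brier_score(p_hat, y))
print(log_loss(p_hat, y))
```

Both functions take the predicted probabilities directly. No threshold appears anywhere, and lower values are better.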

Note that "proper" here has a precisely defined meaning - there are improper scoring rules as well as proper scoring rules and finally strictly proper scoring rules. Scoring rules as such are loss functions of predictive densities and outcomes. Proper scoring rules are scoring rules that are minimized in expectation if the predictive density is the true density. Strictly proper scoring rules are scoring rules that are only minimized in expectation if the predictive density is the true density.
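These definitions can be written compactly. The notation is an assumption introduced here for illustration: $S(q, y)$ is the loss assigned to predictive distribution $q$ when outcome $y$ is observed, and $p$ is the true distribution. $S$ is proper if

$$\mathbb{E}_{Y \sim p}\, S(p, Y) \;\le\; \mathbb{E}_{Y \sim p}\, S(q, Y) \quad \text{for all } q,$$

and strictly proper if equality holds only for $q = p$.

As a worked example, take the Brier score for a binary outcome, $S(q, y) = (q - y)^2$, where $q$ is the predicted probability of class 1 and $p$ the true one. Its expectation is

$$\mathbb{E}_{Y \sim p}\, S(q, Y) = p\,(1-q)^2 + (1-p)\,q^2, \qquad \frac{d}{dq}\,\mathbb{E}_{Y \sim p}\, S(q, Y) = -2p\,(1-q) + 2(1-p)\,q = 2\,(q - p).$$

The derivative vanishes only at $q = p$ and the second derivative is $2 > 0$, so the expected score has a unique minimum at the true probability. The Brier score is therefore strictly proper.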

As Frank Harrell notes, accuracy is an improper scoring rule. (More precisely, accuracy is not even a scoring rule at all: see my answer to "Is accuracy an improper scoring rule in a binary classification setting?") This can be seen, e.g., if we have no predictors at all and just a flip of an unfair coin with probabilities $(0.6,0.4)$. Accuracy is maximized if we classify everything as the first class and completely ignore the 40% probability that any outcome might be in the second class. (Here we see that accuracy is problematic even for balanced classes.) Proper scoring rules will prefer a $(0.6,0.4)$ prediction to the $(1,0)$ one in expectation. In particular, accuracy is discontinuous in the threshold: moving the threshold a tiny little bit may make one (or multiple) predictions change classes and change the entire accuracy by a discrete amount. This makes little sense.
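Both points can be checked numerically. The sketch below computes, for the unfair coin with true probabilities $(0.6, 0.4)$, the expected accuracy, expected Brier score and expected log loss of a predicted probability $q$ for the first class. It then shows the jump in accuracy when a threshold moves slightly. The choice of Brier score and log loss, the 0.5 threshold in `expected_accuracy`, the grid, and the four-observation toy data are all assumptions made for illustration.

```python
import math

P_TRUE = 0.6  # true probability of the first class (the unfair coin)


def expected_brier(q, p=P_TRUE):
    """Expected Brier score of predicting q when the truth is p."""
    return p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2


def expected_log_loss(q, p=P_TRUE):
    """Expected log loss of predicting q; infinite at q = 0 or 1 when 0 < p < 1."""
    if q in (0.0, 1.0):
        return math.inf
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))


def expected_accuracy(q, p=P_TRUE, threshold=0.5):
    """Expected accuracy when q is mapped to a class via the threshold."""
    return p if q > threshold else 1.0 - p


# Accuracy cannot tell the honest (0.6, 0.4) forecast from the overconfident (1, 0).
print(expected_accuracy(0.6), expected_accuracy(1.0))

# The two proper scoring rules are minimized at the true probability.
grid = [i / 100 for i in range(101)]
best_brier = min(grid, key=expected_brier)
best_log = min(grid, key=expected_log_loss)
print(best_brier, best_log)


def accuracy(p_hat, y, threshold):
    """Share of observations where (p_hat > threshold) matches the 0/1 outcome."""
    hits = sum((p > threshold) == bool(obs) for p, obs in zip(p_hat, y))
    return hits / len(y)


# Discontinuity: a small move of the threshold flips one prediction.
p_hat = [0.2, 0.45, 0.55, 0.8]
y = [0, 1, 0, 1]
acc_low = accuracy(p_hat, y, threshold=0.44)
acc_high = accuracy(p_hat, y, threshold=0.46)
print(acc_low, acc_high)
```

In this toy data, moving the threshold from 0.44 to 0.46 changes the accuracy by a quarter, because a single observation switches class. The Brier score and log loss of the same predictions do not depend on any threshold.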

More information can be found at Frank's two blog posts linked to above, as well as in Chapter 10 of Frank Harrell's Regression Modeling Strategies.

(This is shamelessly cribbed from an earlier answer of mine.)


EDIT. My answer to "Example when using accuracy as an outcome measure will lead to a wrong conclusion" gives a hopefully illustrative example where maximizing accuracy can lead to wrong decisions even for balanced classes.

