
Answer by mroman for Why is accuracy not the best measure for assessing classification models?

The problem with accuracy

Standard accuracy is defined as the ratio of correct classifications to the total number of classifications made.

\begin{align*}accuracy := \frac{\text{correct classifications}}{\text{number of classifications}}\end{align*}

It is thus an overall measure over all classes, and as we'll shortly see it's not a good measure to tell an oracle apart from an actually useful test. An oracle is a classification function that returns a random guess for each sample. Likewise, we want to be able to rate the classification performance of our classification function. Accuracy can be a useful measure if we have the same number of samples per class, but if we have an imbalanced set of samples accuracy isn't useful at all. Even more so, a test can have a high accuracy but actually perform worse than a test with a lower accuracy.

If we have a distribution of samples such that $90\%$ of the samples belong to class $\mathcal{A}$, $5\%$ to $\mathcal{B}$ and another $5\%$ to $\mathcal{C}$, then the following classification function will have an accuracy of $0.9$:

\begin{align*}classify(sample) := \begin{cases}\mathcal{A} & \text{always}\end{cases}\end{align*}

Yet, it is obvious, given that we know how $classify$ works, that it can not tell the classes apart at all. Likewise, we can construct a classification function

\begin{align*}classify(sample) := \text{guess} \begin{cases}\mathcal{A} & \text{with } p = 0.96 \\ \mathcal{B} & \text{with } p = 0.02 \\ \mathcal{C} & \text{with } p = 0.02\end{cases}\end{align*}

which has an accuracy of $0.96 \cdot 0.9 + 0.02 \cdot 0.05 \cdot 2 = 0.866$ and will not always predict $\mathcal{A}$, but still, given that we know how $classify$ works, it is obvious that it can not tell the classes apart. Accuracy in this case only tells us how good our classification function is at guessing. This means that accuracy is not a good measure to tell an oracle apart from a useful test.
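To see this numerically, here is a minimal sketch in Python; the class labels, sample counts and guessing probabilities are the illustrative numbers from above, and both classifiers ignore their input entirely (so we can simply pass each sample's true class):

```python
import random

random.seed(0)

# 90% A, 5% B, 5% C -- the imbalanced distribution from above
samples = ["A"] * 9000 + ["B"] * 500 + ["C"] * 500

def classify_constant(sample):
    """Always replies A, regardless of the input."""
    return "A"

def classify_guess(sample):
    """Ignores the input and guesses with p = 0.96 / 0.02 / 0.02."""
    return random.choices(["A", "B", "C"], weights=[0.96, 0.02, 0.02])[0]

def accuracy(classifier, samples):
    correct = sum(1 for true_class in samples if classifier(true_class) == true_class)
    return correct / len(samples)

print(accuracy(classify_constant, samples))  # 0.9 exactly
print(accuracy(classify_guess, samples))     # ~0.866 in expectation
```

Both numbers look respectable, yet neither classifier ever looked at its input.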

Accuracy per Class

We can compute the accuracy individually per class by giving our classification function only samples from the same class, counting the number of correct and incorrect classifications, and then computing $accuracy := \text{correct}/(\text{correct} + \text{incorrect})$. We repeat this for every class. If we have a classification function that can accurately recognize class $\mathcal{A}$ but will output a random guess for the other classes, then this results in an accuracy of $1.00$ for $\mathcal{A}$ and an accuracy of $0.33$ for the other classes. This already provides us a much better way to judge the performance of our classification function. An oracle always guessing the same class will produce a per-class accuracy of $1.00$ for that class, but $0.00$ for the other classes. If our test is useful, all the per-class accuracies should be $>0.5$; otherwise, our test isn't better than chance. However, accuracy per class does not take false positives into account. Even though our classification function has a $100\%$ accuracy for class $\mathcal{A}$, there will also be false positives for $\mathcal{A}$ (such as a $\mathcal{B}$ wrongly classified as an $\mathcal{A}$). A sketch of this computation follows below.
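A minimal sketch of the per-class procedure, assuming the classifier described above (perfect on $\mathcal{A}$, uniform random guess otherwise); the sample counts are again illustrative:

```python
import random
from collections import defaultdict

random.seed(0)
samples = ["A"] * 9000 + ["B"] * 500 + ["C"] * 500

def classify(true_class):
    """Recognizes A perfectly, guesses uniformly at random for B and C."""
    return "A" if true_class == "A" else random.choice(["A", "B", "C"])

def per_class_accuracy(classifier, samples):
    correct, total = defaultdict(int), defaultdict(int)
    for true_class in samples:
        total[true_class] += 1
        correct[true_class] += classifier(true_class) == true_class
    return {c: correct[c] / total[c] for c in sorted(total)}

print(per_class_accuracy(classify, samples))
# roughly {'A': 1.0, 'B': 0.33, 'C': 0.33}
```

Unlike overall accuracy, this immediately exposes the guessing on $\mathcal{B}$ and $\mathcal{C}$.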

Sensitivity and Specificity

In medical tests, sensitivity is defined as the ratio between the number of people correctly identified as having the disease and the number of people actually having the disease. Specificity is defined as the ratio between the number of people correctly identified as healthy and the number of people who are actually healthy. The number of people actually having the disease is the number of true positive test results plus the number of false negative test results. The number of actually healthy people is the number of true negative test results plus the number of false positive test results.

Binary Classification

In binary classification problems there are two classes, $\mathcal{P}$ and $\mathcal{N}$. $T_{n}$ refers to the number of samples that were correctly identified as belonging to class $n$ and $F_{n}$ refers to the number of samples that were falsely identified as belonging to class $n$. In this case, sensitivity and specificity are defined as follows:

\begin{align*}sensitivity &:= \frac{T_{\mathcal{P}}}{T_{\mathcal{P}}+F_{\mathcal{N}}} \\ specificity &:= \frac{T_{\mathcal{N}}}{T_{\mathcal{N}}+F_{\mathcal{P}}}\end{align*}

$T_{\mathcal{P}}$ being the true positives, $F_{\mathcal{N}}$ being the false negatives, $T_{\mathcal{N}}$ being the true negatives and $F_{\mathcal{P}}$ being the false positives. Thinking in terms of negatives and positives is fine for medical tests, but to get a better intuition we should think in terms of generic classes $\alpha$ and $\beta$ instead. Then, we can say that the number of samples correctly identified as belonging to $\alpha$ is $T_{\alpha}$ and the number of samples that actually belong to $\alpha$ is $T_{\alpha} + F_{\beta}$. The number of samples correctly identified as not belonging to $\alpha$ is $T_{\beta}$ and the number of samples actually not belonging to $\alpha$ is $T_{\beta} + F_{\alpha}$. This gives us the sensitivity and specificity for $\alpha$, but we can also apply the same thing to the class $\beta$. The number of samples correctly identified as belonging to $\beta$ is $T_{\beta}$ and the number of samples actually belonging to $\beta$ is $T_{\beta} + F_{\alpha}$. The number of samples correctly identified as not belonging to $\beta$ is $T_{\alpha}$ and the number of samples actually not belonging to $\beta$ is $T_{\alpha} + F_{\beta}$. We thus get a sensitivity and specificity per class:

\begin{align*}sensitivity_{\alpha} &:= \frac{T_{\alpha}}{T_{\alpha}+F_{\beta}} \\ specificity_{\alpha} &:= \frac{T_{\beta}}{T_{\beta} + F_{\alpha}} \\ sensitivity_{\beta} &:= \frac{T_{\beta}}{T_{\beta}+F_{\alpha}} \\ specificity_{\beta} &:= \frac{T_{\alpha}}{T_{\alpha} + F_{\beta}}\end{align*}

However, we observe that $sensitivity_{\alpha} = specificity_{\beta}$ and $specificity_{\alpha} = sensitivity_{\beta}$. This means that if we only have two classes we don't need per-class sensitivity and specificity; the symmetry can be checked directly, as in the sketch below.
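A small worked example with assumed counts makes the symmetry visible:

```python
# Illustrative counts: T_a/T_b are correct classifications for alpha/beta,
# F_a/F_b are samples falsely identified as alpha/beta.
T_a, T_b = 80, 40
F_a, F_b = 10, 20

sens_alpha = T_a / (T_a + F_b)
spec_alpha = T_b / (T_b + F_a)
sens_beta  = T_b / (T_b + F_a)
spec_beta  = T_a / (T_a + F_b)

# The symmetry noted above holds by construction of the formulas.
assert sens_alpha == spec_beta
assert spec_alpha == sens_beta
```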

N-Ary Classification

Sensitivity and specificity per class isn't useful if we only have two classes, but we can extend it to multiple classes. Sensitivity and specificity are defined as:

\begin{align*}\text{sensitivity} &:= \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \\ \text{specificity} &:= \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}\end{align*}

Here $F_{i,k}$ denotes the number of samples of class $i$ that were falsely classified as class $k$. The true positives are simply $T_{n}$, the false negatives are simply $\sum_{i}(F_{n,i})$ and the false positives are simply $\sum_{i}(F_{i,n})$. Finding the true negatives is much harder, but we can say that if we correctly classify something as belonging to a class different from $n$, it counts as a true negative. This means we have at least $\sum_{i}(T_{i}) - T_{n}$ true negatives. However, these aren't all the true negatives. All the wrong classifications for a class different from $n$ are also true negatives, because they correctly weren't identified as belonging to $n$. $\sum_{i}(\sum_{k}(F_{i,k}))$ represents all wrong classifications. From this we have to subtract the cases where the input class was $n$, meaning we have to subtract the false negatives for $n$, which is $\sum_{i}(F_{n,i})$, but we also have to subtract the false positives for $n$, because they are false positives and not true negatives, so we also subtract $\sum_{i}(F_{i,n})$, finally getting $\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n})$. As a summary we have:

\begin{align*}\text{true positives} &:= T_{n} \\ \text{true negatives} &:= \sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n}) \\ \text{false positives} &:= \sum_{i}(F_{i,n}) \\ \text{false negatives} &:= \sum_{i}(F_{n,i})\end{align*}

\begin{align*}sensitivity(n) &:= \frac{T_{n}}{T_{n} + \sum_{i}(F_{n,i})} \\ specificity(n) &:= \frac{\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i}) - \sum_{i}(F_{i,n})}{\sum_{i}(T_{i}) - T_{n} + \sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{n,i})}\end{align*}
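Here is a sketch of both measures computed from a confusion matrix; the matrix layout and its counts are assumptions for illustration, and the true negatives are summed directly rather than via the subtraction formula above (the two are equivalent):

```python
# Confusion matrix: C[(i, k)] counts samples of true class i classified as k,
# so T_n = C[(n, n)] and F_{i,k} = C[(i, k)] for i != k. Counts are assumed.
C = {
    ("A", "A"): 90, ("A", "B"): 6,  ("A", "C"): 4,
    ("B", "A"): 3,  ("B", "B"): 45, ("B", "C"): 2,
    ("C", "A"): 5,  ("C", "B"): 5,  ("C", "C"): 40,
}
classes = ["A", "B", "C"]

def sensitivity(n):
    tp = C[(n, n)]                                  # T_n
    fn = sum(C[(n, k)] for k in classes if k != n)  # sum_i F_{n,i}
    return tp / (tp + fn)

def specificity(n):
    # True negatives: true class != n and reply != n.
    tn = sum(C[(i, k)] for i in classes for k in classes if i != n and k != n)
    fp = sum(C[(i, n)] for i in classes if i != n)  # sum_i F_{i,n}
    return tn / (tn + fp)

for n in classes:
    print(n, round(sensitivity(n), 3), round(specificity(n), 3))
```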

Introducing Confidence

We define a $confidence^{\top}$, which is a measure of how confident we can be that the reply of our classification function is actually correct. $T_{n} + \sum_{i}(F_{i,n})$ are all the cases where the classification function replied with $n$, but only $T_{n}$ of those are correct. We thus define

\begin{align*}confidence^{\top}(n) := \frac{T_{n}}{T_{n}+\sum_{i}(F_{i,n})}\end{align*}

But can we also define a $confidence^{\bot}$, which is a measure of how confident we can be that, if our classification function responds with a class different from $n$, the sample actually wasn't an $n$?

Well, there are $\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}$ such replies, all of which are correct except for the $\sum_{i}(F_{n,i})$ false negatives. Thus, we define

\begin{align*}confidence^{\bot}(n) = \frac{\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}-\sum_{i}(F_{n,i})}{\sum_{i}(\sum_{k}(F_{i,k})) - \sum_{i}(F_{i,n}) + \sum_{i}(T_{i}) - T_{n}}\end{align*}
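As a sketch, both measures can be read off the same (assumed) confusion matrix as before; $confidence^{\top}$ corresponds to what is commonly called precision (positive predictive value) and $confidence^{\bot}$ to the negative predictive value, generalized to class $n$:

```python
# Reuses the illustrative confusion matrix C and class list from the
# previous sketch. confidence_top(n): fraction of "n" replies that are
# correct. confidence_bot(n): fraction of non-"n" replies whose sample
# truly wasn't an n.
def confidence_top(n):
    tp = C[(n, n)]                                  # T_n
    fp = sum(C[(i, n)] for i in classes if i != n)  # sum_i F_{i,n}
    return tp / (tp + fp)

def confidence_bot(n):
    replies_not_n = sum(C[(i, k)] for i in classes for k in classes if k != n)
    fn = sum(C[(n, k)] for k in classes if k != n)  # sum_i F_{n,i}
    return (replies_not_n - fn) / replies_not_n

for n in classes:
    print(n, round(confidence_top(n), 3), round(confidence_bot(n), 3))
```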

