One problem with accuracy is that it ignores the intrinsic difficulty of the data-generating mechanism. Here, difficulty refers to the uncertainty of the label, which can be measured by its variance. The point is that when assessing a classifier, a misclassification can be due to high uncertainty in the data-generating mechanism rather than a flaw of the model.
For instance, suppose we predict $Y\in\{0,1\}$ from $X\in \mathcal X$, and assume there is an unknown data-generating probability function $P(Y=1\mid X=x)$, $x\in\mathcal X$. The conditional variance is $$\mathrm{Var}(Y\mid X=x)=P(Y=1\mid X=x)\bigl(1-P(Y=1\mid X=x)\bigr),$$ which is maximized when $P(Y=1\mid X=x)=0.5$ and minimized when $P(Y=1\mid X=x)=0$ or $1$. Suppose $\exists\, x_1,x_2\in\mathcal X$ such that $$P(Y=1\mid X=x)=\begin{cases}0.5&\ \text{when } x=x_1,\\0.99&\ \text{when } x=x_2.\end{cases}$$ When evaluating the performance of a classifier, a misclassification at $x_1$ should be considered less severe than one at $x_2$: at $x_1$ the label is essentially a coin flip, so even the best possible classifier errs half the time there. However, this heteroscedasticity is not reflected in accuracy. Moreover, the best possible accuracy, attained by the Bayes classifier that always predicts the more likely class and equal to $\mathbb{E}\bigl[\max\{P(Y=1\mid X),\,1-P(Y=1\mid X)\}\bigr]$, depends on $P(Y=1\mid X=x)$ and can differ greatly across data sets.
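To make this concrete, here is a minimal simulation sketch of the two-point example above (assuming, for illustration, a 50/50 mix of $x_1$ and $x_2$; the variable names are ours). It shows that even the Bayes classifier cannot exceed the accuracy cap imposed by the label uncertainty at $x_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two feature values with very different label uncertainty,
# matching the example above: p = 0.5 at x1, p = 0.99 at x2.
p_true = {"x1": 0.5, "x2": 0.99}

# Draw features uniformly from {x1, x2} (an assumption for this
# sketch), then draw labels from P(Y=1 | X=x).
x = rng.choice(["x1", "x2"], size=n)
p = np.where(x == "x1", p_true["x1"], p_true["x2"])
y = rng.binomial(1, p)

# The Bayes classifier predicts the more likely class at each x;
# no classifier has higher expected accuracy.
y_bayes = (p >= 0.5).astype(int)

acc = (y_bayes == y).mean()
cap = np.mean(np.maximum(p, 1 - p))  # E[max(p, 1 - p)]
print(f"Bayes accuracy:  {acc:.3f}")  # ~0.745
print(f"theoretical cap: {cap:.3f}")  # 0.5*0.5 + 0.5*0.99 = 0.745
```

Roughly half of the Bayes classifier's errors occur at $x_1$, yet accuracy alone cannot tell us whether those errors reflect a bad model or an irreducibly noisy label.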
To take the unknown $P(Y=1\mid X=x)$ into account and assess how well a classifier performs compared with the best possible performance, we may apply goodness-of-fit tests, e.g., Pearson's chi-squared test, the residual deviance test, and the Hosmer–Lemeshow test. Although most goodness-of-fit tests apply only to parametric models, a recent work, *Is a Classification Procedure Good Enough?—A Goodness-of-Fit Assessment Tool for Classification Learning*, addresses the evaluation problem for general classifiers.
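As an illustration of the goodness-of-fit idea, here is a minimal sketch of the Hosmer–Lemeshow test (the function `hosmer_lemeshow` and the simulated data are our own illustration, not code from the work cited above). It bins observations by predicted probability and compares observed with expected event counts in each bin:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y_true, p_hat, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit statistic and p-value.

    Sorts observations by predicted probability, splits them into
    n_groups bins, and compares observed vs. expected event counts.
    """
    order = np.argsort(p_hat)
    y_true = np.asarray(y_true)[order]
    p_hat = np.asarray(p_hat)[order]
    bins = np.array_split(np.arange(len(p_hat)), n_groups)

    stat = 0.0
    for idx in bins:
        obs = y_true[idx].sum()   # observed events in the bin
        exp = p_hat[idx].sum()    # expected events in the bin
        n = len(idx)
        pbar = exp / n            # mean predicted probability
        stat += (obs - exp) ** 2 / (n * pbar * (1 - pbar))

    # Conventionally compared to chi-squared with n_groups - 2 df.
    p_value = stats.chi2.sf(stat, df=n_groups - 2)
    return stat, p_value

# Usage: well-calibrated probabilities should yield a large p-value,
# while distorted ones should be rejected.
rng = np.random.default_rng(1)
x = rng.normal(size=5_000)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))  # true P(Y=1 | X=x)
y = rng.binomial(1, p)
print(hosmer_lemeshow(y, p))            # calibrated: large p-value
print(hosmer_lemeshow(y, p ** 2))       # miscalibrated: tiny p-value
```

A large test statistic (small p-value) indicates that the predicted probabilities disagree with the observed outcomes, i.e., the model's fit to the unknown $P(Y=1\mid X=x)$ is poor, regardless of its raw accuracy.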