Here is a somewhat adversarial counter-example, where accuracy is better than a proper scoring rule, based on @Benoit_Sanchez's neat thought experiment:
You own an egg shop and each egg you sell generates a net revenue of 2 dollars. Each customer who enters the shop may either buy an egg or leave without buying any. For some customers you can decide to make a discount and you will only get 1 dollar revenue but then the customer will always buy.
You plug a webcam that analyses the customer behaviour with features such as "sniffs the eggs", "holds a book with omelette recipes"... and classify them into "wants to buy at 2 dollars" (positive) and "wants to buy only at 1 dollar" (negative) before he leaves.
If your classifier makes no mistake, then you get the maximum revenue you can expect. If it's not perfect, then:
- for every false positive you lose 1 dollar because the customer leaves and you didn't try to make a successful discount
- for every false negative you lose 1 dollar because you make a useless discount
Then the accuracy of your classifier is exactly how close you are to the maximum revenue. It is the perfect measure.
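To make the payoff concrete, here is a minimal sketch of the game in Python (the function and variable names are mine, not from the original thought experiment). Predicting positive means charging full price; predicting negative means offering the discount. The assertion checks that each mistake, of either kind, costs exactly one dollar relative to the maximum achievable revenue, which is why accuracy and revenue are interchangeable here.

```python
def revenue(y_true, y_pred):
    """Egg-shop revenue: predict 1 = 'will pay 2 dollars' (no discount),
    predict 0 = 'only buys at 1 dollar' (offer the discount)."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:                             # no discount offered
            total += 2.0 if truth == 1 else 0.0   # false positive: customer leaves
        else:                                     # discount offered, sale guaranteed
            total += 1.0                          # false negative wastes 1 dollar
    return total

y_true = [1, 1, 0, 0, 1]                    # toy ground truth
y_pred = [1, 0, 0, 1, 1]                    # one false negative, one false positive
max_rev = sum(2 if t else 1 for t in y_true)          # perfect-classifier revenue
errors = sum(t != p for t, p in zip(y_true, y_pred))
assert revenue(y_true, y_pred) == max_rev - errors    # each mistake costs 1 dollar
```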
So say we record the amount of time the customer spends "sniffing eggs" and "holding a book with omelette recipes" and make ourselves a classification task:
This is actually my version of Brian Ripley's synthetic benchmark dataset, but let's pretend it is the data for our task. As this is a synthetic task, I can work out the probabilities of class membership according to the true data generating process:
Unfortunately it is upside-down because I couldn't work out how to fix it in MATLAB, but please bear with me. Now in practice, we won't get a perfect model, so here is a model with an error (I have just perturbed the true posterior probabilities with a Gaussian bump).
And here is another one, with a bump in a different place.
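The figures came out of MATLAB; as a stand-in, here is a rough Python sketch of the same setup. The mixture means and variance below are illustrative Ripley-style parameters rather than the exact generating process, and the Gaussian bump is the same kind of perturbation described above, with made-up centre, amplitude and width.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two-class Gaussian-mixture data in the spirit of Ripley's synthetic set
# (these means and this variance are assumptions, not the exact values).
MEANS = {0: [(-0.3, 0.7), (0.4, 0.7)],
         1: [(-0.7, 0.3), (0.3, 0.3)]}
VAR = 0.03

def class_density(x, label):
    """Equal-weight two-component Gaussian mixture density for one class."""
    return 0.5 * sum(multivariate_normal.pdf(x, mean=m, cov=VAR * np.eye(2))
                     for m in MEANS[label])

def true_posterior(x):
    """P(class 1 | x) by Bayes' rule, assuming equal class priors."""
    p0, p1 = class_density(x, 0), class_density(x, 1)
    return p1 / (p0 + p1)

def bumped_posterior(x, centre, amp=0.3, width=0.15):
    """The true posterior perturbed by a localised Gaussian error bump."""
    bump = amp * np.exp(-np.sum((np.asarray(x) - centre) ** 2) / (2 * width ** 2))
    return float(np.clip(true_posterior(x) + bump, 0.0, 1.0))
```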
Now the Brier score is a proper scoring rule, and it gives a slightly lower (better) score for the second model (because the perturbation is in a region of slightly lower density). However, the perturbation in the first model is well away from the decision boundary, and so that one has a higher accuracy.
Since in this particular application the accuracy maps directly onto our financial gain in dollars, the Brier score is selecting the wrong model, and we will lose money.
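Continuing the sketch above, both metrics are straightforward to compute. Whether they disagree depends on exactly where each bump sits relative to the decision boundary and to the data density, as just described; the bump centres and sample size below are made up for illustration.

```python
rng = np.random.default_rng(0)

def sample(n):
    """Draw n labelled points from the mixture defined above."""
    y = rng.integers(0, 2, n)
    comp = rng.integers(0, 2, n)                 # which mixture component
    x = np.array([rng.multivariate_normal(MEANS[yi][ci], VAR * np.eye(2))
                  for yi, ci in zip(y, comp)])
    return x, y

x, y = sample(5000)
models = {
    "bump away from the boundary": np.array(
        [bumped_posterior(xi, np.array([0.4, 0.9])) for xi in x]),
    "bump near the boundary": np.array(
        [bumped_posterior(xi, np.array([0.0, 0.5])) for xi in x]),
}
for name, p in models.items():
    brier = np.mean((p - y) ** 2)      # proper scoring rule (lower is better)
    acc = np.mean((p > 0.5) == y)      # accuracy at the usual 0.5 threshold
    print(f"{name}: Brier = {brier:.4f}, accuracy = {acc:.4f}")
```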
Vapnik's advice that it is often better to form a purely discriminative classifier directly (rather than estimate a probability and threshold it) is based on this sort of situation. If all we are interested in is making a binary decision, then we don't really care what the classifier does away from the decision boundary, so we shouldn't waste resources modelling features of the data distribution that don't affect the decision.
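As a sketch of that discriminative approach, here is an SVM fitted to samples from the generator above (scikit-learn's SVC, with untuned, illustrative hyperparameters). It learns a decision rule directly and never estimates posterior probabilities, so no capacity is spent away from the boundary.

```python
from sklearn.svm import SVC

x_train, y_train = sample(250)
x_test, y_test = sample(1000)

# Fit the decision boundary directly; no probability estimates involved.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(x_train, y_train)
print("SVM accuracy:", svm.score(x_test, y_test))
```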
That "if all we are interested in is making a binary decision" is a laconic "if", though. If it is a classification task with fixed misclassification costs, no covariate shift, and known, constant operational class priors, then this approach may indeed be better (and the success of the SVM in many practical applications is some evidence of that). However, many applications are not like that: we may not know ahead of time what the misclassification costs are, or equivalently the operational class frequencies. In those applications we are much better off with a probabilistic classifier, setting the threshold appropriately according to the operational conditions.
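For example, with calibrated probabilities the threshold follows directly from the costs: predicting positive whenever P(positive | x) exceeds c_FP / (c_FP + c_FN) minimises the expected loss, so a change in operational costs only moves the threshold rather than forcing a retrain. The cost figures below are hypothetical.

```python
def optimal_threshold(cost_fp, cost_fn):
    """Bayes-optimal threshold on P(positive | x) for the given error costs:
    predict positive when the posterior probability exceeds this value."""
    return cost_fp / (cost_fp + cost_fn)

print(optimal_threshold(1, 1))   # egg shop: both mistakes cost 1 dollar -> 0.5
print(optimal_threshold(3, 1))   # if a walk-out instead cost 3 dollars -> 0.75
```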
Whether accuracy is a good performance metric depends on the needs of the application; there is no "one size fits all" policy. We need to understand the tools we use, be aware of their advantages and pitfalls, and consider the purpose of the exercise when choosing the right tool from the toolbox. In this example, the problem with the Brier score is that it ignores the true needs of the application, and no amount of adjusting the threshold will compensate for its selection of the wrong model.
It is also important to make a distinction between performance evaluation and model selection: they are not the same thing, and sometimes (often?) it is better to use a proper scoring rule for model selection in order to achieve maximum performance according to your metric of real interest (e.g. accuracy), not least because accuracy is a discontinuous count of errors and hence a noisier criterion to optimise.
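Concretely (reusing the sketch above, with arbitrary split sizes), that separation might look like choosing between candidate probability models by Brier score on a validation set, then reporting the metric of real interest on a held-out test set:

```python
x_val, y_val = sample(1000)
x_test, y_test = sample(1000)

def brier(p, y):
    return np.mean((p - y) ** 2)

# The candidates are the two bumped models from the sketch above.
candidates = {"model 1": np.array([0.4, 0.9]),
              "model 2": np.array([0.0, 0.5])}

# Model selection: pick the candidate with the best (lowest) proper score...
val_probs = {name: np.array([bumped_posterior(xi, c) for xi in x_val])
             for name, c in candidates.items()}
best = min(val_probs, key=lambda name: brier(val_probs[name], y_val))

# ...then performance evaluation: report the metric the application cares about.
test_p = np.array([bumped_posterior(xi, candidates[best]) for xi in x_test])
print(best, "held-out accuracy:", np.mean((test_p > 0.5) == y_test))
```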