When starting out on a new project, I try to quickly choose dev/test sets, since this gives the team a well-defined target to aim for.
Cat dataset example: Suppose that for your cat application, your metric is classification accuracy. This metric currently ranks classifier A as superior to classifier B. But suppose you try out both algorithms and find that classifier A is allowing occasional pornographic images to slip through. Even though classifier A is more accurate, the bad impression left by the occasional pornographic image means its performance is unacceptable. What do you do? Here, the metric is failing to identify the fact that classifier B is actually better than classifier A for your product, so you can no longer trust the metric to pick the best algorithm. It is time to change evaluation metrics.
Metric: classification error
Classifier A: 3% error, but lets some pornographic images through
Classifier B: 5% error
The metric and dev set prefer A, but users prefer B. In this case, you want to add a weight to the error metric.
Error:

$$\text{Error} = \frac{1}{m_{dev}} \sum_{i=1}^{m_{dev}} \mathcal{I}\{y_{pred}^{(i)} \neq y^{(i)}\}$$

where $y_{pred}^{(i)}$ is a predicted value (0 or 1) and $\mathcal{I}\{\cdot\}$ is the indicator function. Add a weight $w^{(i)}$ (e.g., $w^{(i)} = 1$ if $x^{(i)}$ is non-pornographic and $w^{(i)} = 10$ if $x^{(i)}$ is pornographic), so this gives larger cost for pornographic images. The normalization $\frac{1}{m_{dev}}$ should be changed to $\frac{1}{\sum_i w^{(i)}}$ so we get the cost between 0 and 1. So,

$$\text{Error} = \frac{1}{\sum_i w^{(i)}} \sum_{i=1}^{m_{dev}} w^{(i)} \, \mathcal{I}\{y_{pred}^{(i)} \neq y^{(i)}\}$$
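As an illustrative sketch of this weighted metric in code (the weight of 10 for pornographic images and the array values are assumptions for the example, not prescribed values):

```python
import numpy as np

def weighted_error(y_pred, y_true, is_porn, porn_weight=10.0):
    """Weighted dev-set error: mistakes on pornographic images cost more.

    y_pred, y_true: arrays of 0/1 labels
    is_porn: boolean array, True where the image is pornographic
    porn_weight: illustrative weight; 1.0 recovers plain classification error
    """
    w = np.where(is_porn, porn_weight, 1.0)        # w^(i): 1 for normal images, larger for porn
    mistakes = (y_pred != y_true).astype(float)    # indicator I{y_pred != y}
    return np.sum(w * mistakes) / np.sum(w)        # normalized so the error stays in [0, 1]

# Toy example: A makes one mistake, but on a pornographic image; B makes two ordinary mistakes
y_true  = np.array([1, 0, 1, 1, 0])
pred_a  = np.array([1, 1, 1, 1, 0])
pred_b  = np.array([1, 0, 0, 1, 1])
is_porn = np.array([False, True, False, False, False])

print(weighted_error(pred_a, y_true, is_porn))  # ~0.71: heavily penalized for the porn mistake
print(weighted_error(pred_b, y_true, is_porn))  # ~0.14: more mistakes, but lower weighted error
```

Under the weighted metric, B now ranks ahead of A, which matches what users prefer.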
Suppose your initial dev/test set had mainly pictures of adult cats. You ship your cat app, and find that users are uploading lower-resolution images than expected.
Users care about the images they upload.
So, the dev/test set distribution is not representative of the actual distribution you need to do well on. In this case, update your dev/test sets to be more representative.
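A minimal sketch of this update, assuming you can collect and label a sample of actual user uploads (the path names, split sizes, and helper name below are made up for illustration):

```python
import random

def rebuild_dev_test(user_image_paths, dev_size=1000, test_size=1000, seed=0):
    """Rebuild dev/test sets from a labeled sample of actual user uploads,
    so the evaluation distribution matches what the app sees in production."""
    rng = random.Random(seed)
    paths = list(user_image_paths)
    rng.shuffle(paths)                              # avoid ordering bias (e.g., by upload date)
    dev = paths[:dev_size]
    test = paths[dev_size:dev_size + test_size]
    return dev, test

# Hypothetical usage with made-up path names standing in for logged user uploads
uploads = [f"user_uploads/img_{i:05d}.jpg" for i in range(5000)]
dev_set, test_set = rebuild_dev_test(uploads)
print(len(dev_set), len(test_set))  # 1000 1000
```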
So, if doing well on your metric + dev/test set does not correspond to doing well on your application, you need to change your metric and/or dev/test set.