Training and testing on different distributions

In the deep learning era, more and more teams are training on data that comes from a different distribution than their dev and test sets. There are some best practices for dealing with cases where your training and dev/test distributions differ.

Cat mobile app example

Users will upload pictures taken on their cell phones, and you want to recognize whether each picture a user uploads from the mobile app is a cat or not.

You care about the data from the mobile app, but the number of mobile samples is not large. You can collect a much larger set of cat images from another source (e.g., downloaded from web pages), but those images come from a different distribution. What can we do?

Not Good Option

Put everything together, randomly shuffle, and split the combined data into train, dev, and test sets.

The advantage is that all three sets come from the same distribution. The disadvantage is that most of the combined data comes from the larger source, so the dev and test sets end up containing mostly images you do not care about and only a small fraction of mobile app images. You would be telling your team to aim at the wrong target.

So this option is not good.
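A minimal sketch of why this split fails. The counts below (200,000 web images, 10,000 mobile images, 2,500-image dev and test sets) are illustrative assumptions, not numbers from these notes:

```python
import random

# Hypothetical counts: a large "other source" pool and a small mobile app pool.
web = ["web"] * 200_000
mobile = ["mobile"] * 10_000

combined = web + mobile
random.shuffle(combined)

# Randomly shuffled split into train / dev / test.
train = combined[:205_000]
dev = combined[205_000:207_500]
test = combined[207_500:]

frac_mobile_dev = dev.count("mobile") / len(dev)
print(f"mobile fraction in dev: {frac_mobile_dev:.2%}")  # roughly 5% on average
```

On average only about 5% of the dev set reflects the distribution you actually care about.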

Good Option

Make the dev and test sets entirely out of mobile app images (the distribution you care about), and put the remaining mobile app images into the training set together with all of the data from the other source.

The advantage is that you are now aiming the target at the distribution you care about. The disadvantage is that the distribution of the training set is different from that of the dev and test sets, but this split gives better performance in the long run.
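A minimal sketch of this split, assuming hypothetical lists web_images and mobile_images (names and sizes are not from the notes, just for illustration):

```python
import random

def split_different_distributions(web_images, mobile_images,
                                  dev_size, test_size, seed=0):
    """Dev and test contain only mobile app images; the remaining mobile
    images are mixed into the (mostly web) training set."""
    rng = random.Random(seed)
    mobile = list(mobile_images)
    rng.shuffle(mobile)

    dev = mobile[:dev_size]
    test = mobile[dev_size:dev_size + test_size]
    train = list(web_images) + mobile[dev_size + test_size:]
    rng.shuffle(train)
    return train, dev, test

# Hypothetical usage with the counts assumed above:
# train, dev, test = split_different_distributions(web, mobile, 2_500, 2_500)
```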

Speech-activated rearview mirror for a car

x = audio clip, y = transcript

Training (500,000 samples): purchased data, smart speaker control, voice keyboard
Dev/test (20,000 samples): speech-activated rearview mirror

Options