Addressing data mismatch

If your training set comes from a different distribution than your dev and test sets, and error analysis shows that you have a data mismatch problem, what can you do?

Carry out manual error analysis to try to understand the differences between the training and dev/test sets
- Example: the dev set may be noisier
- Look into how the dev set differs from the train-dev set.
Make the training data more similar to the dev/test sets, or collect more data similar to the dev/test sets
- Example: simulate noisy training data to match the dev set.
- Example: artificial data synthesis

Artificial data synthesis

Example 1

Let’s say you have 10,000 hours of audio data and 1 hour of car noise.

If you repeat the 1 hour of car noise 10,000 times to cover all the audio, the model may overfit to that 1 hour of car noise.
Getting 10,000 hours of unique car noise would probably make the model more robust.
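
As a rough illustration of the mixing step, here is a minimal sketch that overlays a randomly chosen noise segment on a clean clip. It is not from the course: the names `mix_with_noise` and `mean_power` are hypothetical, the target signal-to-noise ratio `snr_db` is an added assumption, and the tone and Gaussian noise are stand-ins for real recordings.

```python
import math
import random

def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def mix_with_noise(clean, noise, snr_db, rng):
    """Overlay a randomly chosen noise segment on a clean clip at a target SNR in dB."""
    if len(noise) < len(clean):
        # Loop the noise when it is shorter than the clip.
        reps = math.ceil(len(clean) / len(noise))
        noise = list(noise) * reps
    # Random start offset, so different clips get different parts of the noise.
    start = rng.randrange(len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]
    noise_power = mean_power(segment)
    if noise_power == 0:
        raise ValueError("noise segment is silent")
    # Scale the noise so that clean power / noise power matches the target SNR.
    scale = math.sqrt(mean_power(clean) / (noise_power * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, segment)]

rng = random.Random(0)
# Stand-ins for real recordings (assumed 16 kHz): a 440 Hz tone and Gaussian noise.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(48000)]
noisy = mix_with_noise(clean, noise, snr_db=10.0, rng=rng)
```

Random offsets only reshuffle the noise you already have. With a small noise pool the model still sees the same underlying sounds over and over, which is why a larger pool of unique noise matters.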

Example 2

You could use computer graphics to synthesize car pictures, but again this risks overfitting the model to the particular set of computer-generated images.