A key part of deploying AI is continuously monitoring its performance in the real world. In the previous post we discussed the problem of data drift, which occurs when the input that the model sees moves away from the input that it was trained on, degrading its performance.
This post explores a related risk: overtraining on the validation set. It arises when data is scarce and expensive while training is comparatively cheap, as in the medical context, where the temptation is to repeatedly make training decisions against the same validation set. In some sense, the problem occurs because training consumes more novelty than the dataset brings in.
This post aims to give you some intuition about what causes this problem, how to detect it, and how to mitigate it. The next post will contain a proposal for detecting this problem with limited visibility (as in the medical device context).
The Generalization Gap
Any introductory ML course will present a chart illustrating the performance of a model on a training set and a test set as training progresses. Performance on the training set is a proxy for how much further training is possible. Performance on the test set is a proxy for expected performance on novel data during deployment, which is the most important metric for any ML product. This graph has two key takeaways: first, model performance on novel data improves only to a point, after which it gets worse (due to overfitting). Second, there is a gap between performance on the training and test data, referred to as the generalization gap.
Basic training performance graph for an ML model.
Good ML courses will point out that you always train a model multiple times (for hyperparameter tuning, ablation studies, etc.), and that using the test set to make those decisions will cause you to overfit to the test set as well. This makes it a poor proxy for performance on novel data, leaving you with no reliable estimate of how the model will behave once deployed. To mitigate this, they introduce a third set of data, the validation set, which is used to make decisions about model structure and hyperparameters.
Great ML courses will ask you to think about why the error on the validation set is expected to be lower than the error on the test set, as shown above. This happens because you use the validation set error to make decisions about hyperparameters and model structure. Since you make choices based on what performs best on the validation set, some choices will improve performance on the validation set but not the test set, creating this gap.
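To make this concrete, here is a minimal toy simulation of my own (not taken from any course): fifty candidate models with identical true accuracy are scored on a validation set, and the best-looking one is selected. The selection step systematically inflates that candidate's validation score, while its test score is not affected. All the numbers are illustrative assumptions.

```python
# Toy simulation of selection bias on the validation set.
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_candidates = 200, 200, 50

# Assume every candidate model has the same true accuracy of 0.70;
# candidates differ only through noise in their measured scores.
true_accuracy = 0.70
val_scores = rng.binomial(n_val, true_accuracy, size=n_candidates) / n_val
test_scores = rng.binomial(n_test, true_accuracy, size=n_candidates) / n_test

best = np.argmax(val_scores)  # pick the candidate that looks best on validation
print(f"validation accuracy of chosen model: {val_scores[best]:.3f}")
print(f"test accuracy of the same model:     {test_scores[best]:.3f}")
# Across seeds, the chosen model's validation score sits above 0.70 (a
# winner's-curse effect), while its test score hovers around 0.70.
# That systematic difference is the gap discussed above.
```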
This prompts the obvious follow-up question: what factors control the size of this gap?
Novelty Budget
The key factor controlling the size of this gap is the size of the validation and test sets. The simplest possible solution is, therefore, to increase them. The larger these sets, the better they both approximate the underlying distribution, and the fewer decisions you could make that would improve performance on the validation set but not the test set.
However, that’s not always possible. In practice, data (especially medical data) is scarce and expensive to acquire. It must be used judiciously to maximize your return on investment. Thus, a novelty budget.
The concept behind this is simple: treat every new data point as a deposit, and every decision you make about the model as a withdrawal. The more decisions you make, the more data you need to acquire to stay in the black. Thinking this way reframes hyperparameter optimization from a purely computational problem to a business one. When planning the cost of new features or retraining, you should also include the cost of accumulating enough new data to account for the training decisions you make.
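As an illustrative sketch only, the bookkeeping could look something like the following. The class name and the per-sample and per-decision values are made-up assumptions; in practice you would calibrate them to your own project.

```python
# Hypothetical sketch of the deposit/withdrawal bookkeeping described above.
class NoveltyBudget:
    def __init__(self, credit_per_sample: float = 1.0, cost_per_decision: float = 50.0):
        # Both rates are placeholder assumptions, not prescribed values.
        self.credit_per_sample = credit_per_sample
        self.cost_per_decision = cost_per_decision
        self.balance = 0.0

    def deposit(self, n_new_samples: int) -> None:
        """New labelled data adds novelty to the budget."""
        self.balance += n_new_samples * self.credit_per_sample

    def withdraw(self, n_decisions: int = 1) -> None:
        """Each decision made against the validation set spends novelty."""
        self.balance -= n_decisions * self.cost_per_decision

    def in_the_black(self) -> bool:
        return self.balance >= 0.0


budget = NoveltyBudget()
budget.deposit(n_new_samples=500)   # e.g. 500 newly acquired labelled cases
budget.withdraw(n_decisions=3)      # e.g. three hyperparameter choices
print(budget.balance, budget.in_the_black())
```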
The exact amount of data needed depends on the complexity of the data-generating process being modeled, the complexity of the models you are considering, and the rate at which the process changes over time. Needless to say, all of these are impossible to estimate for any practical problem. Instead of a hard-and-fast rule about how much each data point or decision is worth, we can turn our initial problem into a solution and use the gap between validation and test set performance as a proxy for the novelty budget. The larger the gap, the more you have overfit the validation set, and the more data you need to acquire before making any further decisions.
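A hedged sketch of using that proxy: assuming you log validation and test error after each evaluation, a simple check like the one below could gate further decisions. The threshold is an arbitrary placeholder, not a recommended value.

```python
# Illustrative gate: is the validation/test gap small enough to keep tuning?
def need_more_data(val_error: float, test_error: float, max_gap: float = 0.02) -> bool:
    """Flag when the gap suggests the validation set's novelty is spent."""
    gap = test_error - val_error  # validation error is the optimistically biased one
    return gap > max_gap

print(need_more_data(val_error=0.08, test_error=0.11))  # True: acquire data first
```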
The policy recommendation is simple: track the gap between test and validation performance. Don't use that gap to make decisions about training, but use it to make decisions about acquiring more data. And when you bring in new data, be sure to practice good data hygiene by distributing it randomly into the training, validation, and test sets. This ensures that all three sets remain representative of the underlying distribution and that no bias is induced by the data acquisition process.
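A minimal sketch of that hygiene step, assuming the new samples arrive as a simple list and a 70/15/15 split (both assumptions for illustration):

```python
# Shuffle newly acquired samples and split them randomly across the three sets.
import numpy as np

def distribute_new_data(new_samples, fractions=(0.70, 0.15, 0.15), seed=0):
    rng = np.random.default_rng(seed)
    new_samples = list(new_samples)
    rng.shuffle(new_samples)            # random assignment avoids acquisition bias
    n = len(new_samples)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = new_samples[:n_train]
    val = new_samples[n_train:n_train + n_val]
    test = new_samples[n_train + n_val:]
    return train, val, test

train_add, val_add, test_add = distribute_new_data(range(100))
print(len(train_add), len(val_add), len(test_add))  # 70 15 15
```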
In the final post in this series, I will propose a method to check for both data drift and overtraining on the validation set in the medical context, where you don't have access to the data that your model sees.