A key part of deploying AI is continuously monitoring its performance in the real world. This is necessary because the inputs a model sees in production tend to drift away from the data it was trained on, degrading its performance. This is known as data drift, and it is a well-understood problem with established solutions for the general case. However, those solutions generally rely on being able to observe the data as it comes in.
Recently, I was speaking with some friends at CareCam, who presented me with an interesting challenge: doing this from inside a hospital’s air-gapped secure network. CareCam uses commodity hardware and bespoke AI to perform motion studies on patients to diagnose medical problems. Their flagship product is FDA-approved to perform gait analysis with AI, producing results that would normally require an expensive motion-capture studio. As with any AI product, it is important for them to monitor its ongoing performance.
However, there’s a catch: when deploying AI in healthcare settings, all data must remain solely within the hospital network. Technicians are not even permitted to access raw user data; at most, they may view anonymized text logs and copy down whatever they need onto a sheet of paper. This restriction on visibility makes it very difficult to monitor the AI’s ongoing performance.
This post explains the problem in technical detail and outlines my proposed solution. It assumes some familiarity with basic ML concepts: train/test splits, data distributions, and generalization. Later posts in this series will introduce the related problem of data novelty budgets.
The Proposed Solution: Verifier Models
The solution I proposed to CareCam is to track the outputs generated by the certified, FDA-approved AI model and compare them to the outputs generated by a set of non-certified verifier models trained on the same data but with different hyperparameters. During clinical use, the inputs and outputs of the certified model are temporarily held in memory; after the session is complete, the certified model’s output is compared to the outputs of the verifier models. The device saves only statistics about the divergence between the certified model and the verifiers, not the data itself, which keeps it in line with the hospital’s data security policies. These statistics can be copied down onto a sheet of paper and tracked by a technician.
Verifier models are trained on the same data as the certified model, but with different hyperparameters.
This is related to ensembling, but without any merging or combining of the models’ outputs. This is an important distinction because, to maintain regulatory compliance, the certified model must be the only one used for diagnosis.
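To make the comparison step concrete, here is a minimal sketch of the post-session statistic, assuming the predictions are plain NumPy arrays; the function name, the array layout, and the choice of mean squared difference as the divergence measure are illustrative assumptions rather than CareCam’s actual implementation.

```python
import numpy as np

def divergence_stats(certified_outputs, verifier_outputs):
    """Summarize how far the verifier models' predictions are from the
    certified model's predictions, without retaining any raw data.

    certified_outputs: array of shape (n_frames, n_outputs)
    verifier_outputs:  array of shape (n_verifiers, n_frames, n_outputs)
    """
    # Per-verifier mean squared difference from the certified model.
    sq_diff = (verifier_outputs - certified_outputs[None, ...]) ** 2
    per_verifier = sq_diff.mean(axis=(1, 2))

    # Only these aggregate numbers survive the session; the raw inputs
    # and predictions are discarded from memory afterwards.
    return {
        "mean_divergence": float(per_verifier.mean()),
        "max_divergence": float(per_verifier.max()),
        "n_frames": int(certified_outputs.shape[0]),
    }
```

After each session, only the returned dictionary is written to the anonymized text log; that is what a technician would copy onto paper.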
Why Does This Work?
This uses a fascinating insight from RL theory: the distinction between aleatoric and epistemic uncertainty. Aleatoric uncertainty is the inherent randomness in the generating process (camera noise, the outcome of a coin flip, and so on). Epistemic uncertainty is our uncertainty in modeling that process (for instance, not knowing the coin’s bias). Fundamentally, epistemic uncertainty measures how much better a model could do with potentially infinite data and compute, while aleatoric uncertainty is the limit on what even a “perfect” model could do.
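A common way to separate the two with an ensemble (roughly following the deep-ensembles literature) is to treat the spread between the models’ mean predictions as epistemic uncertainty and the average of their predicted noise as aleatoric uncertainty. The sketch below assumes each model predicts both a mean and a noise variance; that interface is an assumption made for illustration.

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """Split an ensemble's predictive uncertainty for a single input.

    means:     (n_models,) each model's predicted mean
    variances: (n_models,) each model's predicted noise variance
    """
    epistemic = float(np.var(means))       # disagreement between models
    aleatoric = float(np.mean(variances))  # average predicted inherent noise
    return epistemic, aleatoric, epistemic + aleatoric  # ~ total variance
```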
In RL, it is common to use ensembles of models to measure the epistemic uncertainty of the value function (among other uses). The intuition is that “completely” trained models will make similar predictions on evaluation datapoints that lie close to the training data, so the agreement between different models is a measure of
- how well the training has converged, and
- how close the evaluation datapoint is to the support of the training dataset.
In this case, the former is a proxy for overfitting, and the latter measures data drift. We assume that training is “complete” (that is, further training won’t change predictions much) and compare our predictions: if the certified model’s output diverges from the verifier models’ outputs, we can blame it on either overfitting or data drift. And that’s it!
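A toy demonstration of the second point: the “verifier models” here are just polynomial fits of different degrees to the same noisy data, standing in for differently-configured networks, and the exact numbers are not meaningful beyond showing that agreement breaks down far from the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Training data": a noisy sine wave on [0, 1].
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = np.sin(2 * np.pi * x_train) + 0.05 * rng.normal(size=x_train.shape)

# Same data, different hyperparameters (here, polynomial degree).
models = [np.polyfit(x_train, y_train, deg) for deg in (3, 5, 7, 9)]

def disagreement(x):
    """Standard deviation of the models' predictions at one point."""
    return float(np.std([np.polyval(coeffs, x) for coeffs in models]))

print("on-support  (x = 0.5):", disagreement(0.5))  # small: models agree
print("off-support (x = 2.0):", disagreement(2.0))  # large: models diverge
```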
The last piece of the puzzle is to decide how much divergence is enough to trigger an alert. There’s no theoretically derived threshold for how much divergence is tolerable; it depends on the risk assessment submitted during the medical device certification. Estimating it systematically would require long-term tracking of these divergences and their correlation with changes in diagnosis, balanced against business considerations such as the cost of collecting new data and recertifying the model.
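A minimal alerting rule could look like the sketch below; the threshold, the rolling window size, and the class name are placeholder assumptions that would ultimately have to come out of the device’s risk assessment and long-term tracking.

```python
from collections import deque

class DriftMonitor:
    """Track per-session divergence statistics and flag sustained increases."""

    def __init__(self, threshold=0.05, window=20):
        self.threshold = threshold           # placeholder; set from the risk assessment
        self.history = deque(maxlen=window)  # recent mean-divergence values

    def record_session(self, mean_divergence):
        """Return True when the rolling mean over a full window exceeds the threshold."""
        self.history.append(mean_divergence)
        rolling_mean = sum(self.history) / len(self.history)
        return len(self.history) == self.history.maxlen and rolling_mean > self.threshold
```

Alerting on a rolling mean rather than a single session keeps one noisy recording from triggering an investigation.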
The Catch
This is a simple solution, but it has a few caveats:
- It can’t reliably differentiate between overfitting and data drift, because both will cause the models to diverge.
  - The remedy for each is different: overfitting requires regularizing the model or rolling back to an older state, while data drift requires gathering new data and retraining the model.
  - This is not a problem in practice, as long as the divergence is only used to trigger an alert for further investigation.
- It assumes that the underlying process we are fitting to is deterministic (i.e., its aleatoric uncertainty is low), which is reasonable here.
- It assumes that the models are trained to convergence, but this is occasionally not the case.
- It assumes that the verifier models are trained on similar data as the certified model, which may be expensive.
- It assumes that the utilization of the device is low enough that the verifier models can be run in the background without affecting the user experience.
All these are surmountable in practice, but they do require careful consideration and planning.
Conclusion
This series of posts has introduced the problem of data drift, how it can be detected, and a proposed way to detect it inside air-gapped medical devices.
The solution presented here is very simple to implement. It doesn’t require any data to leave the hospital network, it needs almost no human intervention beyond a technician copying down a few numbers, and it doesn’t impose much additional regulatory burden. In devices that are used intermittently (as in most clinical settings), the comparison predictions can be lazily computed during downtime, so it doesn’t even require additional compute resources.
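As a rough sketch of that lazy evaluation (the queue, the idle check, and the logging callback are all hypothetical), finished sessions are buffered in memory and the verifier passes are drained whenever the device is idle:

```python
import queue
import numpy as np

pending = queue.Queue()  # held in memory only; never written to disk

def end_of_session(inputs, certified_outputs):
    """Called when a clinical session finishes; defers verifier inference."""
    pending.put((inputs, certified_outputs))

def drain_when_idle(verifiers, device_is_idle, log_line):
    """Run verifier models during downtime, keeping only divergence statistics."""
    while device_is_idle() and not pending.empty():
        inputs, certified = pending.get()
        verifier_outputs = np.stack([model(inputs) for model in verifiers])
        divergence = float(((verifier_outputs - certified) ** 2).mean())
        log_line(f"session_divergence={divergence:.4f}")
        # Raw inputs and predictions go out of scope here and are never persisted.
```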
With all these advantages, it is a very attractive solution for small medical device startups like CareCam and Ocellivision. It allows us to monitor the performance of our AI algorithms without violating any data security policies, and it provides a simple way to detect data drift and overfitting.