A key part of deploying AI is continuously monitoring its performance in the real world. This is necessary because the input a deployed model sees tends to drift away from the data it was trained on, degrading its performance. This is known as data drift, a well-understood problem for which popular solutions already exist in the general case. However, these solutions generally rely on being able to observe the data as it comes in.

Recently, I was speaking with some friends at CareCam, who presented me with an interesting challenge: doing this from inside a hospital’s air-gapped secure network. CareCam uses commodity hardware and bespoke AI to perform motion studies on people to diagnose medical problems. Their flagship product is FDA-approved to perform gait analysis with AI, providing results that would normally take an expensive motion-capture studio to generate. As with any AI product, it is important for them to monitor its ongoing performance.

However, there’s a catch: when we are deploying AI in healthcare settings, all data must remain solely within the hospital network. Technicians are not even permitted to access any raw user data. At most, they may only view anonymized text logs and copy down whatever they need onto a sheet of paper. This restriction on visibility makes it very difficult to monitor the ongoing performance of AI. (With the appropriate approvals and infrastructure this restriction may be lifted, but that is well outside the ability of small medical device startups.)

As a cofounder of an AI-enabled medical device startup myself, I am contending with this problem too. This post digests my research and understanding of the topic into some technical detail, suitable for someone with an academic background in ML applying it in the real world. The next posts in this series will introduce the related problem of data novelty budgets and my proposed solution.

Data Drift, Repeated Training, and the Generalization Gap

Any introductory ML course will present a chart illustrating the performance of a model on a training set and a test set as training progresses. Performance on the training set is a proxy for how much further training is possible. Performance on the test set is a proxy for expected performance on novel data during deployment, which is the actual metric we care about. This graph has two key takeaways: first, that model performance on novel data improves only to a point, after which it gets worse (through overfitting). Second, there is a gap between the performance on the training and test data, referred to as the generalization gap.

Basic training performance graph for an ML model.
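To make those two takeaways concrete, here is a minimal, self-contained sketch (a toy overparameterized regression trained by gradient descent; nothing in it comes from a real product) that tracks the training loss, the test loss, and the gap between them as training progresses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task with few noisy samples and many features, so the model can overfit.
n_train, n_test, n_features = 40, 200, 100
w_true = rng.normal(size=n_features) / np.sqrt(n_features)
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

w = np.zeros(n_features)
lr = 0.01
for epoch in range(1, 2001):
    w -= lr * X_train.T @ (X_train @ w - y_train) / n_train  # one gradient descent step
    if epoch % 400 == 0:
        train_mse = float(np.mean((X_train @ w - y_train) ** 2))
        test_mse = float(np.mean((X_test @ w - y_test) ** 2))
        print(f"epoch {epoch:4d}  train={train_mse:.3f}  test={test_mse:.3f}  "
              f"gap={test_mse - train_mse:.3f}")  # the generalization gap
```

On this toy problem the training loss keeps shrinking toward zero while the test loss levels off, so the printed gap grows as training continues; the same bookkeeping with a real model and dataset produces the chart above.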

Good ML courses will introduce an additional validation set, but we’ll conceptually wrestle with that in the next article.

This is a perfectly fine view for an ML course (or academia and education in general) because the dataset is fixed and the objective is to maximize performance on a preassigned test set. In practice, this doesn’t always work because the data distribution may change over time. This process is called data drift.

Data Drift

Data drift occurs as the generative process producing the data changes over time, moving the generated samples away from the data seen in the training set. As this happens, the model must extrapolate further beyond the support of its training distribution, causing performance to degrade. EvidentlyAI, who maintain the eponymous open-source AI monitoring tool, have an excellent breakdown of the kinds of data drift.
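In the notation of a typical ML course (this formalization is mine, not taken from the EvidentlyAI article): writing p_train for the distribution the training set was drawn from and p_t for the distribution generating data at deployment time t, the two best-known kinds of drift differ in which factor moves:

```latex
% Covariate shift: the input distribution moves, the labeling rule does not
p_t(x) \neq p_{\text{train}}(x), \qquad p_t(y \mid x) = p_{\text{train}}(y \mid x)

% Concept drift: the relationship between input and output itself changes
p_t(y \mid x) \neq p_{\text{train}}(y \mid x)
```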

A vision-based healthcare AI tool like CareCam accepts videos from commodity devices in hospital settings, and uses them to make a series of predictions. For tools like these, here are some possible sources of data drift:

  1. Changes in camera chip or optics used.
  2. Changes in upstream video processing algorithms, such as pose estimators.
  3. Moving into a veterans’ hospital after only working in community hospitals.
  4. Deploying in countries with different electrical (mains) frequencies.
  5. Changes in software dependencies, even seemingly unrelated ones.
  6. Changes in insurance and reimbursement procedures.
  7. Wear and tear of mechanical components, especially if the model is only trained on data from new equipment.

The easily anticipated changes are those that directly affect the video sent to the product, like changes in hardware and lighting. These can be mitigated by pinning software dependencies, using reproducible build tools, and certifying specific hardware revisions (a minimal version-fingerprinting check is sketched after the examples below). Much more difficult to anticipate are changes that indirectly influence who comes to a hospital and how a hospital uses the product for diagnosis. Here are some examples of how this may work:

Veterans’ hospitals in the US serve military veterans, who are much more likely to have amputations and trauma-related mobility issues than the general population. If your model is trained exclusively on data from community hospitals, the rate at which these cases appear will be much higher than the model expects. A similar problem can arise with patients of different socioeconomic backgrounds as insurance and reimbursement procedures change.

Electrical frequency is a well-known problem in the video processing community. Most cameras suppress signals at the local mains frequency to prevent flicker from cheap lights. If you move to a country with a different mains frequency and the camera configuration is not updated accordingly, the model may see a different distribution of data.
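For the first, directly observable class of changes, one low-effort check that plays well with the text-only-logs constraint is to fingerprint the software environment at startup and emit it as a single log line. This is only a sketch using the Python standard library; the function name and log format are my own, not anything CareCam ships:

```python
import hashlib
import importlib.metadata
import json
import platform

def environment_fingerprint() -> str:
    """Hash the installed package versions and platform details into one short ID."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in importlib.metadata.distributions()
    )
    payload = json.dumps({"platform": platform.platform(), "packages": packages})
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Emit the fingerprint into the anonymized text log at startup; if it changes between
# runs, some dependency or OS component changed underneath the model.
print(f"env_fingerprint={environment_fingerprint()}")
```

Comparing this one value across runs (or against the fingerprint recorded at certification time) is enough to flag that a dependency changed, without exposing any patient data.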

Regulator’s view

There is some draft guidance from the FDA that deals with this topic. Depending on when you read this section, it may already be outdated.

Medical device regulators used to treat AI as a fancy measuring tool or blood test. You generally negotiate with the regulator (citing precedent where possible) the size and composition of your dataset, the details of your training, and the performance of your model on your validation set. Once the resultant model is certified, only that exact model may be used for diagnoses. Updated models that incorporate new data need to be certified all over again at great expense. This system does not lend itself well to regularly incorporating new data to mitigate data drift.

Under the new recommendations, the FDA regulates AI-enabled device software functions (AI-DSFs) as a process that produces models, not as individual models. The FDA will certify the data acquisition process, training and testing pipeline, and deployment details; and the resultant models are permitted medical devices. It is likely that this will be the new standard for AI-DSFs as early as 2026.

The Standard Solution: Data Pipelines

The standard solution to mitigating data drift is simple: continuously sample data from your algorithm and use it for training. This is called a data pipeline.

In its simplest form, your data pipeline would sample model inputs and outputs from its interaction with end users and store them in a data warehouse (or lake). You would either use metadata (some sort of correctness signal or feedback from the end user) or pay for human-generated labels to establish the true answer, and then include it in your next training run. It is not uncommon to automate data inclusion, training, and deployment end to end; this is sometimes called MLOps.
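As a sketch of that sampling step (the directory layout, field names, and feedback signal below are hypothetical, not anyone's actual schema), the deployed system could persist each interaction as a small record for later labeling and retraining:

```python
import json
import time
import uuid
from pathlib import Path
from typing import Optional

# Hypothetical landing zone; in a real pipeline this would be object storage feeding
# a data warehouse or lake rather than the local filesystem.
LANDING_DIR = Path("datalake/landing")

def log_prediction(features: dict, prediction: dict, feedback: Optional[dict] = None) -> None:
    """Persist one model interaction so it can later be labeled and folded into
    the next training run."""
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "features": features,
        "prediction": prediction,
        "feedback": feedback,  # correctness signal from the end user, if any
    }
    (LANDING_DIR / f"{record['id']}.json").write_text(json.dumps(record))

# Example: one gait-style prediction with a clinician's agreement as weak feedback.
log_prediction({"cadence_spm": 102.5}, {"gait_score": 0.81}, {"clinician_agrees": True})
```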

Most data pipelines are built on the assumption that data is freely transferable from the deployed AI system to your monitoring and training infrastructure. In contrast, medical data is encumbered by regulation, contract, or both. The only realistic option for a data pipeline is to run it on hardware located entirely within the hospital data center, but that is prohibitively expensive and complex for a startup.

Instead, there is a simpler question suitable for medical device startups like CareCam and Ocellivision: can we easily detect that we are encountering data different from our training set? That minimal signal alone would be enough to alert us that something has changed so we can investigate, gather more data, and manually re-train.
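To make that question concrete, here is how a minimal check often looks in the general case: a two-sample test on some per-sample summary statistic, comparing a reference sample shipped with the model against what the device sees in the field, with only a one-line anonymized log as output. The sketch uses SciPy's two-sample Kolmogorov-Smirnov test; the statistic, names, threshold, and log format are illustrative only:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_check(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> str:
    """Compare a per-sample summary statistic (e.g. mean frame brightness, or a model
    confidence score) gathered on-device against the same statistic computed on the
    training set. Only the returned one-line message would ever leave the device,
    as an anonymized text log."""
    result = ks_2samp(reference, live)
    status = "DRIFT_SUSPECTED" if result.pvalue < alpha else "ok"
    return (f"drift_check status={status} ks_stat={result.statistic:.3f} "
            f"p={result.pvalue:.4f} n_live={len(live)}")

# Illustrative usage: `reference` ships with the model, `live` accumulates on-device.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)
live = rng.normal(0.5, 1.0, size=200)  # a shifted distribution, as drift might produce
print(drift_check(reference, live))
```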