Drift Monitoring

This article describes how ModelOp Center enables ongoing Drift Monitoring.

Introduction

Monitoring incoming data for statistical drift is necessary to verify that assumptions made during model development remain valid in a production setting. For instance, a data scientist may assume that the values of a particular feature are normally distributed, or the encoding of a categorical variable may have been chosen with a particular multinomial distribution in mind. Tests should be run routinely against batches of live data and compared against the distribution of the training data to ensure that these assumptions still hold; if the tests fail, appropriate alerts are raised for the data scientist or ModelOps engineer to investigate.

Details

Because the same data set may serve several models, you can write one drift detection model and associate it with several models. This association is made during the Model Lifecycle process. The drift model compares the training data of the associated models to a given batch of data. The following is a simple example:

 

import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
from scipy.stats import binom_test

# modelop.init
def begin():
    """A function to read training data and save it, along with its numerical features, globally."""
    global train, numerical_features
    train = pd.read_csv('training_data.csv')
    numerical_features = train.select_dtypes(['int64', 'float64']).columns

# modelop.metrics
def metrics(data):
    """A function to compute KS p-values on input (sample) data as compared to training (baseline) data."""
    ks_tests = [ks_2samp(train.loc[:, feat], data.loc[:, feat])
                for feat in numerical_features]
    pvalues = [x[1] for x in ks_tests]
    ks_pvalues = dict(zip(numerical_features, pvalues))
    yield dict(pvalues=ks_pvalues)

This drift model executes a two-sample Kolmogorov-Smirnov test on each numerical feature, comparing the training data to the incoming batch, and reports the p-values. If the p-values are sufficiently large (above a chosen significance threshold such as 0.01 or 0.05), you can assume that the two samples come from similar distributions. If the p-values are small, you can assume that the samples differ and generate an alert.
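For example, the reported p-values can be reduced to a simple pass/fail signal. The following is a minimal sketch of such a check; the 0.05 threshold, the flag_drift helper, and the output keys are illustrative choices rather than ModelOp Center conventions.

PVALUE_THRESHOLD = 0.05

def flag_drift(ks_pvalues, threshold=PVALUE_THRESHOLD):
    """Return the features whose KS p-value falls below the chosen threshold."""
    drifted = {feat: p for feat, p in ks_pvalues.items() if p < threshold}
    return dict(drift_detected=bool(drifted), drifted_features=drifted)

# Hypothetical usage:
# flag_drift({'age': 0.62, 'income': 0.003})
# -> {'drift_detected': True, 'drifted_features': {'income': 0.003}}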

If the training data is too large to fit in memory, you can instead compute summary statistics about the training data, save them as, e.g., a pickle file, and read them in during the init function of the drift model. The metrics function can then apply other statistical tests to compare those statistics to the statistics of the incoming batch.
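The following is a minimal sketch of that approach, assuming the training statistics were precomputed offline with pandas (for example, train.describe().to_pickle('training_stats.pkl')); the file name and the mean-shift comparison are illustrative choices, not a prescribed ModelOp Center pattern.

import pandas as pd

# modelop.init
def begin():
    """Load precomputed training statistics instead of the full training set."""
    global train_stats
    train_stats = pd.read_pickle('training_stats.pkl')  # output of train.describe()

# modelop.metrics
def metrics(data):
    """Compare the incoming batch's summary statistics to the training statistics."""
    batch_stats = data.describe()
    # Shift of each feature's mean from the training mean, in training standard deviations.
    mean_shift = {
        feat: float(abs(batch_stats.loc['mean', feat] - train_stats.loc['mean', feat])
                    / train_stats.loc['std', feat])
        for feat in train_stats.columns
        if feat in batch_stats.columns and train_stats.loc['std', feat] > 0
    }
    yield dict(mean_shift=mean_shift)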

Next Article: Model Governance: Standard Model Definition >