Data & Concept Drift Monitoring

This article describes how ModelOp Center enables ongoing Data Drift and Concept Drift Monitoring.

Table of Contents

Introduction

Monitoring data - input and output (concept) - for drift is necessary to track whether assumptions made during model development are still valid in a production setting. For instance, a data scientist may assume that the values of a particular feature are normally distributed or the choice of encoding of a certain categorical variable may have been made with a certain multinomial distribution in mind. Tests should be run routinely against batches of live data and compared against the distribution of the training/reference data to ensure that these assumptions are still valid; if the tests fail, appropriate alerts should be raised for the data scientist or ModelOps engineer to investigate.

ModelOp Center provides a number of Drift monitors out-of-the-box (OOTB) but also allows you to write your own drift monitor. The subsequent sections describe how to add a drift monitor - assuming an OOTB monitor - and the detailed makeup of a drift monitor for multiple types of models.

Adding Drift Monitors

As background on the terminology and concepts used below, please read the Monitoring Concepts section of the Model overview documentation.

To add drift monitoring to your business model, you will add an existing “Monitor” to a snapshot (deployable model) of the business model under consideration. Below are the steps to accomplish this. For tutorial purposes, these instructions use all out-of-the-box and publicly available content provided by ModelOp, focusing on the German Credit Model and its related assets.

Associate a Monitor to a Snapshot of a Business Model

  1. In MOC, navigate to the business model to be monitored. In our case, that’s the German Credit Model.

  2. Navigate to the specific snapshot of the business model. If no snapshots exist, create one.

    1. On the Monitors widget click on + Add

    2. Search for (or select) the Data Drift Monitor: Comprehensive Analysis from the list of OOTB monitors.

    3. Select a snapshot of the monitor. By default, a snapshot is created for each OOTB monitor. 

    4. On the Input Assets page, you’ll notice that two assets are required: A baseline data asset and a sample data asset. This is because a drift monitor compares a slice of production data (sample) to a reference data set (baseline). For our example, select df_baseline_scored.json as the Baseline Data Asset and df_sample_scored.json as the Sample Data Asset. Since these files are already assets of the business model, we can find them under Select Existing.

       

    5. On the Threshold page, click on ADD A THRESHOLD, then select the .dmn file data_drift_DMN.dmn. Since the file is already an asset of the business model, we can find it under Select Existing. If the business model does not have a .dmn asset, the user may upload on from a local directory during the monitor association process. More on thresholds and decision tables in the next section.

    6. The last step in adding a monitor is adding an optional schedule. To do so, click on ADD A SCHEDULE. The Schedule Name field is free-form. The Signal Name field is a dropdown. Choose a signal that corresponds to your ticketing system (Jira, ServiceNow). Lastly, set the frequency of the monitoring job. This can be done either by the wizard or by entering a cron expression.

    7. On the Review page click SAVE.

Define thresholds for your model

As mentioned in the Monitoring Concepts article, ModelOp Center uses decision tables to define the thresholds within which the model should operate for the given monitor.

  • The first step is to define these thresholds. For this tutorial, we will leverage the example data_drift_DMN.dmn decision table. Specifically, this decision table ensures that the credit_amount_ks_pvalue and installment_rate_js_distance metrics of the German Credit Model are within specification. credit_amount_ks_pvalue is the p-value returned by the Kolmogorov-Smirnov 2-sample test, for the feature credit_amount. If the p-value is sufficiently large (say, for example over 0.05), you can assume that the two samples are similar. If the p-value is small, you can assume that these samples are different and generate an alert.

  • The credit_amount_ks_pvalue and installment_rate_js_distance values can be accessed directly from the Monitoring Test Result by design. More metrics are produced OOTB by the drift monitor. We will discuss this in more detail later.

  • In our example, the .dmn file is already an asset of the business model and versioned/managed along with the source code in the same Github repo. This is considered best practice, as the decision tables are closely tied to the specific business model under consideration. However, it is not a requirement that the .dmn files are available as model assets ahead of time.

Monitoring Results and Notifications

Sample Standard Output of Performance Monitors

The output of the performance monitoring job can be viewed by clicking on the monitor from “Model Test Results” as shown in the previous section. In this section, you can view the results in a graphical format or in the raw format.

  1. Data Drift Model Test Results in a Graphical Format

     

  2. Data Drift Model Test Result as Raw JSON

    { "duration_months_es_pvalue": 0.7865, "credit_amount_es_pvalue": 0.4227, "installment_rate_es_pvalue": 0.4236, "present_residence_since_es_pvalue": 0.3442, "age_years_es_pvalue": 0.0179, "number_existing_credits_es_pvalue": 0.6696, "number_people_liable_es_pvalue": null, "number_existing_credits_js_distance": 0.1662, "number_people_liable_js_distance": 0.1564, "present_residence_since_js_distance": 0.0959, "installment_rate_js_distance": 0.0923, "purpose_js_distance": 0.089, "credit_amount_js_distance": 0.0658, "age_years_js_distance": 0.0623, "present_employment_since_js_distance": 0.0609, "duration_months_js_distance": 0.0557, "savings_account_js_distance": 0.0471, "gender_js_distance": 0.0471, "credit_history_js_distance": 0.0357, "property_js_distance": 0.0348, "telephone_js_distance": 0.0262, "job_js_distance": 0.0244, "foreign_worker_js_distance": 0.0181, "checking_status_js_distance": 0.016, "installment_plans_js_distance": 0.0158, "housing_js_distance": 0.0103, "debtors_guarantors_js_distance": 0.0047, "duration_months_kl_divergence": 0.0152, "credit_amount_kl_divergence": 0.0191, "installment_rate_kl_divergence": 0.0089, "present_residence_since_kl_divergence": 0.0107, "age_years_kl_divergence": 0.0172, "number_existing_credits_kl_divergence": 0.005, "number_people_liable_kl_divergence": 0.0013, "checking_status_kl_divergence": 0.001, "credit_history_kl_divergence": 0.0053, "purpose_kl_divergence": 0.0336, "savings_account_kl_divergence": 0.0088, "present_employment_since_kl_divergence": 0.0148, "debtors_guarantors_kl_divergence": 0.0001, "property_kl_divergence": 0.0049, "installment_plans_kl_divergence": 0.001, "housing_kl_divergence": 0.0004, "job_kl_divergence": 0.0025, "telephone_kl_divergence": 0.0028, "foreign_worker_kl_divergence": 0.0013, "gender_kl_divergence": 0.0087, "duration_months_ks_pvalue": 0.4721, "credit_amount_ks_pvalue": 0.5733, "installment_rate_ks_pvalue": 0.7833, "present_residence_since_ks_pvalue": 0.8076, "age_years_ks_pvalue": 0.2495, "number_existing_credits_ks_pvalue": 1.0, "number_people_liable_ks_pvalue": 1.0, "data_drift": [ { "test_name": "Epps-Singleton", "test_category": "data_drift", "test_type": "epps_singleton", "metric": "p_value", "test_id": "data_drift_epps_singleton_p_value", "values": { "duration_months": 0.7865, "credit_amount": 0.4227, "installment_rate": 0.4236, "present_residence_since": 0.3442, "age_years": 0.0179, "number_existing_credits": 0.6696, "number_people_liable": null } }, { "test_name": "Jensen-Shannon", "test_category": "data_drift", "test_type": "jensen_shannon", "metric": "distance", "test_id": "data_drift_jensen_shannon_distance", "values": { "number_existing_credits": 0.1662, "number_people_liable": 0.1564, "present_residence_since": 0.0959, "installment_rate": 0.0923, "purpose": 0.089, "credit_amount": 0.0658, "age_years": 0.0623, "present_employment_since": 0.0609, "duration_months": 0.0557, "savings_account": 0.0471, "gender": 0.0471, "credit_history": 0.0357, "property": 0.0348, "telephone": 0.0262, "job": 0.0244, "foreign_worker": 0.0181, "checking_status": 0.016, "installment_plans": 0.0158, "housing": 0.0103, "debtors_guarantors": 0.0047 } }, { "test_name": "Kullback-Leibler", "test_category": "data_drift", "test_type": "kullback_leibler", "metric": "divergence", "test_id": "data_drift_kullback_leibler_divergence", "values": { "duration_months": 0.0152, "credit_amount": 0.0191, "installment_rate": 0.0089, "present_residence_since": 0.0107, "age_years": 0.0172, "number_existing_credits": 0.005, "number_people_liable": 0.0013, "checking_status": 0.001, "credit_history": 0.0053, "purpose": 0.0336, "savings_account": 0.0088, "present_employment_since": 0.0148, "debtors_guarantors": 0.0001, "property": 0.0049, "installment_plans": 0.001, "housing": 0.0004, "job": 0.0025, "telephone": 0.0028, "foreign_worker": 0.0013, "gender": 0.0087 } }, { "test_name": "Kolmogorov-Smirnov", "test_category": "data_drift", "test_type": "kolmogorov_smirnov", "metric": "p_value", "test_id": "data_drift_kolmogorov_smirnov_p_value", "values": { "duration_months": 0.4721, "credit_amount": 0.5733, "installment_rate": 0.7833, "present_residence_since": 0.8076, "age_years": 0.2495, "number_existing_credits": 1.0, "number_people_liable": 1.0 } }, { "test_name": "Summary", "test_category": "data_drift", "test_type": "summary", "metric": "pandas_describe", "test_id": "data_drift_summary_pandas_describe", "values": { "numerical_comparisons": { "duration_months": { "baseline": { "count": 800.0, "mean": 20.74375, "std": 12.056939835017488, "min": 4.0, "25%": 12.0, "50%": 18.0, "75%": 24.0, "max": 72.0 }, "sample": { "count": 200.0, "mean": 21.54, "std": 12.075491188236855, "min": 6.0, "25%": 12.0, "50%": 18.0, "75%": 24.0, "max": 60.0 } }, "credit_amount": "TRUNCATED", "installment_rate": "TRUNCATED" "present_residence_since": "TRUNCATED", "age_years": "TRUNCATED", "number_existing_credits": "TRUNCATED", "number_people_liable": "TRUNCATED", }, "categorical_comparisons": { "checking_status": { "baseline": { "count": 800, "unique": 4, "top": "A14", "freq": 313 }, "sample": { "count": 200, "unique": 4, "top": "A14", "freq": 81 } }, "credit_history": "TRUNCATED", "purpose": "TRUNCATED", "savings_account": "TRUNCATED", "present_employment_since": "TRUNCATED", "debtors_guarantors": "TRUNCATED", "property": "TRUNCATED", "installment_plans": "TRUNCATED", "housing": "TRUNCATED", "job": "TRUNCATED", "telephone": "TRUNCATED", "foreign_worker": "TRUNCATED", "gender": "TRUNCATED" } } } ] }

Drift Monitors Details

Choosing a drift monitor for a business model depends in practice on the particular model in consideration. For example, a binary classification model can be best monitored for concept drift by running a Summary test (basic statistics), instead of a 2-sample test, since there are only two possible outcomes, and thus a very small range for the random variable. In addition, feature types (numerical vs categorical - also referred to in MOC terminology as dataClass) play an important role in choosing the right monitor. Some monitors, such as Kullback-Liebler (KL) accommodate both numerical and categorical data, whereas others (usually 2-sample tests such as Kolmogorov-Smirnov or Epps-Singleton) work only on numerical features.

This being said, model-type are feature dataClass are the only abstractions to consider when choosing a drift monitor. Out-of-the-box monitoring takes care of the rest.

Out-of-the-Box Monitors

The following is the list of OOTB monitors that are currently implemented, as well as their source code from the SciPy library:

  • Epps-Singleton 2-Sample Test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.epps_singleton_2samp.html

Test to see if two samples have the same underlying distribution. Returns a p-value. Samples do not have to be continuous.

If the output of the Epps-Singleton test on two distributions is a p-value that is less than a certain threshold (i.e. 0.05), then we can reject the null hypothesis that the two samples come from a similar underlying distribution. When applied to a feature (or a target variable) of a dataset, we can determine if there is drift between a baseline and a sample dataset in that feature (or target variable).

Remarks:

  1. Null values in the samples will cause the Epps-Singleton test to fail. As such, null values are dropped when calculating the Epps-Singleton test.

  2. The Epps-Singleton test will fail when there are less than five values in each sample. In such cases, the Epps-Singleton test will return a null metric

 

  • Kolmogorov-Smirnov 2-Sample Test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html#scipy.stats.ks_2samp

Test to see the goodness-of-fit of the underlying distributions of two samples. Returns a p-value. Only works on continuous distributions of data.

If the output of the Kolmogorov-Smirnov test on two distributions is a p-value that is less than a certain threshold (i.e. 0.05), then we can reject the null hypothesis that the two samples have an identical underlying distribution. When applied to a feature (or a target variable) of a dataset, we can determine if there is drift between a baseline and a sample dataset in that feature (or target variable).

 

  • Jensen-Shannon Distance

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html

Computes the Jensen-Shannon distance between two distributions, which is the square root of the Jensen-Shannon divergence metric.

The output of the Jensen-Shannon distance calculation is not a p-value, like the Epps-Singleton or the Kolmogorov-Smirnov tests, but a distance. As such, there is not a one-case-fits-all or a universally accepted value that shows that the two distributions are significantly different. However, it is useful over time to keep track of how the distances of two distributions might change.

Remarks:

  1. Null values in the samples will cause the Jensen-Shannon distance to fail. As such, null values are dropped when calculating the Jensen-Shannon distance.

  2. Because the Jensen-Shannon distance attempts to fit a Gaussian KDE on the samples, an error occurs when there is little to no variance in the samples (i.e. all constant values). In such cases, the Jensen-Shannon distance will return a null metric.

 

  • Kullback-Leibler Divergence

Computes the Kullback-Leibler divergence metric (also called relative entropy) between two distributions. Computes by bucketing the samples, computing the element-wise Kullback-Leibler divergence metric, then sums each bucket for the final divergence metric over the samples. Because the Kullback-Leibler divergence is asymmetric, the order in which the samples are input into the calculation might output slightly differing results.

The output of the Kullback-Leibler divergence calculation is a not a p-value (like the Epps-Singleton and Kolmogorov-Smirnov tests), nor is it a distance (like the Jensen-Shannon distance), but rather a metric to inform how divergent two distributions might be. Like the Jensen-Shannon distance, there is no one-case-fits-all or universally accepted value to determine if two distributions are significantly different, but the Kullback-Leibler divergence provides one more option in detecting possible drift.

Remarks:

  1. It is possible that the Kullback-Leibler Divergence will return a value of Inf (when the support of one sample is not contained within the support of the other sample, or when one sample distribution has a much “wider tail” than the other). In such cases, the order of the samples will be reversed and the Kullback-Leibler Divergence will be recalculated (with an appropriate logger.warning raised). However, in the case that even the reversed order of samples returns Inf, the Kullback-Leibler Divergence will return a null metric.

 

Model Assumptions


Business Models considered for drift monitoring have a couple of requirements:

  1. An extended schema asset for the input data.

  2. Input data contains at least one numerical column and/or one categorical column. The exact requirement depends on the specific monitor being used.

Model Execution

During execution, drift monitors execute the following:

  1. The init function extracts the extended input schema from job JSON.

  2. monitoring parameters are set based on the schema extracted previously. numerical_columns and categorical_columns are determined accordingly. In the case of concept drift monitoring, target_column (score column) and label_type (numerical vs. categorical) are determined at this step.

  3. The metrics function runs the appropriate drift monitoring test: Epps-SingletonJensen-ShannonKullback-LeiblerKolmogorov-Smirnov, or Pandas.describe(). When the drift monitor (data drift or concept drift) is a comprehensive monitor, all the tests above are performed.

  4. Test results are appended to the list of data_drift or concept_drift tests to be returned by the model, and key-value pairs are added to the top-level of the output dictionary.

For a deeper look at OOTB drift monitors, see the GitHub READMEs:

 

Next Article: Statistical Monitoring >