Data & Concept Drift Monitoring
This article describes how ModelOp Center enables ongoing Data Drift and Concept Drift Monitoring.
Introduction
Monitoring data - input and output (concept) - for drift is necessary to track whether assumptions made during model development are still valid in a production setting. For instance, a data scientist may assume that the values of a particular feature are normally distributed or the choice of encoding of a certain categorical variable may have been made with a certain multinomial distribution in mind. Tests should be run routinely against batches of live data and compared against the distribution of the training/reference data to ensure that these assumptions are still valid; if the tests fail, appropriate alerts should be raised for the data scientist or ModelOps engineer to investigate.
ModelOp Center provides a number of Drift monitors out-of-the-box (OOTB) but also allows you to write your own drift monitor. The subsequent sections describe how to add a drift monitor - assuming an OOTB monitor - and the detailed makeup of a drift monitor for multiple types of models.
Adding Drift Monitors
As background on the terminology and concepts used below, please read the Monitoring Concepts section of the Model overview documentation.
To add drift monitoring to your business model, you will add an existing “Monitor” to a snapshot (deployable model) of the business model under consideration. Below are the steps to accomplish this. For tutorial purposes, these instructions use all out-of-the-box and publicly available content provided by ModelOp, focusing on the German Credit Model and its related assets.
Associate a Monitor to a Snapshot of a Business Model
1. In MOC, navigate to the business model to be monitored. In our case, that’s the German Credit Model.
2. Navigate to the specific snapshot of the business model. If no snapshots exist, create one.
3. On the Monitors widget, click + Add.
4. Search for (or select) the Data Drift Monitor: Comprehensive Analysis from the list of OOTB monitors.
5. Select a snapshot of the monitor. By default, a snapshot is created for each OOTB monitor.
6. On the Input Assets page, you’ll notice that two assets are required: a baseline data asset and a sample data asset. This is because a drift monitor compares a slice of production data (sample) to a reference data set (baseline). For our example, select df_baseline_scored.json as the Baseline Data Asset and df_sample_scored.json as the Sample Data Asset. Since these files are already assets of the business model, we can find them under Select Existing.
7. On the Threshold page, click ADD A THRESHOLD, then select the .dmn file data_drift_DMN.dmn. Since the file is already an asset of the business model, we can find it under Select Existing. If the business model does not have a .dmn asset, the user may upload one from a local directory during the monitor association process. More on thresholds and decision tables in the next section.
8. The last step in adding a monitor is adding an optional schedule. To do so, click ADD A SCHEDULE. The Schedule Name field is free-form. The Signal Name field is a dropdown. Choose a signal that corresponds to your ticketing system (Jira, ServiceNow). Lastly, set the frequency of the monitoring job. This can be done either with the wizard or by entering a cron expression.
9. On the Review page, click SAVE.
Define thresholds for your model
As mentioned in the Monitoring Concepts article, ModelOp Center uses decision tables to define the thresholds within which the model should operate for the given monitor.
The first step is to define these thresholds. For this tutorial, we will leverage the example data_drift_DMN.dmn decision table. Specifically, this decision table ensures that the credit_amount_ks_pvalue and installment_rate_js_distance metrics of the German Credit Model are within specification.
credit_amount_ks_pvalue is the p-value returned by the Kolmogorov-Smirnov 2-sample test for the feature credit_amount. If the p-value is sufficiently large (for example, over 0.05), the test finds no significant evidence of drift, and the two samples can be treated as similar. If the p-value is small, you can assume that the samples are different and generate an alert.
The credit_amount_ks_pvalue and installment_rate_js_distance values can be accessed directly from the Monitoring Test Result by design. More metrics are produced OOTB by the drift monitor. We will discuss this in more detail later.
In our example, the .dmn file is already an asset of the business model and is versioned/managed along with the source code in the same GitHub repo. This is considered best practice, as the decision tables are closely tied to the specific business model under consideration. However, it is not a requirement that the .dmn files are available as model assets ahead of time.
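The kind of rule such a decision table encodes can be illustrated in plain Python. This is only a sketch: the actual rules live in data_drift_DMN.dmn and are not reproduced here, so both cutoff values and the helper name evaluate_thresholds are assumptions made for illustration.

```python
# Hypothetical cutoffs; the real rules are defined in data_drift_DMN.dmn.
KS_PVALUE_MIN = 0.05    # alert when the KS p-value drops below this value
JS_DISTANCE_MAX = 0.2   # illustrative only, not a recommended value


def evaluate_thresholds(test_result):
    """Return alert messages for metrics that fall outside specification."""
    alerts = []

    ks_pvalue = test_result.get("credit_amount_ks_pvalue")
    if ks_pvalue is not None and ks_pvalue < KS_PVALUE_MIN:
        alerts.append(
            f"credit_amount_ks_pvalue={ks_pvalue} is below {KS_PVALUE_MIN}"
        )

    js_distance = test_result.get("installment_rate_js_distance")
    if js_distance is not None and js_distance > JS_DISTANCE_MAX:
        alerts.append(
            f"installment_rate_js_distance={js_distance} is above {JS_DISTANCE_MAX}"
        )

    return alerts


# Metric values copied from the sample test result later in this article.
result = {
    "credit_amount_ks_pvalue": 0.5733,
    "installment_rate_js_distance": 0.0923,
}
alerts = evaluate_thresholds(result)
print(alerts or "all metrics within specification")
```

Because the check reads top-level keys only, it works on the flat part of the test result without walking the nested list of tests.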
Monitoring Results and Notifications
Sample Standard Output of Drift Monitors
The output of the drift monitoring job can be viewed by clicking on the monitor from “Model Test Results” as shown in the previous section. In this section, you can view the results in a graphical format or in the raw format.
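When the raw format is processed programmatically, one possible approach is to walk the data_drift list and collect the features whose p-values fall below a chosen significance level. The sketch below uses a short excerpt of the raw JSON result from this section; the 0.05 cutoff and the helper name flag_low_pvalues are assumptions, not part of ModelOp Center.

```python
import json

# Excerpt of the raw test result in this section (values copied verbatim).
raw = """
{
  "age_years_es_pvalue": 0.0179,
  "data_drift": [
    {
      "test_name": "Epps-Singleton",
      "metric": "p_value",
      "test_id": "data_drift_epps_singleton_p_value",
      "values": {"duration_months": 0.7865, "credit_amount": 0.4227,
                 "age_years": 0.0179, "number_people_liable": null}
    },
    {
      "test_name": "Jensen-Shannon",
      "metric": "distance",
      "test_id": "data_drift_jensen_shannon_distance",
      "values": {"number_existing_credits": 0.1662, "purpose": 0.089}
    }
  ]
}
"""

ALPHA = 0.05  # assumed significance level


def flag_low_pvalues(result, alpha=ALPHA):
    """Map each p-value test name to the features whose p-value is below alpha."""
    flagged = {}
    for test in result.get("data_drift", []):
        if test.get("metric") != "p_value":
            continue  # distances and divergences are not p-values
        for feature, p_value in test["values"].items():
            # A null p-value (None in Python) means no result was produced.
            if p_value is not None and p_value < alpha:
                flagged.setdefault(test["test_name"], []).append(feature)
    return flagged


flagged = flag_low_pvalues(json.loads(raw))
print(flagged)  # {'Epps-Singleton': ['age_years']}
```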
Data Drift Model Test Results in a Graphical Format
Data Drift Model Test Result as Raw JSON
{
  "duration_months_es_pvalue": 0.7865,
  "credit_amount_es_pvalue": 0.4227,
  "installment_rate_es_pvalue": 0.4236,
  "present_residence_since_es_pvalue": 0.3442,
  "age_years_es_pvalue": 0.0179,
  "number_existing_credits_es_pvalue": 0.6696,
  "number_people_liable_es_pvalue": null,
  "number_existing_credits_js_distance": 0.1662,
  "number_people_liable_js_distance": 0.1564,
  "present_residence_since_js_distance": 0.0959,
  "installment_rate_js_distance": 0.0923,
  "purpose_js_distance": 0.089,
  "credit_amount_js_distance": 0.0658,
  "age_years_js_distance": 0.0623,
  "present_employment_since_js_distance": 0.0609,
  "duration_months_js_distance": 0.0557,
  "savings_account_js_distance": 0.0471,
  "gender_js_distance": 0.0471,
  "credit_history_js_distance": 0.0357,
  "property_js_distance": 0.0348,
  "telephone_js_distance": 0.0262,
  "job_js_distance": 0.0244,
  "foreign_worker_js_distance": 0.0181,
  "checking_status_js_distance": 0.016,
  "installment_plans_js_distance": 0.0158,
  "housing_js_distance": 0.0103,
  "debtors_guarantors_js_distance": 0.0047,
  "duration_months_kl_divergence": 0.0152,
  "credit_amount_kl_divergence": 0.0191,
  "installment_rate_kl_divergence": 0.0089,
  "present_residence_since_kl_divergence": 0.0107,
  "age_years_kl_divergence": 0.0172,
  "number_existing_credits_kl_divergence": 0.005,
  "number_people_liable_kl_divergence": 0.0013,
  "checking_status_kl_divergence": 0.001,
  "credit_history_kl_divergence": 0.0053,
  "purpose_kl_divergence": 0.0336,
  "savings_account_kl_divergence": 0.0088,
  "present_employment_since_kl_divergence": 0.0148,
  "debtors_guarantors_kl_divergence": 0.0001,
  "property_kl_divergence": 0.0049,
  "installment_plans_kl_divergence": 0.001,
  "housing_kl_divergence": 0.0004,
  "job_kl_divergence": 0.0025,
  "telephone_kl_divergence": 0.0028,
  "foreign_worker_kl_divergence": 0.0013,
  "gender_kl_divergence": 0.0087,
  "duration_months_ks_pvalue": 0.4721,
  "credit_amount_ks_pvalue": 0.5733,
  "installment_rate_ks_pvalue": 0.7833,
  "present_residence_since_ks_pvalue": 0.8076,
  "age_years_ks_pvalue": 0.2495,
  "number_existing_credits_ks_pvalue": 1.0,
  "number_people_liable_ks_pvalue": 1.0,
  "data_drift": [
    {
      "test_name": "Epps-Singleton",
      "test_category": "data_drift",
      "test_type": "epps_singleton",
      "metric": "p_value",
      "test_id": "data_drift_epps_singleton_p_value",
      "values": {
        "duration_months": 0.7865,
        "credit_amount": 0.4227,
        "installment_rate": 0.4236,
        "present_residence_since": 0.3442,
        "age_years": 0.0179,
        "number_existing_credits": 0.6696,
        "number_people_liable": null
      }
    },
    {
      "test_name": "Jensen-Shannon",
      "test_category": "data_drift",
      "test_type": "jensen_shannon",
      "metric": "distance",
      "test_id": "data_drift_jensen_shannon_distance",
      "values": {
        "number_existing_credits": 0.1662,
        "number_people_liable": 0.1564,
        "present_residence_since": 0.0959,
        "installment_rate": 0.0923,
        "purpose": 0.089,
        "credit_amount": 0.0658,
        "age_years": 0.0623,
        "present_employment_since": 0.0609,
        "duration_months": 0.0557,
        "savings_account": 0.0471,
        "gender": 0.0471,
        "credit_history": 0.0357,
        "property": 0.0348,
        "telephone": 0.0262,
        "job": 0.0244,
        "foreign_worker": 0.0181,
        "checking_status": 0.016,
        "installment_plans": 0.0158,
        "housing": 0.0103,
        "debtors_guarantors": 0.0047
      }
    },
    {
      "test_name": "Kullback-Leibler",
      "test_category": "data_drift",
      "test_type": "kullback_leibler",
      "metric": "divergence",
      "test_id": "data_drift_kullback_leibler_divergence",
      "values": {
        "duration_months": 0.0152,
        "credit_amount": 0.0191,
        "installment_rate": 0.0089,
        "present_residence_since": 0.0107,
        "age_years": 0.0172,
        "number_existing_credits": 0.005,
        "number_people_liable": 0.0013,
        "checking_status": 0.001,
        "credit_history": 0.0053,
        "purpose": 0.0336,
        "savings_account": 0.0088,
        "present_employment_since": 0.0148,
        "debtors_guarantors": 0.0001,
        "property": 0.0049,
        "installment_plans": 0.001,
        "housing": 0.0004,
        "job": 0.0025,
        "telephone": 0.0028,
        "foreign_worker": 0.0013,
        "gender": 0.0087
      }
    },
    {
      "test_name": "Kolmogorov-Smirnov",
      "test_category": "data_drift",
      "test_type": "kolmogorov_smirnov",
      "metric": "p_value",
      "test_id": "data_drift_kolmogorov_smirnov_p_value",
      "values": {
        "duration_months": 0.4721,
        "credit_amount": 0.5733,
        "installment_rate": 0.7833,
        "present_residence_since": 0.8076,
        "age_years": 0.2495,
        "number_existing_credits": 1.0,
        "number_people_liable": 1.0
      }
    },
    {
      "test_name": "Summary",
      "test_category": "data_drift",
      "test_type": "summary",
      "metric": "pandas_describe",
      "test_id": "data_drift_summary_pandas_describe",
      "values": {
        "numerical_comparisons": {
          "duration_months": {
            "baseline": { "count": 800.0, "mean": 20.74375, "std": 12.056939835017488, "min": 4.0, "25%": 12.0, "50%": 18.0, "75%": 24.0, "max": 72.0 },
            "sample": { "count": 200.0, "mean": 21.54, "std": 12.075491188236855, "min": 6.0, "25%": 12.0, "50%": 18.0, "75%": 24.0, "max": 60.0 }
          },
          "credit_amount": "TRUNCATED",
          "installment_rate": "TRUNCATED",
          "present_residence_since": "TRUNCATED",
          "age_years": "TRUNCATED",
          "number_existing_credits": "TRUNCATED",
          "number_people_liable": "TRUNCATED"
        },
        "categorical_comparisons": {
          "checking_status": {
            "baseline": { "count": 800, "unique": 4, "top": "A14", "freq": 313 },
            "sample": { "count": 200, "unique": 4, "top": "A14", "freq": 81 }
          },
          "credit_history": "TRUNCATED",
          "purpose": "TRUNCATED",
          "savings_account": "TRUNCATED",
          "present_employment_since": "TRUNCATED",
          "debtors_guarantors": "TRUNCATED",
          "property": "TRUNCATED",
          "installment_plans": "TRUNCATED",
          "housing": "TRUNCATED",
          "job": "TRUNCATED",
          "telephone": "TRUNCATED",
          "foreign_worker": "TRUNCATED",
          "gender": "TRUNCATED"
        }
      }
    }
  ]
}
Drift Monitors Details
Choosing a drift monitor for a business model depends in practice on the particular model under consideration. For example, a binary classification model can be best monitored for concept drift by running a Summary test (basic statistics) instead of a 2-sample test, since there are only two possible outcomes, and thus a very small range for the random variable. In addition, feature types (numerical vs. categorical, also referred to in MOC terminology as dataClass) play an important role in choosing the right monitor. Some monitors, such as Kullback-Leibler (KL), accommodate both numerical and categorical data, whereas others (usually 2-sample tests such as Kolmogorov-Smirnov or Epps-Singleton) work only on numerical features.
That being said, model type and feature dataClass are the only abstractions to consider when choosing a drift monitor. Out-of-the-box monitoring takes care of the rest.
Out-of-the-Box Monitors
The following is the list of OOTB monitors that are currently implemented, along with the SciPy functions on which they are based:
Epps-Singleton 2-Sample Test
epps_singleton_2samp — SciPy v1.14.1 Manual
Test to see if two samples have the same underlying distribution. Returns a p-value. Samples do not have to be continuous.
If the output of the Epps-Singleton test on two samples is a p-value that is less than a chosen threshold (e.g., 0.05), then we can reject the null hypothesis that the two samples come from the same underlying distribution. When applied to a feature (or a target variable) of a dataset, the test indicates whether that feature (or target variable) has drifted between the baseline dataset and the sample dataset.
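As a minimal sketch of this test on a single feature, the snippet below calls scipy.stats.epps_singleton_2samp on synthetic data. The data, the 0.05 significance level, and the helper name es_drift are assumptions for illustration; this is not the OOTB monitor's implementation.

```python
import numpy as np
from scipy.stats import epps_singleton_2samp

ALPHA = 0.05  # assumed significance level; choose one that suits your model


def es_drift(baseline, sample, alpha=ALPHA):
    """Return (p_value, drift_detected) for one numerical feature."""
    result = epps_singleton_2samp(baseline, sample)
    p_value = float(result.pvalue)
    return p_value, p_value < alpha


# Synthetic values standing in for one feature of the baseline and sample sets.
rng = np.random.default_rng(seed=0)
baseline = rng.normal(loc=35.0, scale=10.0, size=800)
shifted_sample = rng.normal(loc=50.0, scale=10.0, size=200)  # mean shifted by 15

p_value, drifted = es_drift(baseline, shifted_sample)
print(f"p-value={p_value:.4g}, drift detected={drifted}")
```

With a shift this large relative to the standard deviation, the p-value falls far below the cutoff, so the feature is flagged.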
Kolmogorov-Smirnov 2-Sample Test
ks_2samp — SciPy v1.14.1 Manual
Test to see the goodness-of-fit of the underlying distributions of two samples. Returns a p-value. Only works on continuous distributions of data.
If the output of the Kolmogorov-Smirnov test on two samples is a p-value that is less than a chosen threshold (e.g., 0.05), then we can reject the null hypothesis that the two samples have an identical underlying distribution. When applied to a feature (or a target variable) of a dataset, the test indicates whether that feature (or target variable) has drifted between the baseline dataset and the sample dataset.
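The sketch below applies scipy.stats.ks_2samp to each numerical column and names the results in the same style as the credit_amount_ks_pvalue metric. The synthetic data, the dictionary-of-arrays layout, and the helper name ks_pvalues are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_pvalues(baseline, sample, numerical_columns):
    """Map '<feature>_ks_pvalue' to the KS 2-sample p-value of each column."""
    return {
        f"{col}_ks_pvalue": float(ks_2samp(baseline[col], sample[col]).pvalue)
        for col in numerical_columns
    }


rng = np.random.default_rng(seed=1)
baseline = {
    "credit_amount": rng.normal(3000.0, 1000.0, size=800),
    "age_years": rng.normal(35.0, 10.0, size=800),
}
sample = {
    # Identical values, so the KS statistic is 0 and no drift is reported.
    "credit_amount": baseline["credit_amount"].copy(),
    # Mean shifted by 15 years, so drift is expected.
    "age_years": rng.normal(50.0, 10.0, size=200),
}

pvalues = ks_pvalues(baseline, sample, ["credit_amount", "age_years"])
drifted = [name for name, p in pvalues.items() if p < 0.05]
print(drifted)  # ['age_years_ks_pvalue']
```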
Jensen-Shannon Distance
jensenshannon — SciPy v1.14.1 Manual
Computes the Jensen-Shannon distance between two distributions, which is the square root of the Jensen-Shannon divergence.
The output of the Jensen-Shannon distance calculation is not a p-value, unlike the output of the Epps-Singleton or Kolmogorov-Smirnov tests, but a distance. As such, there is no one-size-fits-all or universally accepted value that indicates that two distributions are significantly different. However, it is useful to track how the distance between two distributions changes over time.
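For a categorical feature, one way to obtain such a distance is to count category frequencies in each dataset and pass them to scipy.spatial.distance.jensenshannon, which normalizes the counts into probabilities. The helper name and the choice of base 2 (which bounds the distance between 0 and 1) are assumptions; the OOTB monitor may bucket and scale differently.

```python
from collections import Counter

from scipy.spatial.distance import jensenshannon


def js_distance_categorical(baseline, sample):
    """Jensen-Shannon distance between the category frequencies of two samples."""
    categories = sorted(set(baseline) | set(sample))
    baseline_counts = Counter(baseline)
    sample_counts = Counter(sample)
    # Counter returns 0 for categories missing from one of the samples.
    p = [baseline_counts[c] for c in categories]
    q = [sample_counts[c] for c in categories]
    return float(jensenshannon(p, q, base=2))


same = js_distance_categorical(["A11", "A12", "A14"], ["A14", "A12", "A11"])
disjoint = js_distance_categorical(["A11", "A11"], ["A14", "A14"])
print(same, disjoint)  # 0.0 for identical frequencies, 1.0 for disjoint categories
```

Tracking this value per feature across monitoring runs gives the trend that the paragraph above recommends watching.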
Kullback-Leibler Divergence
scipy.special.kl_div — SciPy v1.14.1 Manual
Computes the Kullback-Leibler divergence metric (also called relative entropy) between two distributions. The computation buckets the samples, computes the element-wise Kullback-Leibler divergence for each bucket, and then sums over the buckets to obtain the final divergence metric. Because the Kullback-Leibler divergence is asymmetric, the order in which the samples are passed to the calculation can produce differing results. It is also possible for the metric to return a value of Inf. In such a case, the samples are automatically reversed and the metric is calculated again.
The output of the Kullback-Leibler divergence calculation is not a p-value (like the Epps-Singleton and Kolmogorov-Smirnov tests), nor is it a distance (like the Jensen-Shannon distance), but rather a metric that indicates how far two distributions have diverged. As with the Jensen-Shannon distance, there is no one-size-fits-all or universally accepted value that determines whether two distributions are significantly different, but the Kullback-Leibler divergence provides one more option for detecting possible drift.
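The bucketing, element-wise divergence, summation, and reversal on Inf described in this section can be sketched for a numerical feature as follows. The number of buckets, the shared equal-width bucket edges, and the helper name kl_divergence are assumptions; the OOTB monitor's bucketing strategy is not specified here.

```python
import numpy as np
from scipy.special import kl_div


def kl_divergence(baseline, sample, bins=10):
    """Bucketed Kullback-Leibler divergence, reversing the inputs on Inf."""
    lo = min(np.min(baseline), np.min(sample))
    hi = max(np.max(baseline), np.max(sample))
    edges = np.linspace(lo, hi, bins + 1)  # shared bucket edges for both samples

    p_counts = np.histogram(baseline, bins=edges)[0]
    q_counts = np.histogram(sample, bins=edges)[0]
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()

    value = float(np.sum(kl_div(p, q)))  # element-wise terms summed over buckets
    if np.isinf(value):
        # A bucket that is empty in `sample` but not in `baseline` yields Inf;
        # swap the arguments and compute the divergence again.
        value = float(np.sum(kl_div(q, p)))
    return value


baseline = np.array([0.0, 0.0, 1.0, 1.0])
sample = np.zeros(4)
print(kl_divergence(baseline, sample, bins=2))  # ln(2), about 0.6931, after reversal
```

In this toy case the first pass returns Inf because the second bucket is empty in the sample, so the reversed computation supplies the final value.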
Model Assumptions
Business Models considered for drift monitoring have a couple of requirements:
1. An extended schema asset for the input data.
2. Input data contains at least one numerical column and/or one categorical column. The exact requirement depends on the specific monitor being used.
Model Execution
During execution, drift monitors perform the following steps:
1. The init function extracts the extended input schema from the job JSON.
2. Monitoring parameters are set based on the schema extracted previously. numerical_columns and categorical_columns are determined accordingly. In the case of concept drift monitoring, target_column (score column) and label_type (numerical vs. categorical) are determined at this step.
3. The metrics function runs the appropriate drift monitoring test: Epps-Singleton, Jensen-Shannon, Kullback-Leibler, Kolmogorov-Smirnov, or Pandas.describe(). When the drift monitor (data drift or concept drift) is a comprehensive monitor, all the tests above are performed.
4. Test results are appended to the list of data_drift or concept_drift tests to be returned by the model, and key-value pairs are added to the top level of the output dictionary.
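The steps above can be sketched in simplified form. Everything in this example is illustrative: the input_schema layout is a hypothetical stand-in for the extended schema carried in the job JSON, the helper add_test is not a ModelOp function, and precomputed p-values replace the actual statistical test.

```python
# Hypothetical, simplified stand-in for the extended input schema in the job JSON.
JOB_JSON = {
    "input_schema": [
        {"name": "credit_amount", "dataClass": "numerical"},
        {"name": "age_years", "dataClass": "numerical"},
        {"name": "purpose", "dataClass": "categorical"},
    ]
}

numerical_columns = []
categorical_columns = []


def init(job_json):
    """Steps 1-2: read the schema and derive the monitoring parameters."""
    global numerical_columns, categorical_columns
    fields = job_json["input_schema"]
    numerical_columns = [f["name"] for f in fields if f["dataClass"] == "numerical"]
    categorical_columns = [f["name"] for f in fields if f["dataClass"] == "categorical"]


def add_test(output, test, key_suffix):
    """Step 4: append the test and flatten its values to top-level keys."""
    output.setdefault("data_drift", []).append(test)
    for feature, value in test["values"].items():
        output[f"{feature}_{key_suffix}"] = value


def metrics(ks_values):
    """Step 3, with precomputed p-values standing in for a real KS test."""
    output = {}
    test = {
        "test_name": "Kolmogorov-Smirnov",
        "test_category": "data_drift",
        "test_type": "kolmogorov_smirnov",
        "metric": "p_value",
        "test_id": "data_drift_kolmogorov_smirnov_p_value",
        "values": {col: ks_values[col] for col in numerical_columns},
    }
    add_test(output, test, "ks_pvalue")
    return output


init(JOB_JSON)
result = metrics({"credit_amount": 0.5733, "age_years": 0.2495})
print(sorted(result))  # ['age_years_ks_pvalue', 'credit_amount_ks_pvalue', 'data_drift']
```

Writing each value both inside the test entry and as a top-level key is what lets a decision table reference a metric such as credit_amount_ks_pvalue directly.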
For a deeper look at OOTB drift monitors, see the GitHub READMEs: