Data & Concept Drift Monitoring

This article describes how ModelOp Center enables ongoing Data Drift and Concept Drift Monitoring.

Introduction

Monitoring data - input and output (concept) - for drift is necessary to track whether assumptions made during model development are still valid in a production setting. For instance, a data scientist may assume that the values of a particular feature are normally distributed or the choice of encoding of a certain categorical variable may have been made with a certain multinomial distribution in mind. Tests should be run routinely against batches of live data and compared against the distribution of the training/reference data to ensure that these assumptions are still valid; if the tests fail, appropriate alerts should be raised for the data scientist or ModelOps engineer to investigate.

ModelOp Center provides a number of Drift monitors out-of-the-box (OOTB) but also allows you to write your own drift monitor. The subsequent sections describe how to add a drift monitor - assuming an OOTB monitor - and the detailed makeup of a drift monitor for multiple types of models.

Adding Drift Monitors

As background on the terminology and concepts used below, please read the Monitoring Concepts section of the Model overview documentation.

To add drift monitoring to your business model, you will add an existing “Monitor” to a snapshot (deployable model) of the business model under consideration. Below are the steps to accomplish this. For tutorial purposes, these instructions use all out-of-the-box and publicly available content provided by ModelOp, focusing on the German Credit Model and its related assets.

Associate a Monitor to a Snapshot of a Business Model

  1. In MOC, navigate to the business model to be monitored. In our example here, that’s the German Credit Model.

  2. Navigate to the specific snapshot of the business model. If no snapshots exist, create one.

  3. On the Monitoring tab, click on + Add, then click on Monitor.

  4. Search for (or select) the Data Drift Monitor: Comprehensive Analysis from the list of OOTB monitors.

  5. Select a snapshot of the monitor. By default, a snapshot is created for each OOTB monitor.

  6. On the Input Assets page, you’ll notice that two assets are required: a baseline data asset and a sample data asset. This is because a drift monitor compares a slice of production data (sample) to a reference data set (baseline). For our example, select df_baseline_scored.json as the Baseline Data Asset and df_sample_scored.json as the Sample Data Asset. Since these files are already assets of the business model, we can find them under Select Existing.

  7. On the Threshold page, click on ADD A THRESHOLD, then select the .dmn file data_drift_DMN.dmn. Since the file is already an asset of the business model, we can find it under Select Existing. If the business model does not have a .dmn asset, the user may upload one from a local directory during the monitor association process. More on thresholds and decision tables in the next section.

  8. The last step in adding a monitor is adding an optional schedule. To do so, click on ADD A SCHEDULE. The Schedule Name field is free-form. The Signal Name field is a dropdown. Choose a signal that corresponds to your ticketing system (Jira, ServiceNow). Lastly, set the frequency of the monitoring job. This can be done either by the wizard or by entering a cron expression. Note: schedules are optional; a monitor may be run on-demand from the business model’s snapshot page, or by a curl command.

  9. On the Review page, click SAVE.

To run a monitor on demand, click on COPY CURL TO RUN JOB EXTERNALLY. The CURL command can then be run from the application of your choosing.

Define thresholds for your model

As mentioned in the Monitoring Concepts article, ModelOp Center uses decision tables to define the thresholds within which the model should operate for the given monitor.

  • The first step is to define these thresholds. For this tutorial, we will leverage the example data_drift_DMN.dmn decision table. Specifically, this decision table ensures that the credit_amount_ks_pvalue and installment_rate_js_distance metrics of the German Credit Model are within specification. credit_amount_ks_pvalue is the p-value returned by the Kolmogorov-Smirnov 2-sample test for the feature credit_amount; installment_rate_js_distance is the Jensen-Shannon distance for the feature installment_rate. If the p-value is sufficiently large (for example, above 0.05), the test gives no evidence that the two samples differ, and you can treat them as similar. If the p-value is small, you can conclude that the samples differ and generate an alert.

  • The credit_amount_ks_pvalue and installment_rate_js_distance values can be accessed directly from the Monitoring Test Result by design. More metrics are produced OOTB by the drift monitor. We will discuss this in more detail later.

  • In our example, the .dmn file is already an asset of the business model and versioned/managed along with the source code in the same Github repo. This is considered best practice, as the decision tables are closely tied to the specific business model under consideration. However, it is not a requirement that the .dmn files are available as model assets ahead of time.
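The kind of rule such a decision table encodes can be sketched in Python. This is only an illustration: in ModelOp Center the evaluation is done by the decision table itself, the function name check_thresholds is hypothetical, and the Jensen-Shannon cutoff of 0.1 is an assumed value, not one taken from data_drift_DMN.dmn.

```python
# Hypothetical cutoffs; the real ones live in the .dmn decision table.
KS_PVALUE_MIN = 0.05    # alert when the KS p-value drops below this
JS_DISTANCE_MAX = 0.1   # alert when the JS distance rises above this (assumed)


def check_thresholds(test_result):
    """Return a list of alert messages for metrics that are out of specification."""
    alerts = []
    ks_pvalue = test_result.get("credit_amount_ks_pvalue")
    if ks_pvalue is not None and ks_pvalue < KS_PVALUE_MIN:
        alerts.append(f"credit_amount_ks_pvalue={ks_pvalue} is below {KS_PVALUE_MIN}")
    js_distance = test_result.get("installment_rate_js_distance")
    if js_distance is not None and js_distance > JS_DISTANCE_MAX:
        alerts.append(f"installment_rate_js_distance={js_distance} is above {JS_DISTANCE_MAX}")
    return alerts


# Metric values as they appear in the sample monitor output in this article.
sample_result = {"credit_amount_ks_pvalue": 0.5733, "installment_rate_js_distance": 0.0923}
print(check_thresholds(sample_result))  # → []
```

Metrics that are missing or null are skipped rather than treated as failures; whether a null metric should raise an alert is a policy choice for the decision table.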

Run a Monitor On-demand (UI)

To run a monitor on-demand from the UI, navigate to the business model’s snapshot page and click the play button next to the monitor of interest. A monitoring job will be initiated, and you will be redirected to the corresponding job page once the job is created.

Schedule a Monitor DIY (CURL)

Monitors can be scheduled to run using your preferred enterprise scheduling capability (Control-M, Airflow, Autosys, etc.). While the details will depend on the specific scheduling software, at the highest level, the user simply needs to create a REST call to the ModelOp Center API. Here are the steps:

  1. Obtain the Business Model snapshot’s UUID. This can be found, for instance, in the URL of the snapshot page.

  2. Similarly, obtain the Monitoring Model snapshot’s UUID.

  3. Within the scheduler, configure the REST call to ModelOp Center’s automation engine to trigger the monitor for your model:

    1. Obtain a valid auth token

    2. Make a call (POST) to the ModelOp Center API to initiate the monitor. The endpoint is

      <MOC_INSTANCE_URL>/mlc-service/rest/signal

    3. The body should contain references to the Model Life Cycle (MLC) being triggered, as well as the business model and monitor snapshots, as shown below:

      {
        "name": "com.modelop.mlc.definitions.Signals_Run_Associated_Model_Jira",
        "variables": {
          "DEPLOYABLE_MODEL_ID": { "value": <UUID_of_business_model_snapshot_as_a_string> },
          "ASSOCIATED_MODEL_ID": { "value": <UUID_of_monitoring_model_snapshot_as_a_string> }
        }
      }

This process is made easier by copying the CURL command provided at the last step of the monitoring wizard.

The copied command will look something like this:

curl 'http://localhost:8090/mlc-service/rest/signalResponsive' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Content-Type: application/json' \
  -X POST \
  -H 'Authorization: Bearer <token>' \
  --data-raw '{"name":"com.modelop.mlc.definitions.Signals_Run_Associated_Model_Jira","variables":{"DEPLOYABLE_MODEL_ID":{"value":"23282688-62a6-47ae-8603-16f380efca57"},"ASSOCIATED_MODEL_ID":{"value":"1dc64c1e-3634-4e2e-b37d-71d04a9ee5ef"}}}'
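For schedulers that run Python tasks, the same request can be assembled with the standard library. The sketch below is an assumption-laden illustration: the helper name build_signal_request is hypothetical, it targets the /mlc-service/rest/signal endpoint from the steps above, and it only builds the request so that it can be inspected without a running ModelOp Center instance.

```python
import json
import urllib.request


def build_signal_request(moc_url, token, business_snapshot_id, monitor_snapshot_id):
    """Build (but do not send) the POST request that triggers a monitoring job."""
    body = {
        "name": "com.modelop.mlc.definitions.Signals_Run_Associated_Model_Jira",
        "variables": {
            "DEPLOYABLE_MODEL_ID": {"value": business_snapshot_id},
            "ASSOCIATED_MODEL_ID": {"value": monitor_snapshot_id},
        },
    }
    return urllib.request.Request(
        url=f"{moc_url}/mlc-service/rest/signal",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_signal_request(
    "http://localhost:8090",
    "<token>",  # replace with a valid auth token
    "23282688-62a6-47ae-8603-16f380efca57",
    "1dc64c1e-3634-4e2e-b37d-71d04a9ee5ef",
)
print(request.get_method(), request.full_url)
# To actually trigger the monitor: urllib.request.urlopen(request)
```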

Monitoring Execution

Once the scheduler triggers the signal, the corresponding MLC (listening to that signal) will be initiated. The sequence of events includes:

  1. Preparing the monitoring job with all artifacts necessary to run the job

  2. Creating the monitoring job

  3. Parsing the results into viewable test results

  4. Comparing the results against the thresholds in the decision table

  5. Taking action, which could include creating a notification and/or opening up an incident in JIRA/ServiceNow/etc.

These steps can be summarized in the following Model Life Cycle (MLC)

Monitoring Results and Notifications

Sample Standard Output of Data Drift Monitors

Monitoring Test Results are listed under the Test Results table:

Upon clicking on the “View” icon, you’ll have two options for looking at test results: graphical (under “Test Results”), and raw (under “Raw Results”).

Visual elements

  1. Summary Metrics: these are a subset of all metrics computed by the monitor, returned as key:value pairs for ease-of-reference. Below is a portion of the table:

  2. Data Drift Metrics

    1. Summary Metrics

    2. Kolmogorov-Smirnov p-values

    3. Jensen-Shannon distances

    4. Kullback-Leibler divergences

    5. Epps-Singleton p-values

Raw Results

The “Raw Results” tab shows a clickable (expandable and collapsible) JSON representation of the test results.

To get a JSON file of the test results:

  1. Navigate to the “Jobs” tab of the snapshot and click on “Details” next to the monitoring job of interest

  2. Click on “Download File” under “Outputs”

 

{
  "duration_months_es_pvalue": 0.7865,
  "credit_amount_es_pvalue": 0.4227,
  "installment_rate_es_pvalue": 0.4236,
  "present_residence_since_es_pvalue": 0.3442,
  "age_years_es_pvalue": 0.0179,
  "number_existing_credits_es_pvalue": 0.6696,
  "number_people_liable_es_pvalue": null,
  "number_existing_credits_js_distance": 0.1662,
  "number_people_liable_js_distance": 0.1564,
  "present_residence_since_js_distance": 0.0959,
  "installment_rate_js_distance": 0.0923,
  "purpose_js_distance": 0.089,
  "credit_amount_js_distance": 0.0658,
  "age_years_js_distance": 0.0623,
  "present_employment_since_js_distance": 0.0609,
  "duration_months_js_distance": 0.0557,
  "savings_account_js_distance": 0.0471,
  "gender_js_distance": 0.0471,
  "credit_history_js_distance": 0.0357,
  "property_js_distance": 0.0348,
  "telephone_js_distance": 0.0262,
  "job_js_distance": 0.0244,
  "foreign_worker_js_distance": 0.0181,
  "checking_status_js_distance": 0.016,
  "installment_plans_js_distance": 0.0158,
  "housing_js_distance": 0.0103,
  "debtors_guarantors_js_distance": 0.0047,
  "duration_months_kl_divergence": 0.0152,
  "credit_amount_kl_divergence": 0.0191,
  "installment_rate_kl_divergence": 0.0089,
  "present_residence_since_kl_divergence": 0.0107,
  "age_years_kl_divergence": 0.0172,
  "number_existing_credits_kl_divergence": 0.005,
  "number_people_liable_kl_divergence": 0.0013,
  "checking_status_kl_divergence": 0.001,
  "credit_history_kl_divergence": 0.0053,
  "purpose_kl_divergence": 0.0336,
  "savings_account_kl_divergence": 0.0088,
  "present_employment_since_kl_divergence": 0.0148,
  "debtors_guarantors_kl_divergence": 0.0001,
  "property_kl_divergence": 0.0049,
  "installment_plans_kl_divergence": 0.001,
  "housing_kl_divergence": 0.0004,
  "job_kl_divergence": 0.0025,
  "telephone_kl_divergence": 0.0028,
  "foreign_worker_kl_divergence": 0.0013,
  "gender_kl_divergence": 0.0087,
  "duration_months_ks_pvalue": 0.4721,
  "credit_amount_ks_pvalue": 0.5733,
  "installment_rate_ks_pvalue": 0.7833,
  "present_residence_since_ks_pvalue": 0.8076,
  "age_years_ks_pvalue": 0.2495,
  "number_existing_credits_ks_pvalue": 1.0,
  "number_people_liable_ks_pvalue": 1.0,
  "data_drift": [
    {
      "test_name": "Epps-Singleton",
      "test_category": "data_drift",
      "test_type": "epps_singleton",
      "metric": "p_value",
      "test_id": "data_drift_epps_singleton_p_value",
      "values": {
        "duration_months": 0.7865, "credit_amount": 0.4227, "installment_rate": 0.4236,
        "present_residence_since": 0.3442, "age_years": 0.0179,
        "number_existing_credits": 0.6696, "number_people_liable": null
      }
    },
    {
      "test_name": "Jensen-Shannon",
      "test_category": "data_drift",
      "test_type": "jensen_shannon",
      "metric": "distance",
      "test_id": "data_drift_jensen_shannon_distance",
      "values": {
        "number_existing_credits": 0.1662, "number_people_liable": 0.1564,
        "present_residence_since": 0.0959, "installment_rate": 0.0923, "purpose": 0.089,
        "credit_amount": 0.0658, "age_years": 0.0623, "present_employment_since": 0.0609,
        "duration_months": 0.0557, "savings_account": 0.0471, "gender": 0.0471,
        "credit_history": 0.0357, "property": 0.0348, "telephone": 0.0262, "job": 0.0244,
        "foreign_worker": 0.0181, "checking_status": 0.016, "installment_plans": 0.0158,
        "housing": 0.0103, "debtors_guarantors": 0.0047
      }
    },
    {
      "test_name": "Kullback-Leibler",
      "test_category": "data_drift",
      "test_type": "kullback_leibler",
      "metric": "divergence",
      "test_id": "data_drift_kullback_leibler_divergence",
      "values": {
        "duration_months": 0.0152, "credit_amount": 0.0191, "installment_rate": 0.0089,
        "present_residence_since": 0.0107, "age_years": 0.0172,
        "number_existing_credits": 0.005, "number_people_liable": 0.0013,
        "checking_status": 0.001, "credit_history": 0.0053, "purpose": 0.0336,
        "savings_account": 0.0088, "present_employment_since": 0.0148,
        "debtors_guarantors": 0.0001, "property": 0.0049, "installment_plans": 0.001,
        "housing": 0.0004, "job": 0.0025, "telephone": 0.0028, "foreign_worker": 0.0013,
        "gender": 0.0087
      }
    },
    {
      "test_name": "Kolmogorov-Smirnov",
      "test_category": "data_drift",
      "test_type": "kolmogorov_smirnov",
      "metric": "p_value",
      "test_id": "data_drift_kolmogorov_smirnov_p_value",
      "values": {
        "duration_months": 0.4721, "credit_amount": 0.5733, "installment_rate": 0.7833,
        "present_residence_since": 0.8076, "age_years": 0.2495,
        "number_existing_credits": 1.0, "number_people_liable": 1.0
      }
    },
    {
      "test_name": "Summary",
      "test_category": "data_drift",
      "test_type": "summary",
      "metric": "pandas_describe",
      "test_id": "data_drift_summary_pandas_describe",
      "values": {
        "numerical_comparisons": {
          "duration_months": {
            "baseline": { "count": 800.0, "mean": 20.74375, "std": 12.056939835017488, "min": 4.0, "25%": 12.0, "50%": 18.0, "75%": 24.0, "max": 72.0 },
            "sample": { "count": 200.0, "mean": 21.54, "std": 12.075491188236855, "min": 6.0, "25%": 12.0, "50%": 18.0, "75%": 24.0, "max": 60.0 }
          },
          "credit_amount": "TRUNCATED",
          "installment_rate": "TRUNCATED",
          "present_residence_since": "TRUNCATED",
          "age_years": "TRUNCATED",
          "number_existing_credits": "TRUNCATED",
          "number_people_liable": "TRUNCATED"
        },
        "categorical_comparisons": {
          "checking_status": {
            "baseline": { "count": 800, "unique": 4, "top": "A14", "freq": 313 },
            "sample": { "count": 200, "unique": 4, "top": "A14", "freq": 81 }
          },
          "credit_history": "TRUNCATED",
          "purpose": "TRUNCATED",
          "savings_account": "TRUNCATED",
          "present_employment_since": "TRUNCATED",
          "debtors_guarantors": "TRUNCATED",
          "property": "TRUNCATED",
          "installment_plans": "TRUNCATED",
          "housing": "TRUNCATED",
          "job": "TRUNCATED",
          "telephone": "TRUNCATED",
          "foreign_worker": "TRUNCATED",
          "gender": "TRUNCATED"
        }
      }
    }
  ]
}

Note that the top-level key:value pairs are what is shown in the “Summary Metrics” table.
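A downloaded result file can be post-processed with a few lines of Python. The sketch below works on a small excerpt of the output above; the 0.05 significance level is illustrative, and the variable names are arbitrary.

```python
import json

# A small excerpt of the raw test result shown above.
raw = json.loads("""
{
  "age_years_es_pvalue": 0.0179,
  "credit_amount_es_pvalue": 0.4227,
  "number_people_liable_es_pvalue": null,
  "credit_amount_ks_pvalue": 0.5733,
  "installment_rate_js_distance": 0.0923,
  "data_drift": [{"test_name": "Kolmogorov-Smirnov", "values": {"credit_amount": 0.5733}}]
}
""")

# Summary metrics are the scalar top-level entries; nested test details are skipped.
summary = {k: v for k, v in raw.items() if not isinstance(v, (list, dict))}

# P-values below the chosen significance level point to possible drift.
ALPHA = 0.05  # illustrative significance level
flagged = sorted(
    k for k, v in summary.items()
    if k.endswith("_pvalue") and v is not None and v < ALPHA
)
print(flagged)  # → ['age_years_es_pvalue']
```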

Sample Monitoring Notification

Notifications arising from monitoring jobs can be found under the corresponding model test result.

If a ticketing system such as Jira is configured in ModelOp Center, a ticket will be created when an ERROR occurs, and a link to the ticket will be available next to the notification. In the example above, a metric fell outside a preset threshold, and thus the monitoring job failed.

Drift Monitors Details

Choosing a drift monitor for a business model depends in practice on the particular model under consideration. For example, a binary classification model is often best monitored for concept drift by running a Summary test (basic statistics) instead of a 2-sample test, since there are only two possible outcomes and thus a very small range for the random variable. In addition, feature types (numerical vs. categorical, also referred to in MOC terminology as dataClass) play an important role in choosing the right monitor. Some monitors, such as Kullback-Leibler (KL), accommodate both numerical and categorical data, whereas others (usually 2-sample tests such as Kolmogorov-Smirnov or Epps-Singleton) work only on numerical features.

That said, model type and feature dataClass are the only abstractions to consider when choosing a drift monitor. Out-of-the-box monitoring takes care of the rest.

Out-of-the-Box Monitors

The following is the list of OOTB monitors that are currently implemented, along with links to the SciPy functions on which they are based:

  • Epps-Singleton 2-Sample Test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.epps_singleton_2samp.html

Test to see if two samples have the same underlying distribution. Returns a p-value. Samples do not have to be continuous.

If the output of the Epps-Singleton test on two distributions is a p-value that is less than a certain threshold (e.g., 0.05), then we can reject the null hypothesis that the two samples come from the same underlying distribution. When applied to a feature (or a target variable) of a dataset, we can determine if there is drift between a baseline and a sample dataset in that feature (or target variable).

Remarks:

  1. Null values in the samples will cause the Epps-Singleton test to fail. As such, null values are dropped when calculating the Epps-Singleton test.

  2. The Epps-Singleton test will fail when a sample contains fewer than five values. In such cases, the Epps-Singleton test will return a null metric.
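The behavior described above can be sketched with SciPy directly. This is an illustration rather than the monitor's actual source: the wrapper name es_pvalue is hypothetical, and the synthetic data stands in for a baseline and a sample feature column.

```python
import numpy as np
from scipy.stats import epps_singleton_2samp


def es_pvalue(baseline, sample):
    """Epps-Singleton p-value, or None when the test cannot be computed."""
    # Drop nulls (None/NaN) first, as described in the remarks above.
    baseline = np.asarray(baseline, dtype=float)
    sample = np.asarray(sample, dtype=float)
    baseline = baseline[~np.isnan(baseline)]
    sample = sample[~np.isnan(sample)]
    # SciPy needs at least five observations in each sample.
    if len(baseline) < 5 or len(sample) < 5:
        return None
    try:
        return float(epps_singleton_2samp(baseline, sample).pvalue)
    except (ValueError, np.linalg.LinAlgError):
        # Degenerate inputs for which the statistic cannot be computed.
        return None


rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=500)
shifted = rng.normal(loc=1.0, scale=1.0, size=500)   # mean moved by one standard deviation
print(es_pvalue(baseline, shifted))            # a very small p-value, suggesting drift
print(es_pvalue(baseline, [1.0, 2.0, None]))   # None: fewer than five usable values
```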

 

  • Kolmogorov-Smirnov 2-Sample Test

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html#scipy.stats.ks_2samp

Test to see the goodness-of-fit of the underlying distributions of two samples. Returns a p-value. Only works on continuous distributions of data.

If the output of the Kolmogorov-Smirnov test on two distributions is a p-value that is less than a certain threshold (e.g., 0.05), then we can reject the null hypothesis that the two samples have an identical underlying distribution. When applied to a feature (or a target variable) of a dataset, we can determine if there is drift between a baseline and a sample dataset in that feature (or target variable).
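A minimal sketch of this decision rule, using scipy.stats.ks_2samp on synthetic data. The sample sizes mirror the 800/200 split of the German Credit example, but the data, the random seed, and the 0.05 significance level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=800)   # reference data
same = rng.normal(loc=0.0, scale=1.0, size=200)       # drawn from the same distribution
drifted = rng.normal(loc=1.0, scale=1.0, size=200)    # mean has shifted

ALPHA = 0.05  # illustrative significance level

for name, sample in [("same", same), ("drifted", drifted)]:
    result = ks_2samp(baseline, sample)
    verdict = "drift suspected" if result.pvalue < ALPHA else "no drift detected"
    print(f"{name}: statistic={result.statistic:.3f}, p-value={result.pvalue:.4f} -> {verdict}")
```

A p-value above the threshold does not prove that the distributions are identical; it only means the test found no evidence of a difference at the chosen level.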

 

  • Jensen-Shannon Distance

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html

Computes the Jensen-Shannon distance between two distributions, which is the square root of the Jensen-Shannon divergence metric.

The output of the Jensen-Shannon distance calculation is not a p-value, like the Epps-Singleton or the Kolmogorov-Smirnov tests, but a distance. As such, there is no one-size-fits-all or universally accepted value that shows that the two distributions are significantly different. However, it is useful to keep track of how the distance between two distributions changes over time.

Remarks:

  1. Null values in the samples will cause the Jensen-Shannon distance to fail. As such, null values are dropped when calculating the Jensen-Shannon distance.

  2. Because the Jensen-Shannon distance attempts to fit a Gaussian KDE on the samples, an error occurs when there is little to no variance in the samples (e.g., all constant values). In such cases, the Jensen-Shannon distance will return a null metric.

 

  • Kullback-Leibler Divergence

https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.kl_div.html

Computes the Kullback-Leibler divergence metric (also called relative entropy) between two distributions. The samples are bucketed, the element-wise Kullback-Leibler divergence is computed for each bucket, and the per-bucket values are summed to give the final divergence metric. Because the Kullback-Leibler divergence is asymmetric, the order in which the samples are passed to the calculation can change the result.

The output of the Kullback-Leibler divergence calculation is not a p-value (like the Epps-Singleton and Kolmogorov-Smirnov tests), nor is it a distance (like the Jensen-Shannon distance), but rather a metric to inform how divergent two distributions might be. Like the Jensen-Shannon distance, there is no one-size-fits-all or universally accepted value to determine if two distributions are significantly different, but the Kullback-Leibler divergence provides one more option in detecting possible drift.

Remarks:

  1. It is possible that the Kullback-Leibler Divergence will return a value of Inf (when the support of one sample is not contained within the support of the other sample, or when one sample distribution has a much “wider tail” than the other). In such cases, the order of the samples will be reversed and the Kullback-Leibler Divergence will be recalculated (with an appropriate logger.warning raised). However, in the case that even the reversed order of samples returns Inf, the Kullback-Leibler Divergence will return a null metric.
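The bucketing and the reverse-order fallback can be sketched as below. The function name kl_divergence, the choice of ten equal-width buckets, and the initial argument order are assumptions for illustration; the actual monitor may bucket the data differently.

```python
import logging

import numpy as np
from scipy.special import kl_div

logger = logging.getLogger(__name__)


def kl_divergence(baseline, sample, bins=10):
    """Bucketed KL divergence; retried in reverse order on Inf, None if still Inf."""
    baseline = np.asarray(baseline, dtype=float)
    sample = np.asarray(sample, dtype=float)
    # Shared bucket edges so that both histograms are comparable.
    edges = np.histogram_bin_edges(np.concatenate([baseline, sample]), bins=bins)
    p = np.histogram(baseline, bins=edges)[0] / len(baseline)
    q = np.histogram(sample, bins=edges)[0] / len(sample)

    divergence = kl_div(p, q).sum()   # element-wise terms summed over buckets
    if np.isinf(divergence):
        logger.warning("KL divergence is Inf; retrying with the samples reversed")
        divergence = kl_div(q, p).sum()
    return None if np.isinf(divergence) else float(divergence)


rng = np.random.default_rng(0)
wide = rng.uniform(0.0, 2.0, 1000)
narrow = rng.uniform(0.0, 1.0, 1000)
left = rng.uniform(0.0, 1.0, 1000)
right = rng.uniform(2.0, 3.0, 1000)
print(kl_divergence(wide, narrow))  # finite, obtained after reversing the order
print(kl_divergence(left, right))   # None: Inf in both directions
```

In the first call, the narrow sample has empty buckets where the wide one does not, so the initial order yields Inf and the reversed order is used. In the second call the supports do not overlap at all, so both orders yield Inf.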

 

Model Assumptions


Business Models considered for drift monitoring have a couple of requirements:

  1. An extended schema asset for the input data.

  2. Input data contains at least one numerical column and/or one categorical column. The exact requirement depends on the specific monitor being used.

Model Execution

During execution, drift monitors execute the following:

  1. The init function extracts the extended input schema from job JSON.

  2. Monitoring parameters are set based on the schema extracted previously. numerical_columns and categorical_columns are determined accordingly. In the case of concept drift monitoring, target_column (score column) and label_type (numerical vs. categorical) are determined at this step.

  3. The metrics function runs the appropriate drift monitoring test: Epps-Singleton, Jensen-Shannon, Kullback-Leibler, Kolmogorov-Smirnov, or pandas describe(). When the drift monitor (data drift or concept drift) is a comprehensive monitor, all the tests above are performed.

  4. Test results are appended to the list of data_drift or concept_drift tests to be returned by the model, and key-value pairs are added to the top-level of the output dictionary.
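The last step can be illustrated with a short sketch that reproduces the output layout shown under Raw Results. The helper add_test_result and its parameters are hypothetical; only the resulting keys and structure follow the sample output in this article.

```python
def add_test_result(output, test_name, test_type, metric, suffix, values, category="data_drift"):
    """Append one test to the category list and mirror its values as top-level keys."""
    output.setdefault(category, []).append({
        "test_name": test_name,
        "test_category": category,
        "test_type": test_type,
        "metric": metric,
        "test_id": f"{category}_{test_type}_{metric}",
        "values": values,
    })
    # Top-level key-value pairs, e.g. "credit_amount_ks_pvalue".
    for feature, value in values.items():
        output[f"{feature}_{suffix}"] = value
    return output


output = {}
add_test_result(
    output, "Kolmogorov-Smirnov", "kolmogorov_smirnov", "p_value", "ks_pvalue",
    {"credit_amount": 0.5733, "installment_rate": 0.7833},
)
print(output["credit_amount_ks_pvalue"])    # → 0.5733
print(output["data_drift"][0]["test_id"])   # → data_drift_kolmogorov_smirnov_p_value
```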

For a deeper look at OOTB drift monitors, see the GitHub READMEs: