This article describes how ModelOp Center enables ongoing data drift and concept drift monitoring.

Introduction

Monitoring incoming data for statistical drift is necessary to track whether assumptions made during model development are still valid in a production setting. For instance, a data scientist may assume that the values of a particular feature are normally distributed, or may have chosen the encoding of a categorical variable with a particular multinomial distribution in mind. Tests should be run routinely on batches of live data, comparing them against the distribution of the training data, to confirm that these assumptions still hold. If a test fails, an appropriate alert is raised for the data scientist or ModelOps engineer to investigate.
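
For a categorical feature, one possible test (an assumption for illustration, not necessarily what the out-of-the-box monitors use) is a chi-square test on the category counts of the training data and the live batch. The sketch below uses SciPy's `chi2_contingency`; the helper name `categorical_drift_pvalue` and all counts are hypothetical.

```python
from scipy.stats import chi2_contingency


def categorical_drift_pvalue(train_counts, batch_counts):
    """Return the chi-square p-value comparing two sets of category counts."""
    # Use the union of categories so a level missing from one sample counts as 0
    categories = sorted(set(train_counts) | set(batch_counts))
    table = [
        [train_counts.get(c, 0) for c in categories],
        [batch_counts.get(c, 0) for c in categories],
    ]
    # chi2_contingency returns (statistic, p-value, dof, expected frequencies)
    return chi2_contingency(table)[1]


# Hypothetical category counts, invented for illustration only
train_counts = {"A": 500, "B": 300, "C": 200}
similar_batch = {"A": 50, "B": 30, "C": 20}
shifted_batch = {"A": 20, "B": 30, "C": 50}

p_similar = categorical_drift_pvalue(train_counts, similar_batch)
p_shifted = categorical_drift_pvalue(train_counts, shifted_batch)
print(p_similar, p_shifted)
```

A batch with the same proportions as the training data yields a large p-value, while the batch whose proportions have shifted yields a very small one.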

ModelOp Center provides a number of drift monitors out of the box and also allows you to write your own. The following sections describe how to add a drift monitor (using an out-of-the-box monitor) and the detailed makeup of a drift monitor for several types of models.

Adding Drift Monitors

As background on the terminology and concepts used below, please read the Monitoring Concepts section of the Model overview documentation.

To add a drift monitor to your model, you will add an existing “associated” model to your model. Below are the steps to accomplish this. For tutorial purposes, these instructions use all out-of-the-box and publicly available content provided by ModelOp, focusing on the Consumer Linear Demo and its related assets.

Define thresholds for your model

As mentioned in the Monitoring Concepts article, ModelOp Center uses decision tables to define the thresholds within which the model should operate for the given monitor.
The first step is to define these thresholds. For this tutorial, we will use the example Data-drift.dmn decision table. This assumes that the out-of-the-box Data Drift Detector is used, which applies the Kolmogorov-Smirnov test to measure changes in the distributions between production and training data and outputs p-values. Specifically, this drift detector checks that the critical features of the Consumer Linear Demo model are within specification.
Repeat for the provided Concept-drift.dmn and Performance-test.dmn files.
Save the files locally to your machine.
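
To make the role of the thresholds concrete, here is a minimal Python sketch of the kind of check a decision table expresses. In ModelOp Center the comparison is performed by the DMN decision table, not by Python code; the function name `failing_features`, the 0.05 threshold, and the feature names and p-values are all assumptions made for illustration.

```python
def failing_features(pvalues, threshold=0.05):
    """Return the features whose p-value falls below the threshold, sorted by name."""
    return sorted(feat for feat, p in pvalues.items() if p < threshold)


# Hypothetical monitor output, shaped like the `pvalues` dictionary that the
# drift model later in this article yields; feature names and values are invented
monitor_output = {
    "pvalues": {"feature_a": 0.62, "feature_b": 0.003, "feature_c": 0.048}
}

failed = failing_features(monitor_output["pvalues"])
print("Out of specification:", failed)
```

A stricter threshold such as 0.01 flags fewer features, which is the kind of trade-off the decision table lets you tune per model.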

Associate Monitor models to snapshot

1. Navigate to the specific model snapshot.
2. Using the Associated Models widget, create a data drift association.
3. Use the provided data and the DMN file you saved earlier.
4. Click Save.
5. The monitor “associated model” is saved and is now ready to run against the model’s specific snapshot.

Schedule the Monitor

Schedule. Monitors can be scheduled to run using your preferred enterprise scheduling capability (Control-M, Airflow, Autosys, etc.)
1. While the details will depend on the specific scheduling software, at the highest level, the user simply needs to create a REST call to the ModelOp Center API. Here are the steps:
  1. Obtain the Model snapshot’s unique ID, which can be obtained from the Model snapshot screen. Simply copy the ID from the URL bar:
    1. Example:
  2. Within the scheduler, configure the REST call to ModelOp Center’s automation engine to trigger the monitor for your model:
    1. Obtain a valid auth token
    2. Make a call to the ModelOp Center API to initiate the monitor
    3. Example:
      { "name": "com.modelop.mlc.definitions.Signals_MODEL_DATA_DRIFT_TEST", "variables": { "MODEL_ID": { "value": "FILL-IN-SNAPSHOT-GUID" } } }
  3. For more details on triggering monitors, visit the article Triggering Metrics Tests.
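
The REST call described in the steps above can be sketched in Python as follows. The JSON body is the one shown in the example; the endpoint URL, the token value, and the use of a Bearer authorization header are placeholders and assumptions, so substitute the values for your own deployment. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical placeholders: the real endpoint and token come from your deployment
MOC_SIGNAL_URL = "https://modelop.example.com/REPLACE-WITH-SIGNAL-ENDPOINT"
AUTH_TOKEN = "REPLACE-WITH-VALID-TOKEN"


def build_drift_signal(snapshot_id):
    """Build the JSON body that triggers the data drift monitor for a snapshot."""
    return {
        "name": "com.modelop.mlc.definitions.Signals_MODEL_DATA_DRIFT_TEST",
        "variables": {"MODEL_ID": {"value": snapshot_id}},
    }


def build_request(snapshot_id, url=MOC_SIGNAL_URL, token=AUTH_TOKEN):
    """Prepare (but do not send) an authenticated POST request."""
    body = json.dumps(build_drift_signal(snapshot_id)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is passed as a Bearer token
            "Authorization": f"Bearer {token}",
        },
    )


request = build_request("FILL-IN-SNAPSHOT-GUID")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Most enterprise schedulers can issue the same call natively, in which case only the body and headers need to be carried over.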
Monitoring Execution: once the scheduler triggers the monitoring job, the relevant model life cycle will initiate the specific monitor, which likely includes:
1. Preparing the monitoring job with all artifacts necessary to run the job
2. Creating the monitoring job
3. Parsing the results into viewable test results
4. Comparing the results against the thresholds in the decision table
5. Taking action, which could include creating a notification and/or opening up an incident in JIRA/ServiceNow/etc.

Viewing Monitoring Notifications

Typically, the model life cycle that runs the monitor will create notifications, such as:
1. A monitor has been started
2. A monitor has run successfully
3. A monitor’s output (model test) has failed

These notifications can be viewed on the home page of ModelOp Center’s UI:

Viewing Monitoring Job Results

All monitor job results are persisted and can be viewed directly by clicking the specific “result” in the “Model Tests” section of the model snapshot page:

Drift Monitor Details

Because the same data set may serve several models, you can write one drift detection model and associate it with several models. This association is made during the Model Lifecycle process. The drift model can compare the training data of the associated models to a given batch of data. The following is a simple example:

import pandas as pd
from scipy.stats import ks_2samp


# modelop.init
def begin():
    """
    A function to read training data and save it, along with its
    numerical features, globally.
    """
    
    global train, numerical_features
    
    train = pd.read_csv('training_data.csv')
    numerical_features = train.select_dtypes(['int64', 'float64']).columns


# modelop.metrics
def metrics(data):
    """
    A function to compute KS p-values on input (sample) data
    as compared to training (baseline) data
    """
    
    # Run a two-sample KS test per numerical feature: training vs. incoming batch
    ks_tests = [ks_2samp(train.loc[:, feat], data.loc[:, feat])
                for feat in numerical_features]
    # Each result is a (statistic, p-value) pair; keep the p-value
    pvalues = [x[1] for x in ks_tests]
    ks_pvalues = dict(zip(numerical_features, pvalues))
    
    yield dict(pvalues=ks_pvalues)

This drift model runs a two-sample Kolmogorov-Smirnov test between each numerical feature of the training data and the same feature in the incoming batch, and reports the p-values. If a p-value is above the chosen significance level (commonly 0.01 or 0.05), the test finds no evidence that the two samples come from different distributions, and you can treat them as similar. If a p-value falls below that level, you can conclude that the samples differ and generate an alert.
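
The behavior of the test can be seen on synthetic data. The following sketch is independent of ModelOp Center: the data is randomly generated, and the 0.05 significance level is an assumed choice. It compares a sample with itself and with a batch whose mean has shifted.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Synthetic "training" feature and a batch whose mean has shifted
# by one standard deviation
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
shifted_batch = rng.normal(loc=1.0, scale=1.0, size=1000)

# Comparing a sample with itself gives a KS statistic of 0
same = ks_2samp(train_feature, train_feature)
shifted = ks_2samp(train_feature, shifted_batch)

alpha = 0.05  # assumed significance level
print("identical sample flagged as drifted:", same.pvalue < alpha)
print("shifted sample flagged as drifted:", shifted.pvalue < alpha)
```

The identical sample is not flagged, while the shifted batch produces a p-value far below the significance level.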

If the training data is too large to fit in memory, you can compute summary statistics about the training data, save them (for example, as a pickle file), and read them in during the init function of the drift model. The metrics function can then use other statistical tests to compare those statistics with the statistics of the incoming batch.
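
One way to apply this idea, shown here as a rough sketch only, is to store the mean, standard deviation, and count of each numerical feature and then compute a z-score for the batch mean. The helper names `save_summary` and `mean_shift_zscores`, the file name, the choice of a z-score, and the data are all assumptions for illustration.

```python
import math
import os
import pickle
import tempfile

import pandas as pd


def save_summary(train, path):
    """Persist per-feature mean, standard deviation and count for numeric columns."""
    numeric = train.select_dtypes(["int64", "float64"])
    summary = {}
    for col in numeric.columns:
        summary[col] = {
            "mean": float(numeric[col].mean()),
            "std": float(numeric[col].std()),
            "n": int(numeric[col].count()),
        }
    with open(path, "wb") as f:
        pickle.dump(summary, f)


def mean_shift_zscores(summary, batch):
    """Return the z-score of each batch mean against the stored training mean."""
    scores = {}
    for col, stats in summary.items():
        # Standard error of the batch mean, using the training standard deviation
        std_err = stats["std"] / math.sqrt(len(batch[col]))
        scores[col] = (float(batch[col].mean()) - stats["mean"]) / std_err
    return scores


# Tiny invented data set, for illustration only
train = pd.DataFrame({"income": [40.0, 50.0, 60.0, 50.0], "age": [30, 40, 50, 40]})
batch = pd.DataFrame({"income": [90.0, 95.0, 100.0, 95.0], "age": [31, 39, 49, 41]})

path = os.path.join(tempfile.mkdtemp(), "training_summary.pkl")
save_summary(train, path)        # done once, offline
with open(path, "rb") as f:      # done in the init function
    summary = pickle.load(f)

zscores = mean_shift_zscores(summary, batch)
print(zscores)
```

Here the `income` mean has moved far from the training mean and gets a large z-score, while `age` is unchanged. Only the small summary dictionary needs to be loaded at run time, not the training data itself.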

Spark Drift Model Details

A similar drift detection method may be used for PySpark models with HDFS assets by parsing the HDFS asset URLs from the parameters of the metrics function. The following is a simple example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassificationModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


# modelop.init
def begin():
    print("Begin function...")

    global SPARK
    SPARK = SparkSession.builder.appName("DriftTest").getOrCreate()

    global MODEL
    MODEL = RandomForestClassificationModel.load("/hadoop/demo/titanic-spark/titanic")
    
# modelop.metrics
def metrics(external_inputs, external_outputs, external_model_assets):
    # Grab single input asset and single output asset file paths
    input_asset_path = external_inputs[0]["fileUrl"]
    output_asset_path = external_outputs[0]["fileUrl"]
    
    input_df = SPARK.read.format("csv").option("header", "true").load(input_asset_path)
    predictions = predict(input_df)

    # Compare predictions with the true label (Survived) and compute accuracy
    evaluator = MulticlassClassificationEvaluator(
        labelCol="Survived", predictionCol="prediction", metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions)

    output_df = SPARK.createDataFrame([{"accuracy": accuracy}])
    print("Metrics output:")
    output_df.show()

    output_df.coalesce(1).write.mode("overwrite").option("header", "true").format(
        "json"
    ).save(output_asset_path)

    SPARK.stop()
    
def predict(input_df):
    dataset = input_df.select(
        col("Survived").cast("float"),
        col("Pclass").cast("float"),
        col("Sex"),
        col("Age").cast("float"),
        col("Fare").cast("float"),
        col("Embarked"),
    )

    dataset = dataset.replace("?", None).dropna(how="any")

    dataset = (
        StringIndexer(inputCol="Sex", outputCol="Gender", handleInvalid="keep")
        .fit(dataset)
        .transform(dataset)
    )

    dataset = (
        StringIndexer(inputCol="Embarked", outputCol="Boarded", handleInvalid="keep")
        .fit(dataset)
        .transform(dataset)
    )

    dataset = dataset.drop("Sex")
    dataset = dataset.drop("Embarked")

    required_features = ["Pclass", "Age", "Fare", "Gender", "Boarded"]

    assembler = VectorAssembler(inputCols=required_features, outputCol="features")
    transformed_data = assembler.transform(dataset)

    predictions = MODEL.transform(transformed_data)
    
    return predictions

This model uses a Spark MulticlassClassificationEvaluator to determine the accuracy of the predictions generated by the Titanic model on the input asset, and writes the result as JSON to the output asset.
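
The asset parsing at the top of the metrics function assumes that each asset is a dictionary with a `fileUrl` entry. A small, plain-Python helper such as the hypothetical `asset_urls` below can make a missing entry fail with a clear message; the example HDFS path is invented.

```python
def asset_urls(assets):
    """Return the fileUrl of every asset, failing loudly if one is missing."""
    urls = []
    for index, asset in enumerate(assets):
        if "fileUrl" not in asset:
            raise ValueError(f"asset {index} has no 'fileUrl' entry")
        urls.append(asset["fileUrl"])
    return urls


# Hypothetical parameter value; the HDFS path is invented for illustration
external_inputs = [{"fileUrl": "hdfs:///example/input/batch.csv"}]

input_asset_path = asset_urls(external_inputs)[0]
print(input_asset_path)
```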

ModelOp Documentation Hub V2.3

Drift & Concept Drift Monitoring