Statistical Monitoring

This article describes how ModelOp Center enables ongoing Statistical Monitoring.


Introduction

Monitoring a model for its statistical performance is necessary to track whether the model is producing good output (inferences/scores) as compared to actual ground truth. These statistical metrics provide excellent insight into the predictive power of the model, including helping to identify degradation in the model’s ability to predict correctly. These statistical monitors should be run routinely against batches of labeled data and compared against the original metrics produced during training to ensure that the model is performing within specification. If the production statistical metrics deviate beyond a set threshold, then the appropriate alerts are raised for the data scientist or ModelOps engineer to investigate.

ModelOp Center provides a number of statistical monitors out of the box, but also allows you to write your own custom metrics to monitor the statistical performance of the model. The subsequent sections describe how to add a statistical monitor (assuming an out-of-the-box monitor) and the detailed makeup of a statistical monitor for multiple types of models.

Adding Statistical Monitors

As background on the terminology and concepts used below, please read the Monitoring Concepts section of the Model overview documentation.

To add a statistical monitor to your model, you associate an existing monitor model (an “associated model”) with your model’s snapshot. Below are the steps to accomplish this. For tutorial purposes, these instructions use out-of-the-box, publicly available content provided by ModelOp, focusing on the Consumer Linear Demo and its related assets.

 

Define thresholds for your model

  1. As mentioned in the Monitoring Concepts article, ModelOp Center uses decision tables to define the thresholds within which the model should operate for the given monitor.

  2. The first step is to define these thresholds. For this tutorial, we will leverage the example Performance-test.dmn decision table. This assumes that the out-of-the-box metrics function in the Consumer Credit Default example model is used, which outputs AUC, ROC, and F1, among other metrics. Specifically, this decision table ensures that the F1 score and AUC from the Consumer Linear Demo model are within specification (a minimal sketch of this threshold logic follows this list).

  3. Save the files locally to your machine.
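
For reference, the rules in a decision table such as Performance-test.dmn boil down to range checks on the monitor’s output metrics. The Python sketch below illustrates that logic only; the metric names mirror the monitor output shown later in this article, while the THRESHOLDS values and the within_specification helper are illustrative assumptions and are not part of ModelOp Center.

# Hypothetical illustration of the range checks a decision table like
# Performance-test.dmn encodes; the actual thresholds live in the DMN file.
THRESHOLDS = {
    "auc": 0.65,       # assumed minimum acceptable AUC (illustrative)
    "f1_score": 0.55,  # assumed minimum acceptable F1 (illustrative)
}

def within_specification(test_results: dict) -> bool:
    """Return True if every monitored metric meets its assumed minimum threshold."""
    return all(
        test_results.get(metric, 0.0) >= minimum
        for metric, minimum in THRESHOLDS.items()
    )

# Example usage with metrics shaped like the monitor's output shown later
print(within_specification({"auc": 0.72, "f1_score": 0.61}))  # True: within specification
print(within_specification({"auc": 0.58, "f1_score": 0.61}))  # False: would raise an alert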

 

Associate Monitor models to snapshot

  1. Navigate to the specific model snapshot

    1. Using the Associated Models widget, create a statistical monitor association

    2. Use the provided data and the DMN you saved in the previous section.

    3. Click Save.

    4. The monitor (“associated model”) will be saved and is now ready to run against the model’s specific snapshot

 

Schedule the Monitor

  1. Scheduling: monitors can be scheduled to run using your preferred enterprise scheduling capability (Control-M, Airflow, AutoSys, etc.)

    1. While the details will depend on the specific scheduling software, at the highest level, the user simply needs to create a REST call to the ModelOp Center API. Here are the steps (an illustrative sketch of such a call, as a scheduler job might issue it, follows this list):

      1. Obtain the model snapshot’s unique ID from the model snapshot screen by copying the ID from the URL bar.

      2. Within the scheduler, configure the REST call to ModelOp Center’s automation engine to trigger the monitor for your model:

        1. Obtain a valid auth token

        2. Make a call to the ModelOp Center API to initiate the monitor

        3. Example request body:

           {
               "name": "com.modelop.mlc.definitions.Signals_MODEL_BACK_TEST",
               "variables": {
                   "MODEL_ID": {
                       "value": "FILL-IN-SNAPSHOT-GUID"
                   }
               }
           }
      3. For more details on triggering monitors, visit the article Triggering Metrics Tests.

  2. Monitoring Execution: once the scheduler triggers the monitoring job, the relevant model life cycle will initiate the specific monitor, which typically includes:

    1. Preparing the monitoring job with all artifacts necessary to run the job

    2. Creating the monitoring job

    3. Parsing the results into viewable test results

    4. Comparing the results against the thresholds in the decision table

    5. Taking action, which could include creating a notification and/or opening up an incident in JIRA/ServiceNow/etc.
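
For illustration, the sketch below shows how a scheduler job might issue the REST call described in step 1, using Python’s requests library and the example request body above. The host name, endpoint path, and token handling are placeholders/assumptions; refer to the Triggering Metrics Tests article and your ModelOp Center installation for the actual endpoint and authentication details.

import requests

# Placeholder values -- substitute the details of your ModelOp Center installation.
MOC_BASE_URL = "https://modelop-center.example.com"   # assumed host (placeholder)
AUTH_TOKEN = "FILL-IN-VALID-AUTH-TOKEN"               # obtained from your identity provider
SNAPSHOT_ID = "FILL-IN-SNAPSHOT-GUID"                 # copied from the snapshot URL bar

# Request body from the example above: signal the model life cycle to run the monitor.
payload = {
    "name": "com.modelop.mlc.definitions.Signals_MODEL_BACK_TEST",
    "variables": {"MODEL_ID": {"value": SNAPSHOT_ID}},
}

# The endpoint path below is an assumption for illustration only; use the path
# documented for your ModelOp Center version when configuring the scheduler.
response = requests.post(
    f"{MOC_BASE_URL}/mlc-service/rest/signal",  # hypothetical endpoint path
    json=payload,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print("Monitor triggered:", response.status_code)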

 

Viewing Monitoring Notifications

  1. Typically, the model life cycle that runs the monitor will create notifications, such as:

    1. A monitor has been started

    2. A monitor has run successfully

    3. A monitor’s output (model test) has failed

  2. These notifications can be viewed on the home page of ModelOp Center’s UI.

 

Viewing Monitoring Job Results

  1. All monitor job results are persisted and can be viewed directly by clicking the specific “result” in the “Model Tests” section of the model snapshot page.

Statistical Monitor Details

While the types of statistical metrics used will vary based on the type of model (e.g. regression vs. classification), typically a small set of statistical metrics “monitors” can be created and then simply associated with several models. This association is made during the Model Lifecycle process. A statistical metrics monitor requires labeled data in order to produce the required metrics.

The following is a simple example of how statistical metrics would be calculated for the Consumer Credit Default public model:

 

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, confusion_matrix

#modelop.metrics
def metrics(data):
    metrics = {}
    prep_data = preprocess(data)
    data = pd.concat([data, prep_data], axis=1)
    data.loc[:, 'probabilities'] = prediction(data)
    data.loc[:, 'predictions'] = data.probabilities \
        .apply(lambda x: threshold > x) \
        .astype(int)
    if is_validated(data):
        f1 = f1_score(data.loan_status, data.predictions)
        cm = confusion_matrix(data.loan_status, data.predictions)
        labels = ['Fully Paid', 'Charged Off']
        cm = matrix_to_dicts(cm, labels)
        fpr, tpr, thres = roc_curve(data.loan_status, data.probabilities)
        auc_val = roc_auc_score(data.loan_status, data.probabilities)
        rc = [{'fpr': x[0], 'tpr': x[1]} for x in list(zip(fpr, tpr))]
        metrics['f1_score'] = f1
        metrics['confusion_matrix'] = cm
        metrics['auc'] = auc_val
        metrics['ROC'] = rc
    yield metrics
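
Note that preprocess, prediction, is_validated, and matrix_to_dicts, as well as the threshold value, are referenced but not defined in the excerpt above; they are helper functions and variables defined elsewhere in the example model’s source code.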

This statistical metrics monitor outputs a standard set of metrics including the F1 score, confusion matrix, AUC, and ROC. These are not the only metrics available: a user can add any additional metric, including custom-defined metrics with nested objects, as long as it is defined as a key/value pair. However, note that a few keys are pre-defined in ModelOp Center and will be plotted on the Test Results page when they appear in the output (an illustrative example of such an output follows the list below):

  • “ROC” - the ROC curve will be plotted using the fpr and tpr values

  • “confusion_matrix” - a confusion matrix will be plotted based on the provided values

  • “shap” - a graph of the shap values by feature will be plotted based on the provided values

  • “bias” - a plot of ethical fairness metrics will be generated using the protected entities and related ethical fairness measures
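
As a point of reference, the following is a minimal, illustrative sketch of a test-result dictionary containing some of these pre-defined keys. The numeric values are made up, and the confusion-matrix shape shown is one plausible form of the key/value output produced by the matrix_to_dicts helper above.

# Illustrative only -- the values below are made up to show the expected key/value shapes.
example_test_result = {
    "f1_score": 0.61,
    "auc": 0.72,
    # Plotted as an ROC curve from the fpr/tpr pairs
    "ROC": [
        {"fpr": 0.0, "tpr": 0.0},
        {"fpr": 0.2, "tpr": 0.7},
        {"fpr": 1.0, "tpr": 1.0},
    ],
    # Plotted as a confusion matrix; one dict per actual-label row (assumed shape)
    "confusion_matrix": [
        {"Fully Paid": 850, "Charged Off": 50},
        {"Fully Paid": 40, "Charged Off": 60},
    ],
    # Any additional custom metric can be added as long as it is a key/value pair
    "my_custom_metric": 0.42,
}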

Spark Models

A similar statistical evaluation method may be used for PySpark models with HDFS assets by parsing the HDFS asset URLs from the parameters of the metrics function. The following is a simple example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import isnull, when, count
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassificationModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


# modelop.init
def begin():
    print("Begin function...")
    global SPARK
    SPARK = SparkSession.builder.appName("DriftTest").getOrCreate()
    global MODEL
    MODEL = RandomForestClassificationModel.load("/hadoop/demo/titanic-spark/titanic")


# modelop.metrics
def metrics(external_inputs, external_outputs, external_model_assets):
    # Grab single input asset and single output asset file paths
    input_asset_path = external_inputs[0]["fileUrl"]
    output_asset_path = external_outputs[0]["fileUrl"]

    input_df = SPARK.read.format("csv").option("header", "true").load(input_asset_path)
    predictions = predict(input_df)

    # Select (prediction, true label) and compute test error
    evaluator = MulticlassClassificationEvaluator(
        labelCol="Survived", predictionCol="prediction", metricName="accuracy"
    )
    accuracy = evaluator.evaluate(predictions)

    output_df = SPARK.createDataFrame([{"accuracy": accuracy}])
    print("Metrics output:")
    output_df.show()

    output_df.coalesce(1).write.mode("overwrite").option("header", "true").format(
        "json"
    ).save(output_asset_path)

    SPARK.stop()


def predict(input_df):
    dataset = input_df.select(
        col("Survived").cast("float"),
        col("Pclass").cast("float"),
        col("Sex"),
        col("Age").cast("float"),
        col("Fare").cast("float"),
        col("Embarked"),
    )
    dataset = dataset.replace("?", None).dropna(how="any")
    dataset = (
        StringIndexer(inputCol="Sex", outputCol="Gender", handleInvalid="keep")
        .fit(dataset)
        .transform(dataset)
    )
    dataset = (
        StringIndexer(inputCol="Embarked", outputCol="Boarded", handleInvalid="keep")
        .fit(dataset)
        .transform(dataset)
    )
    dataset = dataset.drop("Sex")
    dataset = dataset.drop("Embarked")
    required_features = ["Pclass", "Age", "Fare", "Gender", "Boarded"]
    assembler = VectorAssembler(inputCols=required_features, outputCol="features")
    transformed_data = assembler.transform(dataset)
    predictions = MODEL.transform(transformed_data)
    return predictions
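
For reference, the external_inputs and external_outputs arguments above are lists of asset descriptors supplied by ModelOp Center; the metrics function in this example relies only on the fileUrl key of each. The snippet below is a minimal, hypothetical illustration of that shape, and the paths shown are placeholders.

# Hypothetical illustration of the asset descriptors passed to metrics();
# only the "fileUrl" key is used by the example above, and these paths are placeholders.
external_inputs = [{"fileUrl": "hdfs:///hadoop/demo/titanic-spark/input_data.csv"}]
external_outputs = [{"fileUrl": "hdfs:///hadoop/demo/titanic-spark/metrics_output"}]
external_model_assets = []

# The monitoring job would then invoke:
# metrics(external_inputs, external_outputs, external_model_assets)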

This model uses a Spark MulticlassClassificationEvaluator to determine the accuracy of the predictions generated by the titanic model.

Next Article: Ethical Fairness Monitoring >