Model Monitoring: Overview

Introduction

ModelOp Center provides operational, quality, risk, and process monitoring throughout the entire life cycle of a model. It uses the concept of an “associated model”: the user associates specific monitors with a model and runs them routinely, either on a schedule or when triggered. ModelOp Center ships with a number of monitors out of the box, which the user can select and use without modification. The user may also write a custom monitoring function, register it as an associated model, and set it to run against their model. This gives the enterprise the flexibility to select the metrics that best fit its business, technical, and risk requirements. Furthermore, these monitors are integrated into model life cycles, so the user can not only observe issues via the monitor but also automatically compare monitor outcomes against model-specific thresholds and take remediation action when there are deviations.

 

The subsequent sections provide an overview of monitor selection as well as how to test monitors within ModelOp Center. Subsequent articles go into detail on enabling statistical monitoring, drift monitoring, and ethical fairness monitoring.

Monitoring Concepts

As background, ModelOp Center treats all “monitors” as models themselves, which allows for reuse and robust governance and auditability around these critical monitors that are ensuring that an enterprise’s decisioning assets are performing optimally and within governance thresholds.

Additionally, ModelOp Center uses decision tables to determine whether a model is running within the desired thresholds. Decision tables are an industry-standard way to define the rules by which a decision should be made. ModelOp chose decision tables for monitoring because, in our experience, several factors weigh into whether a model is actually having an issue, often combining technical, statistical, business, and other metadata to ascertain whether the model is operating out of bounds. ModelOp Center gives data scientists and ModelOps engineers the flexibility to incorporate these varying requirements, which allows more precise monitoring and alerting when a model begins operating out of specification.
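
To make the idea of a decision table concrete, here is a minimal Python sketch of how several metrics can be combined into a single outcome. It is only an illustration of the concept, not the format or engine ModelOp Center uses: the `Rule` class, the `evaluate` function, the metric names, and every threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """One row of a decision table: a condition on the metrics and an outcome."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    outcome: str


def evaluate(rules: List[Rule], metrics: Dict[str, float], default: str = "OK") -> str:
    """Return the outcome of the first rule whose condition holds (first-hit policy)."""
    for rule in rules:
        if rule.condition(metrics):
            return rule.outcome
    return default


# Hypothetical thresholds that combine a statistical and an operational metric.
rules = [
    Rule("low_auc", lambda m: m["auc"] < 0.70, "ALERT"),
    Rule("slow_and_marginal", lambda m: m["auc"] < 0.80 and m["latency_ms"] > 500, "ALERT"),
    Rule("marginal_auc", lambda m: m["auc"] < 0.80, "REVIEW"),
]

print(evaluate(rules, {"auc": 0.82, "latency_ms": 120}))  # OK
print(evaluate(rules, {"auc": 0.75, "latency_ms": 650}))  # ALERT
print(evaluate(rules, {"auc": 0.75, "latency_ms": 120}))  # REVIEW
```

The point of the sketch is that the same AUC value can lead to different outcomes depending on other metadata, which is what a single fixed threshold cannot express.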

Choosing Evaluation Metrics

To test the efficacy of a model, a metric should be chosen during model development and used to benchmark the model. The chosen metric should reflect the underlying business problem. For instance, in a binary classification problem with very unbalanced class frequencies, accuracy is a poor choice of metric. A “model” which always predicts that the more common class will occur will be very accurate, but will not do a good job of predicting the less frequent class.

Take compliance in internal communications as an example. Very few internal communications may be non-compliant, but a model that never flags possible non-compliance is worthless even if it is highly accurate. A better metric in this case is the F1 score or, more generally, an Fβ score with β > 1. The latter weights recall more heavily than precision, so it penalizes false negatives (occurrences where the model fails to detect a non-compliant communication) more than false positives.
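
The following sketch illustrates the point with synthetic data. It assumes scikit-learn is available; the label vectors are made up for illustration and do not come from a real compliance data set.

```python
from sklearn.metrics import accuracy_score, fbeta_score, recall_score

# Synthetic labels: 95 compliant (0) and 5 non-compliant (1) messages.
actuals = [0] * 95 + [1] * 5

# A "model" that never flags anything.
never_flags = [0] * 100

# A model that catches 4 of the 5 non-compliant messages
# at the cost of 6 false alarms on compliant ones.
flags_some = [0] * 89 + [1] * 6 + [1] * 4 + [0] * 1

for name, preds in [("never_flags", never_flags), ("flags_some", flags_some)]:
    acc = accuracy_score(actuals, preds)
    rec = recall_score(actuals, preds, zero_division=0)
    f2 = fbeta_score(actuals, preds, beta=2, zero_division=0)
    print(f"{name}: accuracy={acc:.2f} recall={rec:.2f} F2={f2:.2f}")

# never_flags: accuracy=0.95 recall=0.00 F2=0.00
# flags_some: accuracy=0.93 recall=0.80 F2=0.67
```

Accuracy ranks the useless model first, while F2 ranks it last.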

Similarly, for regression problems, the data scientist should choose a metric that reflects which errors matter. If a few large errors are acceptable as long as most errors are small, mean absolute error (MAE) is appropriate; if no error may exceed a particular threshold, the max error is the right choice. Root mean squared error (RMSE) sits between these cases: it penalizes large errors more heavily than MAE but is less dominated by the single worst error than the max error.
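
As a small worked example of how these three metrics can disagree, the sketch below compares two hypothetical sets of residuals using only the standard library. The numbers are invented for illustration.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def max_error(errors):
    """Largest absolute error."""
    return max(abs(e) for e in errors)


# Residuals (prediction minus actual) for two hypothetical models.
model_a = [0.0] * 9 + [10.0]  # almost always exact, with one large miss
model_b = [2.0] * 10          # always off by a moderate amount

for name, errors in [("model_a", model_a), ("model_b", model_b)]:
    print(f"{name}: MAE={mae(errors):.2f} RMSE={rmse(errors):.2f} max={max_error(errors):.2f}")

# model_a: MAE=1.00 RMSE=3.16 max=10.00
# model_b: MAE=2.00 RMSE=2.00 max=2.00
```

MAE prefers model_a, the max error strongly prefers model_b, and RMSE prefers model_b by a smaller margin.
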

There are metrics for every type of problem: multi-class classification, all varieties of regression, unsupervised clustering, etc. They can range from quite simple to quite intricate, but whatever the problem, a metric should be decided upon early in development and used to test a model as it is promoted to UAT and then into production. Here are some tests it might encounter along the way.

  • The F1 score is a measure of a test's accuracy. It is the harmonic mean of the precision p and the recall r of the test: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all samples that are actually positive.

  • SHAP values (interpretability) are used on a per-record basis to justify why a particular record or client received the score it did. For this reason, SHAP fits better into the action/scoring function than into the Metrics Function.

  • The ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies

  • The AUC (Area Under the ROC Curve)
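
The sketch below computes these quantities for a tiny synthetic example, assuming scikit-learn is available. The labels, scores, and the 0.5 threshold are invented for illustration.

```python
from sklearn.metrics import auc, f1_score, precision_score, recall_score, roc_curve

# Synthetic ground truth and model scores (probability of the positive class).
actuals = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]

# Threshold the scores to obtain hard predictions for precision, recall, and F1.
threshold = 0.5
predictions = [int(s > threshold) for s in scores]

p = precision_score(actuals, predictions)
r = recall_score(actuals, predictions)
f1 = f1_score(actuals, predictions)

# F1 is the harmonic mean of precision and recall.
assert abs(f1 - 2 * p * r / (p + r)) < 1e-12

# The ROC curve is traced from the raw scores, not the thresholded predictions.
fpr, tpr, thresholds = roc_curve(actuals, scores)
roc_auc = auc(fpr, tpr)

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f} AUC={roc_auc:.3f}")
# precision=0.667 recall=0.667 F1=0.667 AUC=0.933
```

Precision, recall, and F1 depend on the chosen threshold, whereas the AUC summarizes performance across all thresholds.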

Note: There can be other items that determine which model to promote to production. For example, the situation may favor a model with better inference speed, interpretability, etc.

Out of the Box Metrics

ModelOp Center ships with multiple out-of-the-box monitors, which are registered as associated models. The user may add one of these monitors to their model or write a custom metrics function (see next section). Here is a sampling of out-of-the-box monitors across four categories:

Operational Performance:

Automatically monitor model operations to ensure that models are running at agreed-upon service levels and delivering decisions at the rate expected. Operational performance monitors include:

  • Model availability and SLA performance

  • Data throughput and latency with inference execution

  • Volume and frequency of input requests for the application

  • Input data adherence to the defined schema for the model

  • Input data records for inferences are within established range

Quality Performance:

Ensure that model decisions and outcomes are within established data quality controls, reducing the risk of unexpected and inaccurate decisions. Quality performance monitors include:

  • Data drift of input data

  • Concept drift of output

  • Statistical effectiveness of model output

Risk Performance:

Controlling risk and ensuring models are constantly operating within established business risk and compliance ranges as well as delivering ethically fair results is a constant challenge. Prevent out-of-compliance issues with automated, continuous risk performance monitoring. Risk performance monitors include:

  • Ethical fairness of model output

  • Interpretability of model feature weighting

Process Performance:

Continuous monitoring of the end-to-end model operations process ensures that all steps are properly executed and adhered to. Collect and retain data for each step in the model life cycle, resulting in reproducibility and auditability. Process performance monitors include:

  • Registration processes

  • Operational processes

  • Monitoring processes

  • Governance processes

Writing a Custom Monitor (Metric Function)

The Metrics Function allows you to define custom metrics that you would like to monitor for your model. The function is included in the source code that is registered as a model in ModelOp Center and then added as an associated model for monitoring. You can use the Metrics Job to manually execute this script against data, or use an MLC Process to trigger automatic execution. See Model Batch Jobs and Tests for more information.

You can specify a Metrics Function either with a # modelop.metrics smart tag comment before the function definition or by selecting it within the Command Center after the model source code is registered. The Metrics Function executes against a batch of records and yields test results as a JSON object of the form {"metric_1": <value_1>, …, "metric_n": <value_n>}. These values are used to populate the Test Results visuals within the UI (as seen at the bottom of this page).

Here is an example of how to code a Metrics Function. It calculates the ROC curve, AUC, F2 score, and confusion matrix.

import gensim
import pandas as pd
import sklearn.metrics

# lasso_model_artifacts, preprocess, pad_sparse_matrix, and matrix_to_dicts
# are defined elsewhere in the model source code.

# modelop.metrics
def metrics(x):
    # Unpack the trained model and its supporting artifacts
    lasso_model = lasso_model_artifacts['lasso_model']
    dictionary = lasso_model_artifacts['dictionary']
    threshold = lasso_model_artifacts['threshold']
    tfidf_model = lasso_model_artifacts['tfidf_model']

    # Ground-truth labels for the batch
    actuals = x.flagged

    # Turn the raw text into padded TF-IDF vectors
    cleaned = preprocess(x.content)
    corpus = cleaned.apply(dictionary.doc2bow)
    corpus_sparse = gensim.matutils.corpus2csc(corpus).transpose()
    corpus_sparse_padded = pad_sparse_matrix(sp_mat=corpus_sparse,
                                             length=corpus_sparse.shape[0],
                                             width=len(dictionary))
    tfidf_vectors = tfidf_model.transform(corpus_sparse_padded)

    # Score the batch and apply the decision threshold
    probabilities = lasso_model.predict_proba(tfidf_vectors)[:, 1]
    predictions = pd.Series(probabilities > threshold, index=x.index).astype(int)

    # Note: roc_curve receives the thresholded predictions, so the curve has a
    # single interior point; pass probabilities instead to trace the full curve.
    confusion_matrix = sklearn.metrics.confusion_matrix(actuals, predictions)
    fpr, tpr, thres = sklearn.metrics.roc_curve(actuals, predictions)
    auc_val = sklearn.metrics.auc(fpr, tpr)
    f2_score = sklearn.metrics.fbeta_score(actuals, predictions, beta=2)

    roc_curve = [{'fpr': f, 'tpr': t} for f, t in zip(fpr, tpr)]
    labels = ['Compliant', 'Non-Compliant']
    cm = matrix_to_dicts(confusion_matrix, labels)

    test_results = dict(
        roc_curve=roc_curve,
        auc=auc_val,
        f2_score=f2_score,
        confusion_matrix=cm
    )

    yield test_results

Here is an example of expected output from this function:

{
    "roc_curve": [
        {"fpr": 0.0, "tpr": 0.0},
        {"fpr": 0.026, "tpr": 0.667},
        {"fpr": 1.0, "tpr": 1.0}
    ],
    "auc": 0.821,
    "f2_score": 0.625,
    "confusion_matrix": [
        {"Compliant": 76, "Non-Compliant": 2},
        {"Compliant": 1, "Non-Compliant": 2}
    ]
}
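
The example function calls a helper named matrix_to_dicts that is not shown in this article. The sketch below gives one plausible implementation, which is an assumption rather than the original helper. It then rebuilds label vectors from the confusion-matrix counts in the sample output and checks that the same scikit-learn calls reproduce the sample values. The rounding to three decimals is also an assumption, made only to match the sample.

```python
import json

import numpy as np
from sklearn.metrics import auc, confusion_matrix, fbeta_score, roc_curve


def matrix_to_dicts(matrix, labels):
    """Hypothetical helper: turn each confusion-matrix row into a {label: count} dict."""
    return [dict(zip(labels, (int(v) for v in row))) for row in matrix]


# Rebuild label vectors from the counts in the sample output
# (rows are actual classes, columns are predicted classes).
actuals = np.array([0] * 78 + [1] * 3)
predictions = np.array([0] * 76 + [1] * 2 + [0] * 1 + [1] * 2)

cm = confusion_matrix(actuals, predictions)
fpr, tpr, _ = roc_curve(actuals, predictions)

results = {
    "roc_curve": [{"fpr": round(float(f), 3), "tpr": round(float(t), 3)}
                  for f, t in zip(fpr, tpr)],
    "auc": round(float(auc(fpr, tpr)), 3),
    "f2_score": round(float(fbeta_score(actuals, predictions, beta=2)), 3),
    "confusion_matrix": matrix_to_dicts(cm, ["Compliant", "Non-Compliant"]),
}

print(json.dumps(results, indent=2))
```

Because the ROC curve here is computed from thresholded predictions, it contains only one point between (0, 0) and (1, 1), which is why the sample output lists three points.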

Running a Monitor (Metrics Job) Manually

Run a Metrics Job Manually from the CLI

  1. To create a ‘metrics job’ from the CLI, use the command:

moc job create testjob <deployable-model-uuid> <input-file-name> <output-file-name> optional-flags

  2. This command yields a UUID for the job.

  3. To find the raw JSON results of the job, use the command:

moc job result <uuid>
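
If you want to consume the results from a script, a minimal Python sketch might look like the following. It assumes that the moc CLI is on the PATH and that moc job result prints only the raw JSON to standard output; both are assumptions, and the function names here are hypothetical.

```python
import json
import subprocess


def result_command(job_uuid):
    """Build the argument list for `moc job result <uuid>`."""
    return ["moc", "job", "result", job_uuid]


def parse_result(raw):
    """Parse the raw JSON text of a job result into a dict."""
    return json.loads(raw)


def fetch_job_result(job_uuid):
    """Run the CLI and return the parsed result.

    Assumes the command writes only JSON to stdout; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    completed = subprocess.run(
        result_command(job_uuid), capture_output=True, text=True, check=True
    )
    return parse_result(completed.stdout)


# Parsing a result shaped like the sample metrics output (no CLI call needed):
sample = '{"auc": 0.821, "f2_score": 0.625}'
metrics = parse_result(sample)
print(metrics["auc"], metrics["f2_score"])  # 0.821 0.625
```

Keeping command construction and parsing in separate functions makes the parsing logic testable without access to a running ModelOp Center instance.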

Run a Metrics Job Manually from the Command Center UI

See Manually Create a Batch Job in the Command Center.

View the Results of a Monitor (Metrics Job)

  1. To see the results of a monitor or test, navigate to Models and select the model whose tests you would like to view.

  2. Select the Snapshot (version) of the model to view the Test Results.

  3. Click on the test results for that version.

  4. The following example measures consumer credit classification including the ROC.

 

Alerting & Notifications

Alerts, Tasks, and Notifications provide visibility into information and actions that need to be taken as a result of model monitoring. These “messages” are surfaced through the Command Center Dashboard and are typically also tied into enterprise ticketing systems such as ServiceNow and/or JIRA.

The types of messages generated from Model Monitoring include:

  • Alerts - test failures, model errors, runtime issues, and other situations that require a response.

    • Alerts are automatically raised by system monitors or as the output of monitor comparison in a model life cycle.

  • Tasks - user tasks such as approving a model, acknowledging a failed test, etc.

    • For details about viewing and responding to test failures.

  • Notifications - system status, runtime status and errors, model errors, and other information generated automatically by ModelOp Center.

 
