Model Monitoring: Overview
This article provides an overview of ModelOp Center’s Model Monitoring approach, including the use of various metrics to enable comprehensive monitoring throughout the life cycle of a model.
Introduction
ModelOp Center provides comprehensive operational, quality, risk, and process monitoring throughout the entire life cycle of a model. ModelOp Center uses the concept of an “associated model” that allows the user to “associate” specific monitors with a model and run these monitors routinely, on either a scheduled or a triggered basis. Monitors are associated models that can be tied to one or more business models, or “base models”.
ModelOp Center ships with a number of monitors out of the box, which the user can select and use without modification. Additionally, the user may write a custom monitoring function, which can be registered as an associated model and set to run for the user’s model. ModelOp also provides a monitoring SDK in the form of a Python package to assist in writing custom monitoring functions or supplementing the out-of-the-box monitors. This gives the enterprise the flexibility to select the metrics that best fit its unique requirements from a business, technical, and risk perspective. Furthermore, these monitors are integrated into model life cycles, allowing the user not only to observe issues via the monitor, but also to automatically compare the monitor outcomes against model-specific thresholds and take remediation action if there are deviations.
The subsequent sections provide an overview of monitor selection as well as how to test monitors within ModelOp Center. Subsequent articles go into detail on enabling statistical monitoring, drift monitoring, and ethical fairness monitoring.
Monitoring Concepts
As background, ModelOp Center treats all “monitors” as models themselves, which allows for reuse and robust governance and auditability around these critical monitors that are ensuring that an enterprise’s decisioning assets are performing optimally and within governance thresholds.
Additionally, ModelOp Center uses decision tables to determine if a model is running within the desired thresholds. Decision tables are an industry standard approach to allow for defining various rules by which a decision should be made. ModelOp specifically chose to incorporate decision tables for monitoring as our experience has shown that there are a number of factors that weigh into whether a model is actually having an issue, often combining technical, statistical, business, and other metadata to ascertain if the model is operating out of bounds. ModelOp Center provides data scientists and ModelOps engineers the flexibility to incorporate these varying requirements to provide more precise monitoring and alerting when a model begins operating out of specification.
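To make the decision table idea concrete, the sketch below evaluates a small rule table against a monitor's output combined with model metadata. It only illustrates the concept: the rule layout, the metric names (auc, days_since_retrain), and the thresholds are assumptions made for this example, not ModelOp Center's actual decision table format.

```python
import operator

# Hypothetical rule table: every condition in a row must hold for the row's
# outcome to apply. Names and thresholds are illustrative only.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

RULES = [
    {"when": [("auc", "<", 0.70)],
     "outcome": "ALERT: statistical performance below threshold"},
    {"when": [("auc", "<", 0.80), ("days_since_retrain", ">", 90)],
     "outcome": "ALERT: degraded and stale model"},
]


def evaluate(rules, facts):
    """Return the outcome of every rule whose conditions all match."""
    outcomes = []
    for rule in rules:
        if all(OPS[op](facts[name], limit) for name, op, limit in rule["when"]):
            outcomes.append(rule["outcome"])
    return outcomes


# Monitor output merged with model metadata (made-up values).
facts = {"auc": 0.76, "days_since_retrain": 120}
print(evaluate(RULES, facts))
```

Here the AUC alone would not raise an alert, but the second rule fires because a moderately degraded AUC coincides with a model that has not been retrained recently. Combining statistical and business conditions in a single row is the kind of judgment a single fixed threshold cannot express.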
Choosing Evaluation Metrics
To test the efficacy of a model, a metric should be chosen during model development and used to benchmark the model. The chosen metric should reflect the underlying business problem. For instance, in a binary classification problem with very unbalanced class frequencies, accuracy is a poor choice of metric. A “model” which always predicts that the more common class will occur will be very accurate, but will not do a good job of predicting the less frequent class.
Take compliance in internal communications as an example. Very few internal communications may be non-compliant, but a model which never flags possible non-compliance is worthless even if it is highly accurate. A better metric, in this case, is an F1 score or, more generally, an Fβ score with β > 1. The latter weights recall more heavily than precision, so it rewards the model more for true positives and punishes it more for false negatives, that is, occurrences where the model fails to detect non-compliant communication.
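The contrast between accuracy and an Fβ score can be checked with a few lines of arithmetic. The sketch below uses made-up counts (1,000 communications, 10 of them non-compliant) and plain Python rather than any particular library; the two "models" are hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from confusion counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Made-up data: non-compliant is the positive class.
# "Model" A never flags anything.
never_flags = dict(tp=0, fp=0, fn=10, tn=990)
# Model B catches 8 of the 10 non-compliant messages at the cost of 30 false alarms.
flags_some = dict(tp=8, fp=30, fn=2, tn=960)

for name, c in [("never flags", never_flags), ("flags some", flags_some)]:
    acc = accuracy(**c)
    f2 = f_beta(c["tp"], c["fp"], c["fn"], beta=2)
    print(f"{name}: accuracy={acc:.3f}, F2={f2:.3f}")
```

The model that never flags anything has the higher accuracy (0.990 versus 0.968) but an F2 score of 0, while the model that actually detects non-compliance scores roughly 0.513.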
Similarly, for regression problems, the data scientist should choose a metric based on which errors matter. If a few large errors are acceptable as long as most errors are small, mean absolute error (MAE) is appropriate; if no error should exceed a particular threshold, the max error is the better choice. A metric like root mean squared error (RMSE) interpolates between these cases.
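A small numeric comparison shows how these three regression metrics react to the same errors. The residual values below are invented for illustration, and the functions are plain-Python sketches rather than a library API.

```python
import math


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def max_error(errors):
    """Largest absolute error."""
    return max(abs(e) for e in errors)


# Two made-up residual profiles with the same mean absolute error.
mostly_small = [0.0, 0.0, 0.0, 8.0]  # three exact predictions, one large miss
uniform = [2.0, 2.0, 2.0, 2.0]       # every prediction off by the same amount

for name, errs in [("mostly small", mostly_small), ("uniform", uniform)]:
    print(f"{name}: MAE={mae(errs):.2f}, RMSE={rmse(errs):.2f}, max={max_error(errs):.2f}")
```

MAE cannot tell the two profiles apart (2.0 for both), the max error reacts only to the single large miss (8.0 versus 2.0), and RMSE lands in between (4.0 versus 2.0).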
There are metrics for every type of problem: multi-class classification, all varieties of regression, unsupervised clustering, etc. They can range from quite simple to quite intricate, but whatever the problem, a metric should be decided upon early in development and used to test a model as it is promoted to UAT and then into production. Here are some tests it might encounter along the way.
The F1 score, a measure of a test's accuracy. It combines the precision p and the recall r of the test as their harmonic mean: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
SHAP values (interpretability) are used on a per-record basis to justify why a particular record or client received the score it did. This makes SHAP a better fit for the action/scoring function than for the Metrics Function.
The ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies.
The AUC (area under the ROC curve), which summarizes the ROC curve as a single number.
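As a rough sketch of how the ROC curve and AUC relate, the plain-Python functions below sweep the threshold over every distinct score and integrate the resulting curve with the trapezoidal rule. The function names and the sample data are illustrative assumptions; in practice a library such as scikit-learn would normally be used.

```python
def roc_points(actuals, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold score by score.

    Assumes actuals contains both classes (0 and 1).
    """
    positives = sum(actuals)
    negatives = len(actuals) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actuals, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actuals, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels and model scores.
actuals = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
curve = roc_points(actuals, scores)
print(curve)
print(auc(curve))
```

For this sample the curve passes through (0, 0.5), (0.5, 0.5), and (0.5, 1.0) on its way from (0, 0) to (1, 1), giving an AUC of 0.75.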
Note: There can be other items that determine which model to promote to production. For example, the situation may favor a model with better inference speed, interpretability, etc.
Out of the Box Metrics
ModelOp Center ships with multiple out-of-the-box monitors, which are registered as associated models. The user may add one of these associated monitors to a model or write a custom metric function (see next section). These monitors can also be customized via the ModelOp monitoring Python package. See here for documentation on the monitoring package. Here is a sampling of out-of-the-box monitors across four categories:
Operational Performance:
Automatically monitor model operations to ensure that models are running at agreed-upon service levels and delivering decisions at the rate expected. Operational performance monitors include:
Model availability and SLA performance
Data throughput and latency with inference execution
Volume and frequency of input requests for the application
Input data adherence to the defined schema for the model
Input data records for inferences are within the established range
Quality Performance:
Ensure that model decisions and outcomes are within established data quality controls, eliminating the risk of unexpected and inaccurate decisions. Quality performance monitors include:
Data drift of input data
Concept drift of output
Statistical effectiveness of model output
Risk Performance:
Controlling risk and ensuring models are constantly operating within established business risk and compliance ranges as well as delivering ethically fair results is a constant challenge. Prevent out-of-compliance issues with automated, continuous risk performance monitoring. Risk performance monitors include:
Ethical fairness of model output
Interpretability of model features weighting
Process Performance:
Continuous monitoring of the end-to-end model operations process ensures that all steps are properly executed and adhered to. Collect and retain data for each step in the model life cycle, resulting in reproducibility and auditability. Process performance monitors include:
Registration processes
Operational processes
Monitoring processes
Governance processes
Writing a Custom Monitor (Metric Function)
The Metrics Function allows you to define custom metrics that you would like to monitor for your model. This metrics function would be included in the source code that is registered as a model in ModelOp Center and then added as an associated model for monitoring. You can use the Metrics Job to manually execute this script against data, or use an MLC Process to trigger automatic execution. See Model Batch Jobs and Tests for more information.
You can specify a Metrics Function either with a # modelop.metrics smart tag comment before the function definition or you can select it within the Command Center after the model source code is registered. The Metrics Function executes against a batch of records and returns test results as a JSON object of the form {"metric_1": <value_1>, …, "metric_n": <value_n>}. These values are used to populate the Test Results visuals within the UI (as seen at the bottom of this page).
Here is an example of how to code a Metrics Function. It calculates the ROC curve, AUC, F2 score, and confusion matrix.
import gensim
import pandas as pd
import sklearn.metrics

# lasso_model_artifacts, preprocess, pad_sparse_matrix, and matrix_to_dicts
# are assumed to be defined elsewhere in the model source code.

# modelop.metrics
def metrics(x):
    # Unpack the trained artifacts
    lasso_model = lasso_model_artifacts['lasso_model']
    dictionary = lasso_model_artifacts['dictionary']
    threshold = lasso_model_artifacts['threshold']
    tfidf_model = lasso_model_artifacts['tfidf_model']

    # Ground-truth labels for the batch
    actuals = x.flagged

    # Turn the raw text into padded TF-IDF vectors
    cleaned = preprocess(x.content)
    corpus = cleaned.apply(dictionary.doc2bow)
    corpus_sparse = gensim.matutils.corpus2csc(corpus).transpose()
    corpus_sparse_padded = pad_sparse_matrix(sp_mat=corpus_sparse,
                                             length=corpus_sparse.shape[0],
                                             width=len(dictionary))
    tfidf_vectors = tfidf_model.transform(corpus_sparse_padded)

    # Score the batch and apply the decision threshold
    probabilities = lasso_model.predict_proba(tfidf_vectors)[:, 1]
    predictions = pd.Series(probabilities > threshold, index=x.index).astype(int)

    # The ROC curve is computed from the thresholded predictions, so it has
    # a single operating point between (0, 0) and (1, 1)
    confusion_matrix = sklearn.metrics.confusion_matrix(actuals, predictions)
    fpr, tpr, _ = sklearn.metrics.roc_curve(actuals, predictions)
    auc_val = sklearn.metrics.auc(fpr, tpr)
    f2_score = sklearn.metrics.fbeta_score(actuals, predictions, beta=2)

    # Convert the results into JSON-friendly structures
    roc_curve = [{'fpr': f, 'tpr': t} for f, t in zip(fpr, tpr)]
    labels = ['Compliant', 'Non-Compliant']
    cm = matrix_to_dicts(confusion_matrix, labels)

    test_results = dict(
        roc_curve=roc_curve,
        auc=auc_val,
        f2_score=f2_score,
        confusion_matrix=cm
    )
    return test_results
Here is an example of expected output from this function:
{
    "roc_curve": [
        {"fpr": 0.0, "tpr": 0.0},
        {"fpr": 0.026, "tpr": 0.667},
        {"fpr": 1.0, "tpr": 1.0}
    ],
    "auc": 0.821,
    "f2_score": 0.625,
    "confusion_matrix": [
        {"Compliant": 76, "Non-Compliant": 2},
        {"Compliant": 1, "Non-Compliant": 2}
    ]
}
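The example Metrics Function calls a matrix_to_dicts helper that this article does not show. The sketch below is one possible implementation that is consistent with the expected output; it is an assumption, not the original helper. It also recomputes the other figures from the confusion matrix, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
def matrix_to_dicts(matrix, labels):
    """Turn each matrix row into a {label: count} dict, one dict per row."""
    return [{label: int(value) for label, value in zip(labels, row)}
            for row in matrix]


labels = ["Compliant", "Non-Compliant"]
# Rows are actual classes, columns are predicted classes.
matrix = [[76, 2], [1, 2]]
cm = matrix_to_dicts(matrix, labels)

(tn, fp), (fn, tp) = matrix
fpr = fp / (fp + tn)          # false positive rate
tpr = tp / (tp + fn)          # true positive rate (recall)
precision = tp / (tp + fp)
f2 = 5 * precision * tpr / (4 * precision + tpr)
# Trapezoidal area under the three-point curve (0,0) -> (fpr,tpr) -> (1,1).
auc = fpr * tpr / 2 + (1 - fpr) * (tpr + 1) / 2

print(cm)
print(round(fpr, 3), round(tpr, 3), round(auc, 3), round(f2, 3))
```

Rounded to three decimals, these values match the fpr, tpr, auc, and f2_score entries in the expected output, which is a quick way to sanity-check a monitor's results against its confusion matrix.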
Adding a Monitor
Generate a Schema
Out of the box monitoring models use extended Avro schemas attached to business models to determine the characteristics of input data for monitoring runs, such as identifier columns, weight columns, score columns, and others. Schemas can be generated via the ModelOp Center UI or they can be imported with the business model via GitHub or Bitbucket.
To generate a schema, go to the Business Model and click the “Schemas” tab, then click the “Generate Extended Schema” button. A UI will open to aid with generating a schema:
To infer an extended schema from a dataset, a user can either upload a JSON file with some sample records or paste sample records into the top text box. A generated schema will appear in the preview on the bottom text box after clicking “Generate Schema”:
This schema can then be downloaded and uploaded to the git repository that backs the business model, and ModelOp Center will track the schema.
Note: for certain models that may not be backed by git, it is possible to save the generated schema to the business model via the “Save as Input Schema” or “Save as Output Schema” options. However, it is strongly recommended to keep schema files in source code control.
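To give a feel for what inferring a schema from sample records involves, the sketch below derives a plain Avro record schema from a few sample dictionaries. It is a simplified illustration under stated assumptions: the function name, the record name, and the sample fields are hypothetical, it handles only flat records with primitive values, and it does not produce the extended monitoring attributes that the ModelOp Center generator adds.

```python
import json

# Mapping from Python types to Avro primitive type names.
AVRO_TYPES = {bool: "boolean", int: "long", float: "double",
              str: "string", type(None): "null"}


def infer_schema(records, name="input_record"):
    """Infer a flat Avro record schema from a list of sample dicts."""
    seen = {}  # field name -> Avro type names in first-seen order
    for record in records:
        for field, value in record.items():
            avro_type = AVRO_TYPES[type(value)]
            types = seen.setdefault(field, [])
            if avro_type not in types:
                types.append(avro_type)
    fields = []
    for field, types in seen.items():
        if len(types) == 1:
            fields.append({"name": field, "type": types[0]})
        else:
            # Several observed types become a union, with "null" listed first.
            union = sorted(types, key=lambda t: t != "null")
            fields.append({"name": field, "type": union})
    return {"type": "record", "name": name, "fields": fields}


# Made-up sample records.
samples = [
    {"id": 1, "content": "quarterly report attached", "flagged": 0},
    {"id": 2, "content": None, "flagged": 1},
]
schema = infer_schema(samples)
print(json.dumps(schema, indent=2))
```

A field that is sometimes missing a value, such as content here, ends up as a union of null and string, which is why a handful of representative sample records usually gives a better schema than a single one.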
Running a Monitor Manually
Run a Monitor from ModelOp Center UI
After adding a monitor to a business model’s snapshot (see the “Adding a Monitor” section above), the Play button next to the monitor can be clicked to run an ad hoc monitoring job:
Upon successful initiation of the Monitoring job, the user will be directed to the specific Monitoring Job’s job details page, where the user can see the actual monitor execution and results:
Run a Metrics Job Manually from the CLI
To create a ‘metrics job’ from the CLI, use the command
moc job create testjob <deployable-model-uuid> <input-file-name> <output-file-name> optional-flags
This command yields a UUID for the job.
To find the raw JSON results of the job, use the command
moc job result <uuid>
Run a Metrics Job from the ModelOp Center UI
See Manually Create a Batch Job in the Command Center.
Viewing the Results of a Monitoring Job
To see the results of a monitor or test, navigate to the Model Snapshot page and select the Monitoring tab:
Individual Monitor Test Results
1. Click on the individual test result of interest:
2. The test result details are displayed:
ModelOp Center supports a variety of visualizations for data science metrics for out of the box monitors. These visualizations can be added to custom monitors by following the standard metrics format as outlined in this documentation.
Monitoring Results over Time
1. Click on the “Results over Time” button:
2. All of the monitoring test results will be plotted over time (assuming the metric can be plotted).
Alerting & Notifications
Alerts, Tasks, and Notifications provide visibility into information and actions that need to be taken as a result of model monitoring. These “messages” are surfaced throughout the ModelOp Command Center UI, and are typically also tied into enterprise ticketing systems such as ServiceNow and/or JIRA.
The types of messages generated from Model Monitoring include:
Alerts - test failures, model errors, runtime issues, and other situations that require a response.
Alerts are automatically raised by system monitors or as the output of monitor comparison in a model life cycle.
Tasks - user tasks such as approving a model, acknowledging a failed test, etc.
For details about viewing and responding to test failures.
Notifications - includes system status, runtime status and errors, model errors, and other information generated by ModelOp Center automatically.