Monitoring & Reporting - Key Concepts
This article provides an overview of ModelOp Center’s Model Monitoring approach, including the use of various metrics to enable comprehensive monitoring throughout the life cycle of a model.
Introduction
ModelOp Center provides comprehensive quality, risk, and process monitoring throughout the entire life cycle of a model. ModelOp Center uses the concept of an “associated model” that allows the user to “associate” specific tests/monitors with a model implementation and run these monitors routinely, either on a schedule or when triggered. Monitors are associated models that can be tied to one or more business models (model implementations).
ModelOp Center ships with a number of out-of-the-box monitors, which the user can select and use without modification. Additionally, users may write their own custom monitoring function, which can be registered as an associated model and set to run for the user’s model. ModelOp also provides a monitoring SDK in the form of a Python package to assist in writing custom monitoring functions or supplementing the out-of-the-box monitors. This gives the enterprise the flexibility to select the metrics that best fit its unique requirements from a business, technical, and risk perspective. Furthermore, these monitors are integrated into model life cycles, allowing the user not only to observe issues via the monitor, but also to automatically compare the monitor outcomes against model-specific thresholds and take remediation action if there are deviations.
The subsequent sections provide an overview of the key concepts of how ModelOp provides comprehensive testing and monitoring.
Monitoring Concepts
As background, ModelOp Center treats all “monitors” as models themselves, which allows for reuse and robust governance and auditability around these critical monitors that are ensuring that an enterprise’s decisioning assets are performing optimally and within governance thresholds.
Additionally, ModelOp Center uses decision tables to determine if a model is running within the desired thresholds. Decision tables are an industry standard approach to allow for defining various rules by which a decision should be made. ModelOp specifically chose to incorporate decision tables for monitoring as our experience has shown that there are a number of factors that weigh into whether a model is actually having an issue, often combining technical, statistical, business, and other metadata to ascertain if the model is operating out of bounds. ModelOp Center provides data scientists and ModelOps engineers the flexibility to incorporate these varying requirements to provide more precise monitoring and alerting when a model begins operating out of specification.
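The idea of combining statistical and business metadata in a decision table can be illustrated with a minimal sketch. This is not ModelOp Center's decision table format or API; the rule names, metric keys (`auc`, `drift_p_value`, `daily_records`), thresholds, and outcomes below are all hypothetical and chosen only to show how several factors can feed one decision.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Rule:
    """One row of a (hypothetical) decision table."""
    name: str
    condition: Callable[[Dict[str, float]], bool]
    outcome: str


# Hypothetical rules: thresholds and metric names are illustrative assumptions.
RULES: List[Rule] = [
    Rule("auc_below_floor", lambda m: m["auc"] < 0.70, "ALERT"),
    Rule(
        "drift_with_volume",
        lambda m: m["drift_p_value"] < 0.05 and m["daily_records"] >= 1000,
        "ALERT",
    ),
    Rule(
        "drift_low_volume",
        lambda m: m["drift_p_value"] < 0.05 and m["daily_records"] < 1000,
        "REVIEW",
    ),
]


def evaluate(
    metrics: Dict[str, float], rules: List[Rule] = RULES
) -> Tuple[str, Optional[str]]:
    """Return (outcome, rule name) for the first matching rule, else ("PASS", None)."""
    for rule in rules:
        if rule.condition(metrics):
            return rule.outcome, rule.name
    return "PASS", None


if __name__ == "__main__":
    monitor_output = {"auc": 0.82, "drift_p_value": 0.01, "daily_records": 200}
    print(evaluate(monitor_output))
```

Here a statistically significant drift result only raises a full alert when enough records back it up; with low volume the same result is routed to review instead. The first-match policy is one common way to resolve overlapping rules, not the only one.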
Selecting Evaluation Metrics
To test the efficacy of a model, a metric should be chosen during model development and used to benchmark the model. The chosen metric should reflect the underlying business problem. For instance, in a binary classification problem with very unbalanced class frequencies, accuracy is a poor choice of metric. A “model” which always predicts that the more common class will occur will be very accurate, but will not do a good job of predicting the less frequent class.
Take compliance in internal communications as an example. Very few internal communications may be non-compliant, but a model which never flags possible non-compliance is worthless even if it is highly accurate. A better metric, in this case, is the F1 score or, more generally, an Fβ score with β > 1. The latter weights recall more heavily than precision, so it penalizes false negatives (occurrences where the model fails to detect non-compliant communication) more than false positives.
Similarly, for regression problems, the data scientist should choose a metric based on which errors matter most. If a few large errors are acceptable as long as most errors are small, mean absolute error (MAE) is appropriate; if no error may exceed a particular threshold, the max error is appropriate. A metric like root mean squared error (RMSE) sits between these cases: it penalizes large errors more heavily than MAE does, without being determined solely by the single worst error.
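The gap between accuracy and Fβ on imbalanced data can be made concrete with a small, dependency-free sketch. The data is synthetic and assumed for illustration: 1,000 communications of which 10 are non-compliant (label 1), one model that never flags anything, and one that catches 8 of the 10 at the cost of 20 false alarms.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def fbeta(y_true, y_pred, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Synthetic data (assumed): 10 non-compliant messages among 1,000.
y_true = [1] * 10 + [0] * 990
never_flags = [0] * 1000
flagger = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

for name, y_pred in [("never flags", never_flags), ("flagger", flagger)]:
    print(
        name,
        "accuracy=%.3f" % accuracy(y_true, y_pred),
        "F1=%.3f" % fbeta(y_true, y_pred, beta=1.0),
        "F2=%.3f" % fbeta(y_true, y_pred, beta=2.0),
    )
```

The never-flag model reaches 0.99 accuracy but scores 0 on both F1 and F2, while the flagging model has slightly lower accuracy and a clearly higher F2, which is the behavior the compliance use case actually needs.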
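A short sketch shows how these three regression metrics can rank the same two models differently. The predictions are synthetic and assumed for illustration: model A is exact on nine records and off by 10 on one, while model B is off by 2 on every record.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def max_error(y_true, y_pred):
    """Largest absolute error on any single record."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


# Synthetic data (assumed): ten records whose true value is 0.
y_true = [0.0] * 10
model_a = [0.0] * 9 + [10.0]  # mostly exact, one large miss
model_b = [2.0] * 10          # consistently off by 2

for name, y_pred in [("A", model_a), ("B", model_b)]:
    print(
        name,
        "MAE=%.2f" % mae(y_true, y_pred),
        "RMSE=%.2f" % rmse(y_true, y_pred),
        "max=%.2f" % max_error(y_true, y_pred),
    )
```

MAE prefers model A (1.0 versus 2.0), max error prefers model B (2 versus 10), and RMSE, which squares errors before averaging, also prefers model B here (about 3.16 for A versus 2.0 for B).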
There are metrics for every type of problem: multi-class classification, all varieties of regression, unsupervised clustering, etc. They can range from quite simple to quite intricate, but whatever the problem, a metric should be decided upon early in development and used to test a model as it is promoted to UAT and then into production. Here are some tests it might encounter along the way.
The F1 score is a measure of a test's accuracy. It combines the precision p and the recall r of the test as their harmonic mean: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all samples that are actually positive.
SHAP values (interpretability) are used on a per-record basis to explain why a particular record or client received the score it did. For this reason, SHAP fits more naturally in the action/scoring function than in the Metrics Function.
The ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies.
The AUC (Area Under the ROC Curve)
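As an illustration of the last two items, the following dependency-free sketch builds ROC points from model scores and integrates them with the trapezoid rule to obtain the AUC. The labels and scores are made up for the example, and in practice a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    The threshold is swept from the highest score to the lowest; records that
    share a score are consumed together so ties yield a single point.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC needs at least one positive and one negative label")

    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up labels and scores for illustration.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(curve)
print("AUC = %.3f" % auc(curve))
```

For this data the AUC is 8/9 (about 0.889), which matches its pairwise interpretation: in 8 of the 9 positive/negative pairs, the positive record receives the higher score.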
Note: There can be other items that determine which model to promote to production. For example, the situation may favor a model with better inference speed, interpretability, etc.
Preparing your Model for Tests/Monitors
While input assets vary based on the monitor, typically there are a few key items that need to be configured for your Model in order to run tests/monitors within ModelOp Center:
Model-Specific Data Sets: typically training/baseline data, test data, holdout data, and/or production data sets. These are the specific data sets for the Model Implementation that are fed into the test/monitor to calculate the given metrics. See https://modelopdocs.atlassian.net/wiki/spaces/dv33/pages/1978435029 for more details on adding data sets to your Model Implementation. In particular with data sets, it is recommended to include a timestamp so that metrics can be calculated for a given period (e.g. day), thus allowing for analyzing trends over time.
Schema: the schema defines the specific inputs (features) and outputs for a given Model Implementation, and the role/type/attributes for each field in the schema. See https://modelopdocs.atlassian.net/wiki/spaces/dv33/pages/1978434718 for more details on adding schemas to your Model Implementation.
Once these items are configured in your model implementation, it is easy to set up and run most tests/monitors within ModelOp Center.
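The recommendation to include a timestamp in data sets can be illustrated with a minimal sketch that computes a metric per day so trends become visible. The record layout is an assumption made for the example: the field names `timestamp` (ISO 8601 string), `label`, and `prediction` are hypothetical and not a ModelOp Center requirement.

```python
from collections import defaultdict
from datetime import datetime


def accuracy_by_day(records):
    """Group records by calendar day and return {ISO date: accuracy}.

    Each record is a dict with hypothetical keys 'timestamp' (ISO 8601),
    'label', and 'prediction'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        day = datetime.fromisoformat(record["timestamp"]).date()
        totals[day] += 1
        hits[day] += int(record["label"] == record["prediction"])
    return {day.isoformat(): hits[day] / totals[day] for day in sorted(totals)}


if __name__ == "__main__":
    # Made-up production records for illustration.
    records = [
        {"timestamp": "2024-03-01T09:15:00", "label": 1, "prediction": 1},
        {"timestamp": "2024-03-01T17:40:00", "label": 0, "prediction": 1},
        {"timestamp": "2024-03-02T08:05:00", "label": 0, "prediction": 0},
    ]
    print(accuracy_by_day(records))
```

The same grouping works for any metric: swap the per-day accuracy for the metric chosen during model development, and a sustained decline across days becomes straightforward to detect.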
Alerting & Notifications
Alerts, Tasks, and Notifications are messages that provide visibility into information and actions that need to be taken as a result of model monitoring. These messages are surfaced throughout the ModelOp Command Center UI, and are typically also tied into enterprise ticketing systems such as ServiceNow and/or JIRA.
The types of messages generated from Model Monitoring include:
Alerts - test failures, model errors, runtime issues, and other situations that require a response.
Alerts are automatically raised by system monitors or as the output of monitor comparison in a model life cycle.
Tasks - user tasks such as approving a model, acknowledging a failed test, etc.
For details about viewing and responding to test failures.
Notifications - system status, runtime status and errors, model errors, and other information generated automatically by ModelOp Center.
Next Article: https://modelopdocs.atlassian.net/wiki/spaces/dv33/pages/2051899394 >