Introduction
Testing a model's efficacy and its operational behavior is essential to ensure the quality of the predictions the model provides to the business. When new production data has statistical properties that violate the assumptions made when the model was originally built, the model's predictive performance can degrade. This is known as 'model drift', and it can result in lost money and lost competitive advantage.
In ModelOp Center, a production implementation typically includes an MLC Process that runs periodic statistical tests against new labeled data and triggers alerts when thresholds are breached. A data scientist can also determine the current performance of the model by manually running ad hoc Batch Jobs using the CLI or the Command Center, or by running a Champion/Challenger Model Comparison. All of these tests are persisted with the specific version of the model for auditability.

ModelOp Center provides comprehensive operational, quality, risk, and process monitoring throughout the entire life cycle of a model. ModelOp Center uses the concept of an "associated model," which allows the user to "associate" specific monitors with a model and run those monitors routinely, on either a scheduled or a triggered basis. ModelOp Center ships with a number of monitors out of the box, which the user can select and use without modification. Additionally, the user may write a custom monitoring function, register it as an associated model, and set it to run against the user's model. This gives the enterprise the flexibility to select the metrics that best fit its unique business, technical, and risk requirements. Furthermore, these monitors are integrated into model life cycles, allowing the user not only to observe issues via the monitor, but also to automatically compare the monitor outcomes against model-specific thresholds and take remediation action if there are deviations.
The sections below provide an overview of monitor selection as well as how to test monitors within ModelOp Center. Subsequent articles go into detail on enabling statistical monitoring, drift monitoring, and ethical fairness monitoring.
Choosing Evaluation Metrics
To test the efficacy of a model, a metric should be chosen during model development and used to benchmark the model. The chosen metric should reflect the underlying business problem. For instance, in a binary classification problem with very unbalanced class frequencies, accuracy is a poor choice of metric. A “model” which always predicts that the more common class will occur will be very accurate, but will not do a good job of predicting the less frequent class.
Take compliance in internal communications as an example. Very few internal communications may be non-compliant, but a model that never flags possible non-compliance is worthless even if it is highly accurate. A better metric in this case is the F1 score or, more generally, an Fβ score with β > 1. The latter rewards the model more for true positives and punishes it for false negatives, i.e., occurrences where the model fails to detect non-compliant communication.
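For illustration, here is a minimal sketch (synthetic data and scikit-learn metrics, not part of ModelOp Center) of how accuracy can mislead on an imbalanced problem while F1 and F2 expose the failure:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, fbeta_score

# 3% of communications are non-compliant; the "model" never flags anything
actuals = np.array([0] * 97 + [1] * 3)
predictions = np.zeros(100, dtype=int)

print(accuracy_score(actuals, predictions))       # 0.97 -- looks great
print(f1_score(actuals, predictions))             # 0.0  -- no true positives
print(fbeta_score(actuals, predictions, beta=2))  # 0.0  -- heavily punishes missed positives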
Similarly, for regression problems, the data scientist should choose a metric based on the business need: if a few large errors are acceptable as long as most errors are small, mean absolute error (MAE) is appropriate; if no error should exceed a particular threshold, the max error should be used. A metric like root mean squared error (RMSE) interpolates between these cases.
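As a rough illustration, the following sketch (synthetic numbers, scikit-learn metrics) shows how MAE, RMSE, and max error weight the same set of errors differently:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, max_error

y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([10.5, 11.0, 15.0, 28.0])   # one large error of 8.0

print(mean_absolute_error(y_true, y_pred))          # 2.375 -- all errors weighted equally
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # ~4.04 -- large errors penalized more
print(max_error(y_true, y_pred))                    # 8.0   -- only the worst case matters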
There are metrics for every type of problem: multi-class classification, all varieties of regression, unsupervised clustering, and so on. They range from quite simple to quite intricate, but whatever the problem, a metric should be decided upon early in development and used to test the model as it is promoted to UAT and then into production. Here are some metrics and tests it might encounter along the way:
The F1 score is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples. The F1 score is the harmonic mean of the two: F1 = 2pr / (p + r).
SHAP values (interpretability) are used on a per-record basis to justify why a particular record or client received the score it did. This makes SHAP fit into the action/scoring function more than into the Metrics Function (see the sketch after this list).
The ROC curve, which plots the true positive rate against the false positive rate across classification thresholds
The AUC (Area Under the ROC Curve)
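For readers unfamiliar with SHAP, here is a hypothetical sketch of how per-record SHAP values might be attached to scores inside an action/scoring function. The model, the pandas DataFrame input, and the function name are assumptions for illustration, not part of ModelOp Center's API:

import shap

def score_with_explanations(model, X):
    # Build an explainer; shap.Explainer selects a suitable algorithm for the model
    explainer = shap.Explainer(model, X)
    explanation = explainer(X)
    predictions = model.predict(X)
    # Pair each prediction with that record's per-feature SHAP contributions
    return [
        {"prediction": float(pred), "shap_values": dict(zip(X.columns, record_vals))}
        for pred, record_vals in zip(predictions, explanation.values)
    ]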
Note: Other considerations can also determine which model to promote to production. For example, the situation may favor a model with better inference speed, interpretability, etc.
Out of the Box Metrics
ModelOp Center ships with multiple out-of-the-box monitors, which are registered as associated models. The user may add one of these associated monitors to their model or write a custom metrics function (see the next section). Here is a sampling of out-of-the-box monitors across four categories:
Operational Performance:
Automatically monitor model operations to ensure that models are running at agreed upon service levels and delivering decisions at the rate expected. Operational performance monitors include:
Model availability and SLA performance
Data throughput and latency with inference execution
Volume and frequency of input requests for the application
Input data adherence to the defined schema for the model
Input data records for inferences are within established ranges
Quality Performance:
Ensure that model decisioning and outcomes are within established data quality controls, eliminating the risk of unexpected and inaccurate decisions. Quality performance monitors include:
Data drift of input data (a minimal drift-check sketch follows this list)
Concept drift of output
Statistical effectiveness of model output
Design and automate model monitoring workflows
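As one illustration of what a data drift check can look like, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test. The function, DataFrame inputs, and threshold are assumptions for illustration, not the monitor implementation that ships with ModelOp Center:

from scipy import stats

def ks_drift(baseline_df, production_df, column, alpha=0.05):
    # Compare the training-time (baseline) distribution of a numeric feature
    # against its recent production distribution
    statistic, p_value = stats.ks_2samp(baseline_df[column], production_df[column])
    return {
        "column": column,
        "ks_statistic": statistic,
        "p_value": p_value,
        "drift_detected": bool(p_value < alpha),
    }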
Risk Performance:
Ensuring that models constantly operate within established business, risk, and compliance ranges, and that they deliver ethically fair results, is an ongoing challenge. Prevent out-of-compliance issues with automated, continuous risk performance monitoring. Risk performance monitors include:
Ethical fairness of model output
Interpretability of model feature weightings
Process Performance:
Continuous monitoring of the end-to-end model operations process ensures that all steps are properly executed and adhered to. Collect and retain data for each step in the model life cycle, resulting in reproducibility and auditability. Process performance monitors include:
Registration processes
Operational processes
Monitoring processes
Governance processes
Writing a Custom Monitor (Metrics Function)
The Metrics Function allows you to define the custom metrics that you would like to monitor for your model. This metrics function is included in source code that is registered as a model in ModelOp Center and then added as an associated model for monitoring. You can use a Metrics Job to manually execute this script against data, or use an MLC Process to trigger automatic execution. See Model Batch Jobs and Tests for more information.
You can specify a Metrics Function either with a #modelop.metrics smart tag comment before the function definition, or by selecting it within the Command Center after the model source code is registered. The Metrics Function executes against a batch of records and yields test results as a JSON object of the form {"metric_1": <value_1>, …, "metric_n": <value_n>}. These values are used to populate the Test Results visuals within the UI (as seen at the bottom of this page).
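Before the full example below, here is a minimal sketch of that contract. The column names 'label' and 'prediction' are assumptions about the scored input batch (a pandas DataFrame), not required names:

import sklearn.metrics

#modelop.metrics
def metrics(data):
    # Compute any metric of interest over the batch of records
    accuracy = sklearn.metrics.accuracy_score(data['label'], data['prediction'])
    # Yield a flat, JSON-serializable dictionary of metric names to values
    yield {"accuracy": accuracy}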
Here is a full example of how to code a Metrics Function. It calculates the ROC curve, AUC, F2 score, and the confusion matrix.
#modelop.metrics
def metrics(x):
    # Load the trained artifacts needed to score the batch
    lasso_model = lasso_model_artifacts['lasso_model']
    dictionary = lasso_model_artifacts['dictionary']
    threshold = lasso_model_artifacts['threshold']
    tfidf_model = lasso_model_artifacts['tfidf_model']

    actuals = x.flagged

    # Re-create the feature pipeline: clean text, build bag-of-words, pad, and TF-IDF transform
    cleaned = preprocess(x.content)
    corpus = cleaned.apply(dictionary.doc2bow)
    corpus_sparse = gensim.matutils.corpus2csc(corpus).transpose()
    corpus_sparse_padded = pad_sparse_matrix(sp_mat=corpus_sparse,
                                             length=corpus_sparse.shape[0],
                                             width=len(dictionary))
    tfidf_vectors = tfidf_model.transform(corpus_sparse_padded)

    # Score the batch and apply the decision threshold
    probabilities = lasso_model.predict_proba(tfidf_vectors)[:, 1]
    predictions = pd.Series(probabilities > threshold, index=x.index).astype(int)

    # Compute the evaluation metrics
    confusion_matrix = sklearn.metrics.confusion_matrix(actuals, predictions)
    fpr, tpr, thres = sklearn.metrics.roc_curve(actuals, predictions)
    auc_val = sklearn.metrics.auc(fpr, tpr)
    f2_score = sklearn.metrics.fbeta_score(actuals, predictions, beta=2)

    roc_curve = [{'fpr': point[0], 'tpr': point[1]} for point in zip(fpr, tpr)]

    labels = ['Compliant', 'Non-Compliant']
    cm = matrix_to_dicts(confusion_matrix, labels)

    test_results = dict(roc_curve=roc_curve,
                        auc=auc_val,
                        f2_score=f2_score,
                        confusion_matrix=cm)

    yield test_results
Here is an example of expected output from this function:
{
    "roc_curve": [
        {"fpr": 0.0, "tpr": 0.0},
        {"fpr": 0.02564102564102564, "tpr": 0.6666666666666666},
        {"fpr": 1.0, "tpr": 1.0}
    ],
    "auc": 0.8205128205128204,
    "f2_score": 0.625,
    "confusion_matrix": [
        {"Compliant": 76, "Non-Compliant": 2},
        {"Compliant": 1, "Non-Compliant": 2}
    ]
}
Running a Monitor (Metrics Job) Manually
Run a Metrics Job Manually from the CLI
To create a ‘metrics job’ from the CLI, use the command
moc job create testjob <deployable-model-uuid> <input-file-name> <output-file-name> optional-flags
This command yields a UUID for the job.
To find the raw JSON results of the job, use the command
moc job result <uuid>
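For example, a manual run might look like the following; the model UUID and file names are hypothetical placeholders, and the job UUID printed by the first command is what you pass to the second:

moc job create testjob 0b9f3a8c-1d2e-4f56-a789-0123456789ab input_data.json test_results.json
moc job result <job-uuid-printed-by-the-previous-command>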
Run a Metrics Job Manually from the Command Center UI
See "Manually Create a Batch Job in the Command Center" in Model Batch Jobs and Tests.
View the results of a Monitor (Metrics Job)
To see the results of a monitor or test:
1. Navigate to Models and select the model whose tests you would like to view.
2. Select the Snapshot (version) of the model to view the Test Results.
3. Click on the test results for that version.
The following example measures consumer credit classification, including the ROC curve.
Next Article: Monitor a Deployed Model >