Introduction

Testing the efficacy of a model and testing it operationally is important to ensure the quality of the predictions the model provides to business. When new production data has statistical properties that violate the assumptions made at the time of the original creation, the predictive performance of a model can degrade. This is known as 'model drift', and it can result in a loss of money and competitive advantage.

In ModelOp Center, during production implementation, a MLC Process is typically implemented that runs periodic statistical tests against new labeled data and triggers alerts when thresholds are breached. A data scientist can also determine the current performance of the model by manually running ad hoc Batch Jobs using the CLI or the Command Center. They can also run a Champion/Challenger Model Comparison. All of these tests are persisted with the specific version of the model for auditability.

Choosing Evaluation Metrics

To test the efficacy of a model, a metric should be chosen during model development and used to benchmark the model. The chosen metric should reflect the underlying business problem. For instance, in a binary classification problem with very unbalanced class frequencies, accuracy is a poor choice of metric. A “model” which always predicts that the more common class will occur will be very accurate, but will not do a good job of predicting the less frequent class.

Take compliance in internal communications as an example. Very few internal communications may be non-compliant, but a model which never flags possible non-compliance is worthless even if it is highly accurate. A better metric in this case is an F₁ score or an F_β score for β> 1 more generally. The latter will reward the model more for true positives and punish the model for false negatives, occurrences where the model fails to detect non-compliant communication.

Similarly, for regression problems, the data scientist should decide on a metric based on whether a few bad errors with most being small is preferable in which case she should use mean absolute error (MAE); or whether no errors should exceed a particular threshold in which case the data scientist should use the max error. A metric like root mean squared error (RMSE) interpolates between these cases.
There are metrics for every type of problem: multi-class classification, all varieties of regression, unsupervised clustering, etc. They can range from quite simple to quite intricate, but whatever the problem, a metric should be decided upon early in development and used to test a model as it is promoted to UAT and then into production. Here are some tests it might encounter along the way.

The F1 score is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples
SHAP values (interpretability), is used on a per-record basis to justify why a particular record or client got the score they did. This makes SHAP fit into the action/scoring function more than it does in the Metrics Function
The ROC Curve to determine the ratio of true positives to false positives
The AUC (Area Under the ROC Curve)

Note: There can be other items that determine which model to promote to production. For example, the situation may favor a model with better inference speed, interpretability etc.

Writing the Metric Function

The Metrics Function allows you to define the test that calculates the Evaluation metrics for your model. You can use the Metrics Job to manually execute this script against data, or use an MLC Process to trigger automatic execution. See Model Batch Jobs and Tests for more information.

You can specify a Metrics Function either with a #modelop.metrics smart tag comment before the function definition, or you can select it within the Command Center after the model source code is registered. The Metrics Function executes against a batch of records and yields test results as a JSON object of the form {“metric_1”: <value_1>, …, “metric_n”: <value_n>}. These values are used to populate the Test Results visuals within the UI (as seen at the bottom of this page).

Here is an example of how to code a Metrics Function. It is calculating the ROC Curve, AUC, F₂, and the Confusion Matrix.

#modelop.metrics
def metrics(x):
    lasso_model = lasso_model_artifacts['lasso_model']
    dictionary = lasso_model_artifacts['dictionary']
    threshold = lasso_model_artifacts['threshold']
    tfidf_model = lasso_model_artifacts['tfidf_model']

    actuals = x.flagged
    
    cleaned = preprocess(x.content)
    corpus = cleaned.apply(dictionary.doc2bow)
    corpus_sparse = gensim.matutils.corpus2csc(corpus).transpose()
    corpus_sparse_padded = pad_sparse_matrix(sp_mat = corpus_sparse, 
                                             length=corpus_sparse.shape[0], 
                                             width = len(dictionary))
    tfidf_vectors = tfidf_model.transform(corpus_sparse_padded)

    probabilities = lasso_model.predict_proba(tfidf_vectors)[:,1]

    predictions = pd.Series(probabilities > threshold, index=x.index).astype(int) 
    
    confusion_matrix = sklearn.metrics.confusion_matrix(actuals, predictions)
    fpr,tpr,thres = sklearn.metrics.roc_curve(actuals, predictions)

    auc_val = sklearn.metrics.auc(fpr, tpr)
    f2_score = sklearn.metrics.fbeta_score(actuals, predictions, beta=2)

    roc_curve = [{'fpr': x[0], 'tpr':x[1]} for x in list(zip(fpr, tpr))]
    labels = ['Compliant', 'Non-Compliant']
    cm = matrix_to_dicts(confusion_matrix, labels)
    test_results = dict(roc_curve=roc_curve,
                   auc=auc_val,
                   f2_score=f2_score,
                   confusion_matrix=cm)    

    yield test_results

Here is an example of expected output from this function:

{
"roc_curve": 
    [{"fpr": 0.0, "tpr": 0.0}, 
     {"fpr": 0.02564102564102564, "tpr": 0.6666666666666666}, 
     {"fpr": 1.0, "tpr": 1.0}], 
  "auc": 0.8205128205128204, 
  "f2_score": 0.625, 
  "confusion_matrix": 
    [{"Compliant": 76, "Non-Compliant": 2}, 
     {"Compliant": 1, "Non-Compliant": 2}]
}

Running a Metric Job

Run a Metric Job from the CLI

To create a ‘metrics job’ from the CLI, use the command

moc job create testjob <deployable-model-uuid> <input-file-name> <output-file-name> optional-flags

This command yields a UUID for the job.
To find the raw JSON results of the job, use the command

moc job result <uuid>

Run a Metric Job from the Command Center UI

See Model Batch Jobs and Tests.

View the results of a Metrics Job

To see the results of a test, navigate to Models and select the model whose tests you would like to view.
Click on the version of the model who’s test you would like to view.
Click on the test results for that version.
The following example measures consumer credit metrics and suggests possible outcomes for a particular credit applicant.

Next Article: Monitor a Deployed Model >

v2.1 Model Efficacy Metrics and Monitoring