This article describes how ModelOp Center enables schema generation and use.
Background
Overview
As mentioned in prior articles, ModelOp Center uses externalized schemas to ensure that the model data ingress adheres to what the model expects and that the model output adheres to what the consuming application/process expects.
Avro Schemas
ModelOp Center utilizes the Avro specification for schema checking. It can be found here: https://avro.apache.org/docs/current/spec.html
For most models, records, arrays, or simple types suffice; but in some instances, more complex structures are required, especially when input data is nested.
An Example
Let’s consider a model that takes as input the following records in a .json file:
...
The corresponding Avro schema is the following object:
The Avro schema declares the field names and their allowable types.
By default, if a field is listed in the schema, it cannot be omitted from the input/output record. In addition, the value of a field must match (one of) the allowable types as declared in the schema.
The key:value pair "type": "record" at the top level of the JSON object indicates the overall structure of the input, i.e., a dictionary of key:value pairs. If, instead of the records above, we have arrays of records such as
Code Block
[{"UUID": "9a5d9f42-3f36-4f38-88dd-22353fdb66a7", "amount": 8875.50, "home_ownership": "MORTGAGE", "age": "Over Forty", "credit_age": 4511, "employed": true, "label": 1, "prediction": 1}]
[{"UUID": "f8d95245-a186-45a6-b951-376323d06d02", "amount": 9000, "home_ownership": "MORTGAGE", "age": "Under Forty", "credit_age": 7524, "employed": false, "label": 0, "prediction": 1}]
[{"UUID": "8607e327-4dca-4372-a4b9-df7730f83c8e", "amount": 5000.50, "home_ownership": "RENT", "age": "Under Forty", "credit_age": null, "employed": true, "label": 0, "prediction": 0}]
then the Avro schema would have to wrap the inner record in an array, as follows:
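A minimal sketch of such a wrapper, assuming a shortened, hypothetical inner record with only two fields (`UUID` and `amount`) rather than the full schema. The record name `loan_record` and the helper `wrap_in_array` are illustrative and not part of ModelOp Center:

```python
import json

# Hypothetical, shortened inner record schema (illustration only).
inner_record = {
    "type": "record",
    "name": "loan_record",
    "fields": [
        {"name": "UUID", "type": "string"},
        {"name": "amount", "type": ["int", "double"]},
    ],
}


def wrap_in_array(record_schema):
    """Return an Avro array schema whose items are the given record schema."""
    return {"type": "array", "items": record_schema}


array_schema = wrap_in_array(inner_record)
print(json.dumps(array_schema, indent=2))
```

The top-level `"type"` is now `"array"`, and the record definition moves under the `"items"` key.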
Type Unions & Missing Values
The Avro specification allows for type unions. In the example above, the field "amount" takes on both integer and double values. This can be accommodated in the schema through the array "type": ["int", "double"]. This means that both integers and doubles will be allowed.
If a field is allowed to have missing values, such as "credit_age" in the example above, the Avro schema can accommodate this through a type union that includes "null" along with the base type, such as "type": ["null", "int"].
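To make the union behavior concrete, here is a minimal pure-Python sketch of how a record could be checked against a schema with type unions. It is not the Avro library or ModelOp Center's checker: it covers only a handful of primitive types, ignores Avro's 32-bit range for "int", and uses a hypothetical two-field schema.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Subset of Avro primitive type names mapped to Python checks (assumption:
# only these five types are needed for the illustration).
PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": _is_int,
    "double": lambda v: isinstance(v, float),
    "string": lambda v: isinstance(v, str),
}


def matches(value, avro_type):
    """A union (a list of types) matches if any of its branches matches."""
    if isinstance(avro_type, list):
        return any(matches(value, branch) for branch in avro_type)
    return PRIMITIVE_CHECKS[avro_type](value)


def conforms(record, schema):
    """Every field listed in the schema must be present and of an allowed type."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False
        if not matches(record[field["name"]], field["type"]):
            return False
    return True


# Hypothetical, shortened schema using the two unions discussed above.
schema = {
    "type": "record",
    "name": "loan_record",
    "fields": [
        {"name": "amount", "type": ["int", "double"]},
        {"name": "credit_age", "type": ["null", "int"]},
    ],
}

ok_int = conforms({"amount": 9000, "credit_age": 7524}, schema)
ok_null = conforms({"amount": 5000.50, "credit_age": None}, schema)
bad_type = conforms({"amount": "9000", "credit_age": 7524}, schema)
missing = conforms({"amount": 9000}, schema)
```

Here `ok_int` and `ok_null` are accepted, while `bad_type` (a string where a number is expected) and `missing` (no "credit_age" key at all) are rejected. Note that a null value and an absent field are different cases.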
Out-of-Bounds Values
The Avro specification does not allow for checking the bounds of numerical fields. To catch out-of-bounds values, it is therefore recommended to add a check in the model source code that replaces any such value with a value of a different type. For instance, if a model is supposed to output a probability, then in the (Python) source code one could include the line
Code Block
output["probability"] = output.probability.apply(lambda x: x if (x < 1) and (x > 0) else "Out of Bounds")
The normal Avro schema checking will then fail the check, as the numerical probability has been replaced by a string.
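The same idea can be sketched without pandas for output records held as plain dictionaries. The helper name `flag_out_of_bounds` and the sample outputs are hypothetical; the bounds check mirrors the line above and treats exactly 0 and exactly 1 as out of bounds.

```python
def flag_out_of_bounds(records, field="probability"):
    """Replace values outside the open interval (0, 1) with a string.

    A schema that declares the field as a number will then reject the record.
    """
    flagged = []
    for record in records:
        value = record[field]
        in_bounds = 0 < value < 1
        # Copy the record so the caller's data is left untouched.
        flagged.append({**record, field: value if in_bounds else "Out of Bounds"})
    return flagged


# Hypothetical model outputs.
outputs = [{"probability": 0.42}, {"probability": 1.7}, {"probability": -0.1}]
flagged = flag_out_of_bounds(outputs)
```

Only the first record keeps its numerical value; the other two now carry the string "Out of Bounds" and would fail a numerical type check.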
Extended Schema & Monitoring
To enable monitoring out-of-the-box (OOTB), ModelOp Center introduced in V2.4 the concept of an extended schema. We will go over the details below, but in short, an extended schema is a rich Avro schema, so that more information about the data can be learned. Traditionally, Avro schemas specify field names and types. Extended schemas add more key:value pairs to each field, so that OOTB monitors can make reasonable assumptions, such as inferring the role of a field ("predictor", "identifier", etc.).
An Example
Extended schemas are best understood through an example. Let’s consider the same records from the previous example:
Code Block
{"UUID": "9a5d9f42-3f36-4f38-88dd-22353fdb66a7", "amount": 8875.50, "home_ownership": "MORTGAGE", "age": "Over Forty", "credit_age": 4511, "employed": true, "label": 1, "prediction": 1}
{"UUID": "f8d95245-a186-45a6-b951-376323d06d02", "amount": 9000, "home_ownership": "MORTGAGE", "age": "Under Forty", "credit_age": 7524, "employed": false, "label": 0, "prediction": 1}
{"UUID": "8607e327-4dca-4372-a4b9-df7730f83c8e", "amount": 5000.50, "home_ownership": "RENT", "age": "Under Forty", "credit_age": null, "employed": true, "label": 0, "prediction": 0}
The corresponding extended schema is the following object:
Let us dissect the additional keys in this JSON object in the section below.
Extended Schema Fields & Values
dataClass
"dataClass" is used to indicate whether a certain field should be treated as a categorical feature or a numerical feature. While highly correlated to "type", knowing the type is not sufficient to accurately set a "dataClass". For example, it is not uncommon in Data Science to use integers to represent categories.
Possible values
"categorical" or "numerical".
role
"role" is used to indicate the purpose of a field in a dataset. Certain roles can be inferred from "name", such as "role": "score" if the field name is "name": "score" or "name": "prediction". Others are harder to infer and should be assigned manually, such as "role": "weight".
Possible values
"identifier", "predictor", "non_predictor", "weight", "score", "label".
Reserved field names
"id", "UUID" ("role" is set to "identifier")
"score", "prediction" ("role" is set to "score")
"label", "ground_truth" ("role" is set to "label")
All other fields are set to "role": "predictor".
protectedClass
"protectedClass" is a boolean field used to indicate whether or not a certain field corresponds to a protected attribute. Certain fields can be set as protected OOTB, such as "name": "gender". Others are harder to infer and must be assigned manually.
Possible values
true or false.
Reserved field names
"race", "color", "religion", "sex", "gender", "pregnancy", "sexual_orientation", "gender_identity", "national_origin", "age", "disability" ("protectedClass" set to true)
All other fields are set to "protectedClass": false.
driftCandidate
"driftCandidate" is a boolean field used to indicate whether or not a certain field should be considered for drift monitoring.
Possible values
true or false.
Reserved field roles
"non_predictor", "identifier", "weight" ("driftCandidate" set to false)
All other roles result in "driftCandidate" being set to true.
specialValues
"specialValues" is used to indicate whether some values of the field should be considered separately from the rest, especially when using a stability monitor. This field is not populated OOTB and must be provided manually.
If no special values are present, the default is an empty array: "specialValues": [].
Otherwise, "specialValues" is an array of JSON objects, with keys "values" and "purpose"; "values" is an array of any type, and "purpose" is a string.
The following are all valid examples:
Code Block
"specialValues": []
Code Block
"specialValues": [ { "values": ["N/A"], "purpose": "Field Not Applicable" } ]
Code Block
"specialValues": [ { "values": [999, 998], "purpose": "Flagged for review" }, { "values": [-1000], "purpose": "Invalid input" } ]
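The shape described above can be checked with a short helper. This is a sketch only: `is_valid_special_values` is a hypothetical name, and the check assumes that both keys must be present in every entry, which is how the description above reads but is not confirmed as ModelOp Center's exact rule.

```python
def is_valid_special_values(special_values):
    """Check that the value is a list of {"values": [...], "purpose": "..."} objects."""
    if not isinstance(special_values, list):
        return False
    for entry in special_values:
        if not isinstance(entry, dict):
            return False
        if "values" not in entry or "purpose" not in entry:
            return False
        if not isinstance(entry["values"], list):
            return False
        if not isinstance(entry["purpose"], str):
            return False
    return True


empty_ok = is_valid_special_values([])
single_ok = is_valid_special_values(
    [{"values": ["N/A"], "purpose": "Field Not Applicable"}]
)
# "values" must be an array, even for a single special value.
scalar_bad = is_valid_special_values([{"values": 999, "purpose": "Flagged"}])
# "purpose" is missing.
no_purpose_bad = is_valid_special_values([{"values": [-1000]}])
```

The first two calls accept their input; the last two reject it.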
scoringOptional
"scoringOptional" is a boolean field used to indicate whether or not a field is optional for the scoring function. As a reminder, the presence of a field in the schema makes it required by default, and thus a record missing a required field will be rejected by the schema.
This is limiting, as one might want to specify a "score" or "label" field in the extended input schema, even though these fields are most likely not present in the input records. Thus, one can make these fields optional for the scoring job, which guarantees that a record not containing them will not be rejected by the schema.
Possible values
true or false.
Reserved field roles and protected classes
If a field role is one of "label", "score", or "weight", "scoringOptional" is set to true.
If a field has "protectedClass": true, "scoringOptional" is set to true.
Otherwise, "scoringOptional" is set to false.
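Taken together, the default rules in this section can be sketched as a single function. This is an illustration of the rules as written above, not ModelOp Center's inference code: the function name `default_extended_keys` is hypothetical, name matching is assumed to be exact and case-sensitive, and "dataClass" is left out because it cannot be derived from these rules alone.

```python
# Reserved field names and roles, as listed in the sections above.
ROLE_BY_NAME = {
    "id": "identifier", "UUID": "identifier",
    "score": "score", "prediction": "score",
    "label": "label", "ground_truth": "label",
}
PROTECTED_NAMES = {
    "race", "color", "religion", "sex", "gender", "pregnancy",
    "sexual_orientation", "gender_identity", "national_origin",
    "age", "disability",
}
NO_DRIFT_ROLES = {"non_predictor", "identifier", "weight"}
SCORING_OPTIONAL_ROLES = {"label", "score", "weight"}


def default_extended_keys(name, role=None):
    """Return the default extended-schema keys for a field name.

    A manually assigned role (e.g. "weight") can be passed in; otherwise
    the role is looked up from the reserved names, defaulting to "predictor".
    """
    role = role or ROLE_BY_NAME.get(name, "predictor")
    protected = name in PROTECTED_NAMES
    return {
        "name": name,
        "role": role,
        "protectedClass": protected,
        "driftCandidate": role not in NO_DRIFT_ROLES,
        "scoringOptional": role in SCORING_OPTIONAL_ROLES or protected,
        "specialValues": [],
    }


uuid_keys = default_extended_keys("UUID")
age_keys = default_extended_keys("age")
prediction_keys = default_extended_keys("prediction")
amount_keys = default_extended_keys("amount")
```

For the sample records, "UUID" becomes an identifier that is excluded from drift monitoring, "age" stays a predictor but is marked protected and therefore optional for scoring, and "prediction" gets the role "score" and is also optional for scoring.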
Generating Extended Schemas
Using the UI
To generate an extended schema for a business model:
Navigate to the corresponding storedModel in the MOC UI (under Models).
Click on Schemas.
Click on Generate Extended Schema. You should see the following window pop-up:
Enter the data you want to use to infer a schema in the top box. The data must be formatted as one-line dictionaries, as in the sample data above.
Click on Generate Schema.
The schema can then be downloaded or saved as Input/Output Schema.
The recommended best practice is to download the generated schema, and then add it as an asset to the business model being monitored in the model’s Git repository. Once the schema is properly versioned along with the source code (e.g. in a GitHub repo), one doesn’t have to regenerate the schema anymore. Note that MOC will not push the generated schema to the model repo; it is up to the user to do so.
Note: If generating the extended schema for monitoring purposes, you should save it as an input schema; the OOTB monitors will look for an extended input schema to set the monitoring parameters.
Using Extended Schemas
For Scoring Jobs
MOC allows for one input schema and one output schema per business model. In order to enable MOC to recognize these files OOTB, follow the naming convention:
input_schema.avsc for the input schema
output_schema.avsc for the output schema
To signal to ModelOp runtimes that schemas are to be used for scoring jobs on a particular model, we add the following smart comments at the top of the primary source code:
Code Block
# modelop.schema.0: input_schema.avsc
# modelop.schema.1: output_schema.avsc
Note that the pound sign # assumes that the model is a Python model. You should use whatever syntax is reserved for one-line comments in the programming language of the model.
The primary source code is the code file where the scoring function is defined.
In certain cases, one might want to enable schema checking on either input or output, but not both. MOC allows one to do so. Say, for example, that you want to enable schema checking on input data, but not on outputs. The smart comments, in this case, should be:
Code Block
# modelop.schema.0: input_schema.avsc
# modelop.slot.1: in-use
If neither slot (input/output) is to be schema-checked, one can leave the smart comments off altogether.
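To illustrate how such smart comments can be read from a source file, here is a minimal sketch. It is not ModelOp Center's parser: the function `parse_smart_comments` and the regular expression are assumptions, and the sketch only handles Python-style `#` comments.

```python
import re

# Matches lines such as "# modelop.schema.0: input_schema.avsc"
# or "# modelop.slot.1: in-use" (assumed format, Python comments only).
SMART_COMMENT = re.compile(
    r"^\s*#\s*modelop\.(schema|slot)\.(\d+)\s*:\s*(\S+)\s*$"
)


def parse_smart_comments(source):
    """Map each slot number to a (kind, value) pair found in the source."""
    slots = {}
    for line in source.splitlines():
        match = SMART_COMMENT.match(line)
        if match:
            kind, slot, value = match.groups()
            slots[int(slot)] = (kind, value)
    return slots


# Hypothetical primary source file with input checking only.
source = """# modelop.schema.0: input_schema.avsc
# modelop.slot.1: in-use

def score(record):
    return record
"""

parsed = parse_smart_comments(source)
checked_slots = [slot for slot, (kind, _) in parsed.items() if kind == "schema"]
```

In this sketch, slot 0 is associated with `input_schema.avsc` and is schema-checked, while slot 1 is marked in use without a schema.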
Schema Enforcement on ModelOp Runtimes
REST
If a model is deployed to a ModelOp Center runtime as a REST endpoint, the smart comments described above will be used to determine which slots are to be schema-checked. With schema-checking enabled, requests made to that runtime that fail either the input or output schema checks return a 400 error with a rejected by schema message.
Batch Jobs
Batch jobs can be run from the ModelOp Center UI or from the CLI with schema checking enabled. Batch Jobs are more flexible with schema checking, as one could override the smart comments when creating the job under Job Options. When schema checking is enabled, records that do not conform to the provided schema are filtered out. If the input record fails the check against the input schema, then it is simply rejected by the schema and not scored. If the output fails the output schema check, then the record is scored, but won’t be piped to the output file. This is so that the output is not allowed into a downstream application where it could cause errors.
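The filtering behavior described above can be sketched as a simple loop. This is a simplified illustration, not the runtime's implementation: `run_batch` is a hypothetical helper, and the two checks are stand-in functions rather than real Avro validation.

```python
def run_batch(records, score, input_ok, output_ok):
    """Score records, dropping those that fail the input or output check."""
    outputs = []
    for record in records:
        if not input_ok(record):
            continue  # rejected by the input schema: never scored
        result = score(record)
        if not output_ok(result):
            continue  # scored, but withheld from the output
        outputs.append(result)
    return outputs


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Stand-in checks and a toy scoring function (all hypothetical).
def input_ok(record):
    return _is_int(record.get("amount"))


def output_ok(result):
    return _is_int(result.get("prediction"))


def score(record):
    # Non-positive amounts are flagged with a string, as in the
    # out-of-bounds technique described earlier.
    value = 1 if record["amount"] > 0 else "Out of Bounds"
    return {"prediction": value}


records = [{"amount": 100}, {"amount": "bad"}, {"amount": -5}]
batch_output = run_batch(records, score, input_ok, output_ok)
```

Of the three records, the second never reaches the scoring function, and the third is scored but its output is withheld, so only the first result is written.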
For Monitoring
Monitoring OOTB requires that the business model has an extended input schema. When a monitoring job is created, the monitor’s init function accesses the extended input schema, and uses it to determine certain monitoring parameters, such as the names of the fields corresponding to specific roles (score, label, etc.).
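As a rough illustration of that lookup, the following sketch groups field names by role. The helper `fields_by_role` and the shortened schema are hypothetical; the actual monitor code may read the schema differently.

```python
def fields_by_role(extended_schema):
    """Group field names by their "role" key; fields without a role are skipped."""
    roles = {}
    for field in extended_schema["fields"]:
        role = field.get("role")
        if role is not None:
            roles.setdefault(role, []).append(field["name"])
    return roles


# Hypothetical, shortened extended input schema.
extended_schema = {
    "type": "record",
    "name": "loan_record",
    "fields": [
        {"name": "UUID", "type": "string", "role": "identifier"},
        {"name": "amount", "type": ["int", "double"], "role": "predictor"},
        {"name": "label", "type": "int", "role": "label"},
        {"name": "prediction", "type": "int", "role": "score"},
    ],
}

roles = fields_by_role(extended_schema)
score_column = roles["score"][0]
label_column = roles["label"][0]
```

A monitor written along these lines would compare the `score_column` ("prediction") against the `label_column` ("label") without the user naming either field explicitly.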
Updating Schemas
In some cases, a user will have to edit the generated schema manually, particularly when certain fields are too complex to be interpreted correctly (as the author intended) by the inference tool. The MOC UI allows you to edit the schemas on a storedModel. To do so:
Navigate to the storedModel in the MOC UI (under Models).
Click on Schemas.
Choose the schema you wish to edit from the menu on the left.
You should see two views of the schema: a Table view (most helpful for extended schemas), and a JSON view:
Click on either Edit Table or Edit JSON:
Click on Save Changes. Edits made to one object (JSON/Table) will be reflected in the other once they are saved.