Background

Avro Schemas

ModelOp Center uses the Avro specification for schema checking. The specification can be found here: https://avro.apache.org/docs/current/spec.html

For most models, records, arrays, or simple types suffice. In some instances, more complex structures are required, especially when the input data is nested.

An Example

Let’s consider a model that takes as input the following records in a .json file:

{"UUID": "9a5d9f42-3f36-4f38-88dd-22353fdb66a7", "amount": 8875.50, "home_ownership": "MORTGAGE", "age": "Over Forty", "credit_age": 4511, "employed": true, "label": 1, "prediction": 1}
{"UUID": "f8d95245-a186-45a6-b951-376323d06d02", "amount": 9000, "home_ownership": "MORTGAGE", "age": "Under Forty", "credit_age": 7524, "employed": false, "label": 0, "prediction": 1}
{"UUID": "8607e327-4dca-4372-a4b9-df7730f83c8e", "amount": 5000.50, "home_ownership": "RENT", "age": "Under Forty", "credit_age": null, "employed": true, "label": 0, "prediction": 0}

The corresponding Avro schema is the following object:

input_schema.avsc

{
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {
            "name": "UUID",
            "type": "string"
        },
        {
            "name": "amount",
            "type" : ["int", "double"]
        },
        {
            "name": "home_ownership",
            "type": "string"
        },
        {
            "name": "age",
            "type": "string"
        },
        {
            "name": "credit_age",
            "type": ["null", "int"]
        },
        {
            "name": "employed",
            "type": "boolean"
        },
        {
            "name": "label",
            "type": "int"
        },
        {
            "name": "prediction",
            "type": "int"
        }
    ]
}

The Avro schema declares the field names and their allowable types.
By default, if a field is listed in the schema, it cannot be omitted from the input/output record. In addition, the value of a field must match (one of) the allowable types as declared in the schema.
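The two rules above (a listed field cannot be omitted, and each value must match one of its allowable types) can be sketched in plain Python. This is a minimal illustration only, not the Avro library and not ModelOp Center's actual checker: the helper names `matches_type` and `conforms` are hypothetical, and only flat records with the primitive types used in this example are handled.

```python
# Hypothetical helpers: a simplified stand-in for real Avro validation.

def matches_type(value, avro_type):
    """Return True if value matches a primitive Avro type name or a union (list)."""
    if isinstance(avro_type, list):  # type union: any branch may match
        return any(matches_type(value, t) for t in avro_type)
    if avro_type == "null":
        return value is None
    if avro_type == "boolean":
        return isinstance(value, bool)
    if avro_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    if avro_type == "double":
        return isinstance(value, float)
    if avro_type == "string":
        return isinstance(value, str)
    return False  # other Avro types are out of scope for this sketch


def conforms(record, schema):
    """Check a dict against a flat record schema: every field present and well typed."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False  # a field listed in the schema was omitted
        if not matches_type(record[field["name"]], field["type"]):
            return False  # value matches none of the allowable types
    return True


schema = {
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {"name": "amount", "type": ["int", "double"]},
        {"name": "credit_age", "type": ["null", "int"]},
        {"name": "employed", "type": "boolean"},
    ],
}

print(conforms({"amount": 9000, "credit_age": None, "employed": True}, schema))  # True
print(conforms({"amount": 9000, "employed": True}, schema))                      # False
print(conforms({"amount": "9000", "credit_age": 1, "employed": True}, schema))   # False
```

The second record fails because credit_age is missing, and the third fails because "9000" is a string rather than an int or a double.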

The key:value pair "type": "record" at the top level of the JSON object indicates the overall structure of the input, i.e., a dictionary of key:value pairs. If, instead of the records above, the input consists of arrays of records such as

[{"UUID": "9a5d9f42-3f36-4f38-88dd-22353fdb66a7", "amount": 8875.50, "home_ownership": "MORTGAGE", "age": "Over Forty"}]
[{"UUID": "f8d95245-a186-45a6-b951-376323d06d02", "amount": 9000, "home_ownership": "MORTGAGE", "age": "Under Forty"}]
[{"UUID": "8607e327-4dca-4372-a4b9-df7730f83c8e", "amount": 5000.50, "home_ownership": "RENT", "age": "Under Forty"}]

then the Avro schema would have to wrap the inner record in an array, as follows:

{
    "type": "array",
    "items": {
        "type": "record",
        "name": "inferred_schema",
        "fields": [
            {
                "name": "UUID",
                "type": "string"
            },
            {
                "name": "amount",
                "type" : ["int", "double"]
            },
            {
                "name": "home_ownership",
                "type": "string"
            },
            {
                "name": "age",
                "type": "string"
            }
        ]
    }
}
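When a record schema already exists, the array form can be derived from it programmatically rather than edited by hand. The following is a small sketch of that wrapping step; the function name `wrap_in_array` is hypothetical.

```python
import json


def wrap_in_array(record_schema):
    """Return an Avro array schema whose items are the given record schema."""
    return {"type": "array", "items": record_schema}


record_schema = {
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {"name": "UUID", "type": "string"},
        {"name": "amount", "type": ["int", "double"]},
        {"name": "home_ownership", "type": "string"},
        {"name": "age", "type": "string"},
    ],
}

array_schema = wrap_in_array(record_schema)
print(json.dumps(array_schema, indent=4))
```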

Type Unions & Missing Values

The Avro specification allows for type unions. In the example above, the field amount takes on both integer and double values. The schema accommodates this by declaring the type as an array, "type": ["int", "double"], so that a value of either type is accepted.

If a field is allowed to have missing values, such as credit_age in the example above, the Avro schema can accommodate this through a type union that includes "null" along with the base type, such as "type": ["null", "int"].
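To illustrate how such unions arise, the sketch below derives a field's type from a handful of sample values, producing a union when more than one type is observed. It is an assumption-laden illustration, not ModelOp Center's schema inference: the names `avro_type_of` and `infer_field_type` are hypothetical, only the primitive types from this example are covered, and the fixed ordering (with "null" first) is chosen simply to match the schemas shown above.

```python
# Fixed ordering so that unions come out as ["null", "int"] and ["int", "double"].
TYPE_ORDER = ["null", "boolean", "int", "double", "string"]


def avro_type_of(value):
    """Map a parsed JSON value to a primitive Avro type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_field_type(values):
    """Return a single type name, or a union (list) if several types are observed."""
    if not values:
        raise ValueError("at least one sample value is required")
    seen = {avro_type_of(v) for v in values}
    ordered = [t for t in TYPE_ORDER if t in seen]
    return ordered[0] if len(ordered) == 1 else ordered


print(infer_field_type([8875.50, 9000, 5000.50]))  # ['int', 'double']
print(infer_field_type([4511, 7524, None]))        # ['null', 'int']
print(infer_field_type(["MORTGAGE", "RENT"]))      # string
```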

Out-of-Bounds Values

The Avro specification does not provide a way to check the bounds of numerical fields. To catch out-of-bounds values, it is recommended to add a check in the model source code that changes the type of the offending value. For instance, if a model is supposed to output a probability, then the (Python) source code could include the line

output["probability"] = output.probability.apply(lambda x: x if (x < 1) and (x > 0) else "Out of Bounds")

Regular Avro schema checking will then reject the record, because the numerical probability has been replaced by a string.
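As a self-contained illustration of why this works, the sketch below applies the same replacement to plain Python values (no pandas) and then performs a type check standing in for a field declared as "double". The function name `flag_out_of_bounds` is hypothetical, and the strict bounds mirror the line above.

```python
def flag_out_of_bounds(x):
    """Keep x if it lies strictly between 0 and 1; otherwise return a string sentinel."""
    return x if 0 < x < 1 else "Out of Bounds"


probabilities = [0.25, 1.7, -0.1]
checked = [flag_out_of_bounds(p) for p in probabilities]

# A field declared as "double" only accepts floats, so the sentinel string fails.
passes = [isinstance(v, float) for v in checked]

print(checked)  # [0.25, 'Out of Bounds', 'Out of Bounds']
print(passes)   # [True, False, False]
```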

Schema Enforcement on ModelOp Runtimes

Batch Jobs

Batch jobs can be run from the ModelOp Center UI or from the CLI with schema checking enabled. When schema checking is enabled, records that do not conform to the provided schema are filtered out. If an input record fails the check against the input schema, it is rejected and not scored. If an output fails the output schema check, the record has been scored, but the output is not written to the output file. This prevents nonconforming output from reaching a downstream application where it could cause errors.
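The filtering behavior described above can be sketched as a simple loop. This is only an illustration of the described semantics, not the runtime's implementation: `run_batch`, `input_ok`, `score`, and `output_ok` are hypothetical names, and the two check functions are toy stand-ins for real schema validation.

```python
def run_batch(records, score, input_ok, output_ok):
    """Filter inputs, score, then filter outputs; returns (written_outputs, scored_count)."""
    written, scored_count = [], 0
    for record in records:
        if not input_ok(record):
            continue  # rejected by the input schema: never scored
        result = score(record)
        scored_count += 1
        if output_ok(result):
            written.append(result)  # only conforming outputs reach the output file
    return written, scored_count


# Toy stand-ins for the schema checks and the model (all hypothetical).
def input_ok(record):
    return isinstance(record.get("amount"), (int, float))


def score(record):
    # Deliberately emits a string for small amounts to trigger an output rejection.
    return {"prediction": 1 if record["amount"] > 6000 else "unknown"}


def output_ok(result):
    return isinstance(result["prediction"], int)


records = [{"amount": 9000}, {"amount": "9000"}, {"amount": 5000.5}]
written, scored_count = run_batch(records, score, input_ok, output_ok)
print(written)       # [{'prediction': 1}]
print(scored_count)  # 2
```

Of the three records, the second is rejected on input and never scored, while the third is scored but its output is dropped.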

REST

If a model is deployed to a ModelOp Center Runtime as a REST endpoint with schema checking enabled, requests to that Runtime that fail either the input or the output schema check return a 400 error with a "rejected by schema" message.
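In contrast to batch filtering, a REST request gets an explicit error response. The sketch below models that behavior as a function returning a status code and a body. It is not the Runtime's code: `handle_request` and the helper functions are hypothetical, and the exact shape of the response body is an assumption (only the 400 status and the "rejected by schema" message come from the text above).

```python
REJECTED = {"error": "rejected by schema"}  # assumed body shape, for illustration only


def handle_request(record, score, input_ok, output_ok):
    """Return (status_code, body) for a single scoring request."""
    if not input_ok(record):
        return 400, REJECTED  # input failed the input schema check
    result = score(record)
    if not output_ok(result):
        return 400, REJECTED  # output failed the output schema check
    return 200, result


# Toy stand-ins for the schema checks and the model (all hypothetical).
def input_ok(record):
    return isinstance(record.get("amount"), (int, float))


def score(record):
    return {"prediction": 1 if record["amount"] > 6000 else 0}


def output_ok(result):
    return isinstance(result.get("prediction"), int)


print(handle_request({"amount": 9000}, score, input_ok, output_ok))
print(handle_request({"amount": "9000"}, score, input_ok, output_ok))
```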

Extended Schema & Monitoring

To enable out-of-the-box (OOTB) monitoring, ModelOp Center introduced the concept of an extended schema in V2.4. The details are covered below. In short, an extended schema is an enriched Avro schema that carries more information about the data. Traditionally, Avro schemas specify field names and types. Extended schemas add more key:value pairs to each field, so that OOTB monitors can make reasonable assumptions, such as inferring the role of a field (predictor, identifier, etc.).

An Example

Let’s consider the same records from the previous example:

{"UUID": "9a5d9f42-3f36-4f38-88dd-22353fdb66a7", "amount": 8875.50, "home_ownership": "MORTGAGE", "age": "Over Forty", "credit_age": 4511, "employed": true, "label": 1, "prediction": 1}
{"UUID": "f8d95245-a186-45a6-b951-376323d06d02", "amount": 9000, "home_ownership": "MORTGAGE", "age": "Under Forty", "credit_age": 7524, "employed": false, "label": 0, "prediction": 1}
{"UUID": "8607e327-4dca-4372-a4b9-df7730f83c8e", "amount": 5000.50, "home_ownership": "RENT", "age": "Under Forty", "credit_age": null, "employed": true, "label": 0, "prediction": 0}

The corresponding extended schema is the following object:

input_schema.avsc

{
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {
            "name": "UUID",
            "type": "string",
            "dataClass": "categorical",
            "role": "identifier",
            "protectedClass": false,
            "driftCandidate": false,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "amount",
            "type": [
                "int",
                "double"
            ],
            "dataClass": "numerical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "home_ownership",
            "type": "string",
            "dataClass": "categorical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "age",
            "type": "string",
            "dataClass": "categorical",
            "role": "predictor",
            "protectedClass": true,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": true
        },
        {
            "name": "credit_age",
            "type": [
                "null",
                "int"
            ],
            "dataClass": "numerical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "employed",
            "type": "boolean",
            "dataClass": "categorical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "label",
            "type": "int",
            "dataClass": "categorical",
            "role": "label",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": true
        },
        {
            "name": "prediction",
            "type": "int",
            "dataClass": "categorical",
            "role": "score",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": true
        }
    ]
}
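Because the extra keys are ordinary JSON, a consumer can select fields by role or flag with a few lines of code. The sketch below shows one way to read them, using an abbreviated version of the schema above. How ModelOp Center's monitors actually consume these keys is not shown here; the helper name `fields_where` is hypothetical.

```python
# Abbreviated copy of the extended schema above (four of the eight fields).
extended_schema = {
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {"name": "UUID", "type": "string", "dataClass": "categorical",
         "role": "identifier", "protectedClass": False, "driftCandidate": False},
        {"name": "amount", "type": ["int", "double"], "dataClass": "numerical",
         "role": "predictor", "protectedClass": False, "driftCandidate": True},
        {"name": "age", "type": "string", "dataClass": "categorical",
         "role": "predictor", "protectedClass": True, "driftCandidate": True},
        {"name": "prediction", "type": "int", "dataClass": "categorical",
         "role": "score", "protectedClass": False, "driftCandidate": True},
    ],
}


def fields_where(schema, key, value):
    """Return the names of all fields whose extended-schema key equals value."""
    return [f["name"] for f in schema["fields"] if f.get(key) == value]


print(fields_where(extended_schema, "role", "predictor"))     # ['amount', 'age']
print(fields_where(extended_schema, "role", "identifier"))    # ['UUID']
print(fields_where(extended_schema, "protectedClass", True))  # ['age']
print(fields_where(extended_schema, "driftCandidate", True))  # ['amount', 'age', 'prediction']
```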

Schema Generation
