...

To enable out-of-the-box (OOTB) monitoring, ModelOp Center V2.4 introduced the concept of an extended schema. The details follow below; in short, an extended schema is an Avro schema enriched with additional metadata about the data. Traditionally, Avro schemas specify only field names and types. Extended schemas add more key:value pairs to each field so that OOTB monitors can make reasonable assumptions, such as inferring the role of a field (predictor, identifier, etc.).

An Example

Extended schemas are best understood through an example. Let’s consider the same records from the previous example:

...

input_schema.avsc
{
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {
            "name": "UUID",
            "type": "string",
            "dataClass": "categorical",
            "role": "identifier",
            "protectedClass": false,
            "driftCandidate": false,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "amount",
            "type": [
                "int",
                "double"
            ],
            "dataClass": "numerical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "home_ownership",
            "type": "string",
            "dataClass": "categorical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "age",
            "type": "string",
            "dataClass": "categorical",
            "role": "predictor",
            "protectedClass": true,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": true
        },
        {
            "name": "credit_age",
            "type": [
                "null",
                "int"
            ],
            "dataClass": "numerical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "employed",
            "type": "boolean",
            "dataClass": "categorical",
            "role": "predictor",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": false
        },
        {
            "name": "label",
            "type": "int",
            "dataClass": "categorical",
            "role": "label",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": true
        },
        {
            "name": "prediction",
            "type": "int",
            "dataClass": "categorical",
            "role": "score",
            "protectedClass": false,
            "driftCandidate": true,
            "specialValues": [],
            "scoringOptional": true
        }
    ]
}

Let us dissect the additional keys in this JSON object in the subsections below.
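Because the additional keys are ordinary JSON attributes on each field, an extended schema can be inspected with any JSON parser. The following is a minimal Python sketch, not ModelOp Center's own tooling; the helper name `extra_keys` and the inline two-field excerpt are illustrative assumptions, and a real schema would normally be read from a file such as input_schema.avsc.

```python
import json

# Two fields excerpted from the example schema; inlined here to stay self-contained.
SCHEMA_JSON = """
{
    "type": "record",
    "name": "inferred_schema",
    "fields": [
        {"name": "UUID", "type": "string", "dataClass": "categorical",
         "role": "identifier", "protectedClass": false, "driftCandidate": false,
         "specialValues": [], "scoringOptional": false},
        {"name": "amount", "type": ["int", "double"], "dataClass": "numerical",
         "role": "predictor", "protectedClass": false, "driftCandidate": true,
         "specialValues": [], "scoringOptional": false}
    ]
}
"""

# Keys that a traditional Avro field carries in this example.
STANDARD_KEYS = {"name", "type"}


def extra_keys(field):
    """Return the extended-schema keys of a field (everything except name and type)."""
    return {k: v for k, v in field.items() if k not in STANDARD_KEYS}


schema = json.loads(SCHEMA_JSON)
for field in schema["fields"]:
    print(field["name"], extra_keys(field))
```

Each printed dictionary contains exactly the keys that the subsections below describe.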

dataClass

"dataClass" indicates whether a field should be treated as a categorical feature or a numerical feature. Although "dataClass" is highly correlated with "type", the type alone is not sufficient to set "dataClass" accurately. For example, it is common in data science to use integers to represent categories, as with the "label" field in the schema above, which has "type": "int" but "dataClass": "categorical".

Possible values

"categorical" or "numerical".
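As an illustration of how a consumer might use "dataClass", the sketch below groups field names by their declared class. It is only a sketch: the function name `group_by_data_class` is hypothetical, and the three fields are copied from the example schema with the other keys omitted for brevity.

```python
from collections import defaultdict

# Name, type, and dataClass copied from three fields of the example schema.
fields = [
    {"name": "amount", "type": ["int", "double"], "dataClass": "numerical"},
    {"name": "label", "type": "int", "dataClass": "categorical"},
    {"name": "employed", "type": "boolean", "dataClass": "categorical"},
]


def group_by_data_class(fields):
    """Map each dataClass value to the list of field names that carry it."""
    groups = defaultdict(list)
    for field in fields:
        groups[field["dataClass"]].append(field["name"])
    return dict(groups)


groups = group_by_data_class(fields)
print(groups)
```

Note that "label" lands in the categorical group even though its Avro type is "int", which is the case that "type" alone cannot resolve.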

role

"role" indicates the purpose of a field in a dataset. Certain roles can be inferred from "name": for example, a field named "score" or "prediction" is assigned "role": "score". Others are harder to infer and should be assigned manually, such as "role": "weight".

Possible values

"identifier", "predictor", "non_predictor", "weight", "score", "label".

Reserved field names

  • "id", "UUID" ("role" is set to "identifier")

  • "score", "prediction" ("role" is set to "score")

  • "label", "ground_truth" ("role" is set to "label")

  • All other fields are set to "role": "predictor".
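The reserved-name rules above amount to a lookup with a default. The sketch below is one way to express them in Python; the function name `infer_role` is hypothetical, and exact, case-sensitive name matching is an assumption, since the rules above do not say how case is handled.

```python
# Reserved field names and the role each one implies, as listed above.
RESERVED_ROLES = {
    "id": "identifier",
    "UUID": "identifier",
    "score": "score",
    "prediction": "score",
    "label": "label",
    "ground_truth": "label",
}


def infer_role(field_name):
    """Return the role implied by a reserved field name, else "predictor"."""
    # Assumption: names are matched exactly, including case.
    return RESERVED_ROLES.get(field_name, "predictor")


for name in ["UUID", "amount", "prediction", "ground_truth"]:
    print(name, "->", infer_role(name))
```

Roles such as "weight" and "non_predictor" never come out of this lookup, which matches the note above that they should be assigned manually.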

protectedClass

"protectedClass" is a boolean field used to indicate whether or not a certain field corresponds to a protected attribute. Certain fields can be set as protected OOTB, such as "name": "gender". Others are harder to infer, and must be assigned manually.

Possible values

true or false.

Reserved field names

  • "race", "color", "religion", "sex", "gender", "pregnancy", "sexual_orientation", "gender_identity", "national_origin", "age", "disability" ("protectedClass" set to true)

  • All other fields are set to "protectedClass": false.
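The same pattern applies to "protectedClass": membership in the reserved list yields true, and everything else yields false. This is a minimal sketch with a hypothetical function name, `infer_protected_class`, and it again assumes exact, case-sensitive matching of field names.

```python
# Reserved field names that are treated as protected attributes, as listed above.
PROTECTED_NAMES = frozenset({
    "race", "color", "religion", "sex", "gender", "pregnancy",
    "sexual_orientation", "gender_identity", "national_origin",
    "age", "disability",
})


def infer_protected_class(field_name):
    """Return True if the field name is on the reserved protected list."""
    # Assumption: names are matched exactly, including case.
    return field_name in PROTECTED_NAMES


for name in ["age", "credit_age", "gender"]:
    print(name, "->", infer_protected_class(name))
```

Under this exact-match assumption, "credit_age" is not protected even though it contains "age", which agrees with the example schema.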

driftCandidate

"driftCandidate" is a boolean field used to indicate whether or not a certain field should be considered for drift monitoring.

Possible values

true or false.

Reserved field roles

  • "non_predictor", "identifier", "weight" ("driftCandidate" set to false)

  • All other roles result in "driftCandidate" being set to true.
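Unlike the previous two keys, "driftCandidate" is derived from the field's role rather than its name. A minimal sketch of that rule follows; the function name `infer_drift_candidate` is hypothetical.

```python
# Roles that exclude a field from drift monitoring, as listed above.
NON_DRIFT_ROLES = frozenset({"non_predictor", "identifier", "weight"})


def infer_drift_candidate(role):
    """Return False for roles excluded from drift monitoring, True otherwise."""
    return role not in NON_DRIFT_ROLES


for role in ["identifier", "predictor", "score", "label", "weight"]:
    print(role, "->", infer_drift_candidate(role))
```

This agrees with the example schema, where "UUID" (role "identifier") is the only field with "driftCandidate": false.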

specialValues

"specialValues" indicates which values of the field, if any, should be considered separately from the rest, especially when using a stability monitor. This field is not populated OOTB and must be provided manually.

  • If no special values are present, the default is an empty array: "specialValues": [].

  • Otherwise, "specialValues" is an array of JSON objects, with keys "values" and "purpose"; "values" is an array of any type, and "purpose" is a string.

  • The following are all valid examples:

    • "specialValues": []
    • "specialValues": [
          {
              "values": ["N/A"],
              "purpose": "Field Not Applicable"
          }
      ]
    • "specialValues": [
          {
              "values": [999, 998],
              "purpose": "Flagged for review"
          },
          {
              "values": [-1000],
              "purpose": "Invalid input"
          }
      ]
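Since "specialValues" must be written by hand, a small structural check can catch mistakes before a schema is used. The sketch below checks only the shape described above (an array of objects, each with an array "values" and a string "purpose"). The function name `special_values_errors` is hypothetical, and whether ModelOp Center performs any such validation itself is not stated here.

```python
def special_values_errors(special_values):
    """Return a list of problems found in a "specialValues" entry (empty if none)."""
    if not isinstance(special_values, list):
        return ["specialValues must be an array"]
    errors = []
    for i, item in enumerate(special_values):
        if not isinstance(item, dict):
            errors.append(f"item {i} must be an object")
            continue
        # "values" may hold elements of any type, but must itself be an array.
        if not isinstance(item.get("values"), list):
            errors.append(f'item {i}: "values" must be an array')
        if not isinstance(item.get("purpose"), str):
            errors.append(f'item {i}: "purpose" must be a string')
    return errors


valid = [
    {"values": [999, 998], "purpose": "Flagged for review"},
    {"values": [-1000], "purpose": "Invalid input"},
]
invalid = [{"values": 999, "purpose": "Flagged for review"}]

print(special_values_errors(valid))
print(special_values_errors(invalid))
```

The first call reports no problems; the second reports that "values" is a bare number rather than an array.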