Introduction
ModelOp Center provides the ability to integrate with SQL-based databases. Batch jobs can currently use SQL assets as inputs or outputs (or both). When used as an input, the provided SQL statement is run once, and the rows returned are fed into the underlying model, one at a time, as individual records. When used an output, it is expected that the model emit a flat dictionary of key/value pairs, and the output SQL statement is executed once for each emitted record, with the values from the dictionary mapped into the SQL statement via the keys and the parameters listed in the SQL asset.
SQL Asset Overview
SQL assets have 3 main components: the connection string, the SQL statement itself, and the parameters.
The connection string
Connection strings all follow the same basic pattern:
scheme://username:password@host:port/database
The scheme
indicates which underlying odbc driver should be used to connect to the DB in question. Currently, the Centos and UBI 8 images support mysql
, sqlserver
, snowflakedsiidriver
, db2
, sqlserver
, databricks
, and postgres
. Additional drivers can be installed, so long as the /etc/odbcinst.ini
file is properly updated.
The username
and password
fields are optional. Credentials can instead be configured in the runtime’s application.json
file.
Code Block | ||
---|---|---|
| ||
{ "sql-credentials": [ { "host": "host1.com", "username": "my_username", "password": "my_password" }, { "host": "host2.com", "username": {"from_env": "MY_ENV_VARIABLE"}, "password": {"from_file": "/config/the-password"} } ] } |
A connection string of scheme://host1.com:port/database
would use the host1.com
as the key to determining the username and password. Note that both the username and password can be specified directly in the config file, or looked for in an environment variable, or in a specified file. There is an additional from_file_from_env
option that indicates the value in question should be found in a file, the name of which is found in the specified environment variable.
The host
is a required field that indicates the hostname where the database lives.
The port
field is required. The only exception is when the driver is mysql
and the database is running on the default port of 3306.
The database
field indicates… the database. And it is required.
The SQL statement
When used as an input asset, the specified SQL statement will be executed once, and the runtime will iterate over returned rows, feeding them one at a time into the model. And example SQL statement:
select * from table2
Alternatively, ?
can be included in the statement to allow parameters specified in the SQL asset to be used:
select * from table2 where columnABC = ?
When used as an output asset, the specified SQL statement will be executed for each output record:
insert into table2 (str, i, bl) values (?, ?, ?)
The way values from emitted records are mapped into the output SQL statement is described in the next section.
Bulk inserts
When a SQL asset is being used as an output for a job, the runtime checks to see if the provided statement looks like a simple insert statement:
insert into tablename (columnA, columnB) values (?, ?);
If it matches the above pattern (with the column names being optional), then rather than execute the SQL statement for each output record, the runtime instead dynamically builds up a single large SQL statement that gets executed when the batch job is complete. In the above example, suppose you emitted 3 records with values of {a, b}, {c, d}, and {e, f}. The runtime will (re)generate the SQL statement to look like this:
insert into tablename (columnA, columnB) values (a, b), (c, d), (e, f);
The parameters
When used as an input asset, specified parameters are substituted into the specified SQL statement in the order they are presented. The number of ?
in the SQL statement must match the number of specified parameters. Additionally, what is specified is the value that should be fed into the SQL statement. So if the statement looks like this:
select * from table2 where columnABC = ?
… and the only specified parameter is my_value
, the SQL statement that gets executed will look like this:
select * from table2 where columnABC = 'my_value'
On the output side of things, the parameters specify the key to use to find the appropriate value in the output record. Consider the following SQL statement:
insert into table2 (alpha, beta) values (?, ?)
Consider parameters that specify aaa
and xyz
.
Consider an output record that looks like this:
Code Block | ||
---|---|---|
| ||
{ "aaa": "value1", "xyz": 123 } |
The SQL statement executed will have the values (value1
and 123
) inserted into the SQL statement in the order the parameters are specified:
insert into table2 (alpha, beta) values ('value1', 123)
SQL Setup - ModelOp Runtime
Prerequisites
The Centos or UBI/RHEL 8 image.
Runtime Configuration
No special runtime configuration is required.
Credentials Configuration
Credentials are mapped based on host name, not driver type, so credentials configuration is identical across all drivers.
It’s worth noting that databricks typically requires a personal access token, and so the username
credential must be set to token
and the password
credential should be set to the personal access token.
Redshift Support
Connections String:
Note, the Redshift connector returns data as Unicode character type.
To work around this, add UseUnicode=0 to the connection string so the connector returns data as regular SQL types
Connection String: redshift://redshift-cluster-1.ct8ouww8lud6.us-east-2.redshift.amazonaws.com:5439/dev;UseUnicode=0
Configuration Timeout: Redshift operates distinctly from traditional databases/data warehouses, and depending on the table structure, may require additional time to return results
By default, the ModelOp Runtime has an ODBC timeout set to 5 seconds. This value may need to be changed for Redshift.
To do so, the following timeout value should be configured in SCCS: modelop.runtime.odbc-timeout.
Data Types:
When importing data into Redshift, the data types are not the same as other SQL Databases, and as such, for some of the data may need to be changed.
For example: Redshift does not support bit. We changed this to varchar(50)
Source: https://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html
Adding SQL Assets to a Model
Via the ModelOp UI
1. On the Business Model details page, click on the Assets tab.
2. Click “Add Asset” in the top right corner. The user is presented with multiple options:
3. Click “Add SQL Asset” button in the upper right hand corner.
4. Fill out the pop up form with the SQL asset details:
Running a Job using SQL Assets`
Via the ModelOp UI
1. In the “Create Job” wizard, when choosing the input and/or output assets, select SQL
from the dropdown box:
2. Next, fill out the form as follows, choosing the actual values you want to inject into the SQL String as the SQL Parameters:
When filling out the form for an output asset, the parameters chosen need to match the names of the keys in the emitted dictionary: