Overview
The spark-submit
execution process was designed and architected with the following principles and features in mind:
When possible,
Python
and/orPySpark
components should be implemented in python. This principle will ensure testing and debugging are easier when running in local or cluster deploy modeThe
spark-runtime-service
is monitoring jobs inWAITING
andRUNNING
state until they reach theirCOMPLETE
,ERROR
orCANCELLED
state. If any new job changes are detected, model manager will be updated accordinglyBefore a job is submitted to the
Spark
cluster for execution, theModelOpPySparkDriver
implementation will extract all the (external file assets) input, output and model assets from the ModelBatchJob and pass them in as parameters to the function to be executed. These external file assets are passed in as lists ofAssets
:external_inputs – [{“name“: “test.csv“, “assetType”: “EXTERNAL_FILE, “fileUrl”: “/hadoop/test.csv“}]
external_outputs – [{“name“: “output.csv”, “assetType”: “EXTERNAL_FILE“, “fileUrl”: “/hadoop/output.csv“}]
external_model_assets – [{“name“: “asset.zip”, “assetType”: “EXTERNAL_FILE“, “fileUrl”: “/hadoop/asset.zip“}]
The
YarnMonitorService
component should be listening for and storing injobMessage
thestdout
andstderr
produced by a job that is running at theSpark
cluster. Whether to store or not store,stdout
andstderr
injobMessages
, is configurable.
Architecture
This diagram represents how core components interact, and gives a high level overview of the process flow from a ModelBatchJob to a Spark job running at the cluster.
Details
PySpark dynamic job abstraction
ModelOp Center abstracts the process of executing PySpark
jobs using ModelBatchJob
as input. The approach uses a main PySpark Driver ( ModelOpPySparkDriver.py
) that imports
the job metadata (details about job input, output and model assets, as well as method name to be executed), and the model source code to be executed, and then just a simple main
that will be able to call the desired function.
PySparkJobManifest
ModelOp Center decouples the transformation process that takes a job as input and produces a SparkLauncher
as output. For this purpose, ModelOp Center uses the PySparkJobManifest
intermediate component. PySparkJobManifest
contains all the required values when running a spark-submit
: python source code and metadata, function name, application name (all of which are used inside ModelOpPySparkDriver
). This design allows for decoupling and keeping agnostic the spark-submit
.
The next diagram displays the transformation process of a ModelBatchJob into a SparkLauncher:
ModelOpJobMonitor
The ModelOpJobMonitor
has two main responsibilities:
Submitting jobs to
Spark
cluster.Monitoring job statuses for existing, waiting and running,
PySpark
jobs.
In order to successfully fulfill these responsibilities, it relies on two key services:
SparkLauncherService
– component in charge of translating jobs into SparkLauncher objectsYarnMonitorService
– component in charge of fetching job statuses and output produced from the Spark cluster.
PySpark - Metrics Jobs
Running metrics jobs is one of the key features of the spark-runtime-service
.
The ModelTestResult generation got moved out of SparkRuntimeService and now its performed by MLC.
In addition to metrics jobs, spark-runtime-service
can run scoring and training jobs, too.