Invoking SystemML in Spark Batch Mode


Overview

Because a primary purpose of SystemML is to perform machine learning on large distributed data sets, Spark Batch mode is one of the most important ways to invoke it. This section looks at that mode in more depth.

NOTE: For a programmatic API to run and interact with SystemML via Scala or Python, please see the Spark MLContext Programming Guide.


Spark Batch Mode Invocation Syntax

SystemML can be invoked in Spark Batch mode using the following syntax:

spark-submit SystemML.jar [-? | -help | -f <filename>] (-config=<config_filename>) ([-args | -nvargs] <args-list>)

The DML script to invoke is specified after the -f argument. Configuration settings can be passed to SystemML using the optional -config= argument. DML scripts can optionally take named arguments (-nvargs) or positional arguments (-args). Named arguments are preferred; positional arguments are considered deprecated. All the primary algorithm scripts included with SystemML use named arguments.

Example #1: DML Invocation with Named Arguments

spark-submit systemml/SystemML.jar -f systemml/algorithms/Kmeans.dml -nvargs X=X.mtx k=5

Example #2: DML Invocation with Positional Arguments

spark-submit systemml/SystemML.jar -f example/test/LinearRegression.dml -args "v" "y" 0.00000001 "w"
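Neither example above uses the optional -config= argument. The following is a minimal sketch of how it might be combined with named arguments, reusing the paths from Example #1. The configuration file name SystemML-config.xml and the helper names run and submit_kmeans_with_config are assumptions made for illustration. By default the sketch only prints the command, so it can be tried on a machine without Spark installed.

```shell
# Dry-run switch (an assumption of this sketch, not a SystemML feature):
# DRY_RUN=1 prints the command, DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: K-means from Example #1 plus a configuration file.
# SystemML-config.xml is a placeholder name; substitute your own file.
submit_kmeans_with_config() {
  run spark-submit systemml/SystemML.jar \
    -f systemml/algorithms/Kmeans.dml \
    -config=SystemML-config.xml \
    -nvargs X=X.mtx k=5
}

submit_kmeans_with_config
```

Following the syntax line above, -config= is placed after -f and before the script arguments.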

Execution modes

SystemML works with all Spark execution modes, including local (--master local[*]), yarn client (--master yarn-client), and yarn cluster (--master yarn-cluster). More information on Spark cluster execution modes can be found in the official Spark cluster deployment documentation. Local mode (--master local[*]) makes it easy to run Spark, and therefore SystemML, on a laptop.
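As a rough illustration of switching execution modes, the sketch below launches the K-means script from Example #1 against a master passed as a parameter. The helper names run and submit_kmeans are hypothetical. Two details are worth noting: spark-submit options such as --master go before the application jar, and local[*] should be quoted so the shell does not treat the brackets as a glob pattern. The command is only printed unless DRY_RUN=0 is set.

```shell
# Dry-run switch (an assumption of this sketch): DRY_RUN=1 prints, DRY_RUN=0 executes.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# submit_kmeans MASTER
# Hypothetical helper: runs Example #1 against the given Spark master.
# Spark options come before the jar; SystemML arguments come after it.
submit_kmeans() {
  run spark-submit --master "$1" systemml/SystemML.jar \
    -f systemml/algorithms/Kmeans.dml -nvargs X=X.mtx k=5
}

submit_kmeans 'local[*]'    # local mode using all available cores
submit_kmeans yarn-client   # driver runs on the submitting machine
submit_kmeans yarn-cluster  # driver runs inside the YARN cluster
```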

Recommended Spark Configuration Settings

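For best performance, we recommend setting the following flags when running SystemML with Spark: --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128. The first removes the limit on the total size of results returned to the driver (0 means unlimited); the second sets the maximum Akka message size to 128 MB.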
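To show where these flags sit in a complete command, here is a sketch that applies them to the K-means invocation from Example #1. The helper names run and submit_with_recommended_conf are hypothetical, and the command is only printed unless DRY_RUN=0 is set. As with other spark-submit options, the --conf flags precede the SystemML jar.

```shell
# Dry-run switch (an assumption of this sketch): DRY_RUN=1 prints, DRY_RUN=0 executes.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: Example #1 with the recommended Spark settings.
submit_with_recommended_conf() {
  run spark-submit \
    --conf spark.driver.maxResultSize=0 \
    --conf spark.akka.frameSize=128 \
    systemml/SystemML.jar \
    -f systemml/algorithms/Kmeans.dml -nvargs X=X.mtx k=5
}

submit_with_recommended_conf
```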

Examples

Please see the MNIST examples in the included SystemML-NN library for examples of Spark Batch mode execution with SystemML to train MNIST classifiers: