Beginner's Guide to DML and PyDML
- Overview
- Script Invocation
- Data Types
- Matrix Basics
- Control Statements
- User-Defined Functions
- Command-Line Arguments and Default Values
- Additional Information
Overview
SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors, one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).
Algorithm scripts written in DML and PyDML can be run on Spark, on Hadoop, or in Standalone mode. SystemML also features an MLContext API that allows SystemML to be accessed via Scala or Python from a Spark Shell, a Jupyter Notebook, or a Zeppelin Notebook.
This Beginner’s Guide serves as a starting point for writing DML and PyDML scripts.
Script Invocation
DML and PyDML scripts can be invoked in a variety of ways. Suppose that we have hello.dml and hello.pydml scripts containing the following:
print('hello ' + $1)
One way to begin working with SystemML is to download a binary distribution of SystemML and use the runStandaloneSystemML.sh and runStandaloneSystemML.bat scripts to run SystemML in standalone mode. The name of the DML or PyDML script is passed as the first argument to these scripts, followed by any additional arguments (such as -args in the commands below). Note that PyDML invocation can be forced with the addition of a -python flag.
./runStandaloneSystemML.sh hello.dml -args world
./runStandaloneSystemML.sh hello.pydml -args world
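The hello scripts above read a positional argument ($1). As a minimal sketch of the same idea with a named argument, the script below assumes a hypothetical file hello_named.dml and a hypothetical argument called name, passed with the -nvargs switch that is covered later in this guide:

```dml
# hello_named.dml (hypothetical file name, not part of the SystemML distribution)
# Possible invocation:
#   ./runStandaloneSystemML.sh hello_named.dml -nvargs name=world
print('hello ' + $name)
```

Named arguments make it harder to pass values in the wrong order once a script takes more than one or two.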
Data Types
SystemML has four value data types. In DML, these are: double, integer, string, and boolean. In PyDML, these are: float, int, str, and bool. In normal usage, the data type of a variable is implicit based on its value. Mathematical operations typically operate on doubles/floats, whereas integers/ints are typically useful for tasks such as iteration and accessing elements in a matrix.
aDouble = 3.0
bInteger = 2
print('aDouble = ' + aDouble)
print('bInteger = ' + bInteger)
print('aDouble + bInteger = ' + (aDouble + bInteger))
print('bInteger ^ 3 = ' + (bInteger ^ 3))
print('aDouble ^ 2 = ' + (aDouble ^ 2))
cBoolean = TRUE
print('cBoolean = ' + cBoolean)
print('(2 < 1) = ' + (2 < 1))
dString = 'Hello'
eString = dString + ' World'
print('dString = ' + dString)
print('eString = ' + eString)
aFloat = 3.0
bInt = 2
print('aFloat = ' + aFloat)
print('bInt = ' + bInt)
print('aFloat + bInt = ' + (aFloat + bInt))
print('bInt ** 3 = ' + (bInt ** 3))
print('aFloat ** 2 = ' + (aFloat ** 2))
cBool = True
print('cBool = ' + cBool)
print('(2 < 1) = ' + (2 < 1))
dStr = 'Hello'
eStr = dStr + ' World'
print('dStr = ' + dStr)
print('eStr = ' + eStr)
aDouble = 3.0
bInteger = 2
aDouble + bInteger = 5.0
bInteger ^ 3 = 8.0
aDouble ^ 2 = 9.0
cBoolean = TRUE
(2 < 1) = FALSE
dString = Hello
eString = Hello World
aFloat = 3.0
bInt = 2
aFloat + bInt = 5.0
bInt ** 3 = 8.0
aFloat ** 2 = 9.0
cBool = TRUE
(2 < 1) = FALSE
dStr = Hello
eStr = Hello World
Matrix Basics
Creating a Matrix
A matrix can be created in DML using the matrix() function and in PyDML using the full() function. In the example below, a matrix element is still considered to be of the matrix data type, so the value is cast to a scalar in order to print it. Matrix element values are of type double/float.
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
for (i in 1:nrow(m)) {
    for (j in 1:ncol(m)) {
        n = m[i,j]
        print('[' + i + ',' + j + ']:' + as.scalar(n))
    }
}
m = full("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
for (i in 0:nrow(m)-1):
    for (j in 0:ncol(m)-1):
        n = m[i,j]
        print('[' + i + ',' + j + ']:' + scalar(n))
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0
[2,1]:4.0
[2,2]:5.0
[2,3]:6.0
[3,1]:7.0
[3,2]:8.0
[3,3]:9.0
[4,1]:10.0
[4,2]:11.0
[4,3]:12.0
[0,0]:1.0
[0,1]:2.0
[0,2]:3.0
[1,0]:4.0
[1,1]:5.0
[1,2]:6.0
[2,0]:7.0
[2,1]:8.0
[2,2]:9.0
[3,0]:10.0
[3,1]:11.0
[3,2]:12.0
We can also output the matrix element values using the toString function:
m = matrix("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
print(toString(m, sep=" | ", decimal=1))
m = full("1 2 3 4 5 6 7 8 9 10 11 12", rows=4, cols=3)
print(toString(m, sep=" | ", decimal=1))
1.0 | 2.0 | 3.0
4.0 | 5.0 | 6.0
7.0 | 8.0 | 9.0
10.0 | 11.0 | 12.0
For additional information about the matrix() and full() functions, please see the Matrix Construction section of the Language Reference. For information about the toString() function, see the Other Built-In Functions section of the Language Reference.
Saving a Matrix
A matrix can be saved using the write() function in DML and the save() function in PyDML. SystemML supports four different formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary.
m = matrix("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)
write(m, "m.txt", format="text")
write(m, "m.mm", format="mm")
write(m, "m.csv", format="csv")
write(m, "m.binary", format="binary")
m = full("1 2 3 0 0 0 7 8 9 0 0 0", rows=4, cols=3)
save(m, "m.txt", format="text")
save(m, "m.mm", format="mm")
save(m, "m.csv", format="csv")
save(m, "m.binary", format="binary")
Saving a matrix automatically creates a metadata file for each format except for Matrix Market, since Matrix Market contains metadata within the *.mm file. All formats are text-based except binary. The contents of the resulting files are shown here.
Note that the text (i,j,v) and mm (Matrix Market) formats index from 1, even when working with PyDML, which is 0-based.
1 1 1.0
1 2 2.0
1 3 3.0
3 1 7.0
3 2 8.0
3 3 9.0
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 4,
    "cols": 3,
    "nnz": 6,
    "format": "text",
    "author": "SystemML",
    "created": "2017-01-01 00:00:01 PST"
}
%%MatrixMarket matrix coordinate real general
4 3 6
1 1 1.0
1 2 2.0
1 3 3.0
3 1 7.0
3 2 8.0
3 3 9.0
1.0,2.0,3.0
0,0,0
7.0,8.0,9.0
0,0,0
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 4,
    "cols": 3,
    "nnz": 6,
    "format": "csv",
    "header": false,
    "sep": ",",
    "author": "SystemML",
    "created": "2017-01-01 00:00:01 PST"
}
Not text-based
{
    "data_type": "matrix",
    "value_type": "double",
    "rows": 4,
    "cols": 3,
    "rows_in_block": 1000,
    "cols_in_block": 1000,
    "nnz": 6,
    "format": "binary",
    "author": "SystemML",
    "created": "2017-01-01 00:00:01 PST"
}
Loading a Matrix
A matrix can be loaded using the read() function in DML and the load() function in PyDML. As with saving, SystemML supports four formats: text (i,j,v), mm (Matrix Market), csv (delimiter-separated values), and binary. To read a file, a corresponding metadata file is required, except for the Matrix Market format. A metadata file is not required if a format parameter is specified to the read() or load() functions.
m = read("m.csv")
print("min:" + min(m))
print("max:" + max(m))
print("sum:" + sum(m))
mRowSums = rowSums(m)
for (i in 1:nrow(mRowSums)) {
    print("row " + i + " sum:" + as.scalar(mRowSums[i,1]))
}
mColSums = colSums(m)
for (i in 1:ncol(mColSums)) {
    print("col " + i + " sum:" + as.scalar(mColSums[1,i]))
}
m = load("m.csv")
print("min:" + min(m))
print("max:" + max(m))
print("sum:" + sum(m))
mRowSums = rowSums(m)
for (i in 0:nrow(mRowSums)-1):
    print("row " + i + " sum:" + scalar(mRowSums[i,0]))
mColSums = colSums(m)
for (i in 0:ncol(mColSums)-1):
    print("col " + i + " sum:" + scalar(mColSums[0,i]))
min:0.0
max:9.0
sum:30.0
row 1 sum:6.0
row 2 sum:0.0
row 3 sum:24.0
row 4 sum:0.0
col 1 sum:8.0
col 2 sum:10.0
col 3 sum:12.0
min:0.0
max:9.0
sum:30.0
row 0 sum:6.0
row 1 sum:0.0
row 2 sum:24.0
row 3 sum:0.0
col 0 sum:8.0
col 1 sum:10.0
col 2 sum:12.0
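The loading examples above rely on the metadata file that was written alongside m.csv. As a rough sketch of the alternative mentioned earlier, where the format is given directly to read(), the following assumes that m.csv exists without a metadata file. The header and sep values simply mirror the CSV metadata shown in the Saving a Matrix section. Depending on the SystemML version and the format, rows and cols may also need to be supplied.

```dml
# Assumption: m.csv is present, but its metadata file is not.
# The format is stated explicitly instead of being looked up in metadata.
m = read("m.csv", format="csv", header=FALSE, sep=",")
print("sum:" + sum(m))
```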
Matrix Operations
DML and PyDML offer a rich set of operators and built-in functions to perform various operations on matrices and scalars. Operators and built-in functions are described in great detail in the Language Reference (Expressions, Built-In Functions).
In this example, we create a matrix A. Next, we create another matrix B by adding 4 to each element in A, and then flip B by taking its transpose. We then multiply A and B and store the product in matrix C. We create a matrix D with the same number of rows and columns as C and initialize its elements to 5. Finally, we subtract D from C, divide each element of the result by 2, and assign the resulting matrix to D.
A = matrix("1 2 3 4 5 6", rows=3, cols=2)
print(toString(A))
B = A + 4
B = t(B)
print(toString(B))
C = A %*% B
print(toString(C))
D = matrix(5, rows=nrow(C), cols=ncol(C))
D = (C - D) / 2
print(toString(D))
A = full("1 2 3 4 5 6", rows=3, cols=2)
print(toString(A))
B = A + 4
B = transpose(B)
print(toString(B))
C = dot(A, B)
print(toString(C))
D = full(5, rows=nrow(C), cols=ncol(C))
D = (C - D) / 2
print(toString(D))
1.000 2.000
3.000 4.000
5.000 6.000
5.000 7.000 9.000
6.000 8.000 10.000
17.000 23.000 29.000
39.000 53.000 67.000
61.000 83.000 105.000
6.000 9.000 12.000
17.000 24.000 31.000
28.000 39.000 50.000
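The example above uses %*% for matrix multiplication. As a small sketch of how this differs from the element-wise * operator in DML, the following reuses matrix A. The names E and F are chosen only for illustration.

```dml
A = matrix("1 2 3 4 5 6", rows=3, cols=2)

# Element-wise product: both operands are 3x2 and so is the result.
# Each element of A is squared: (1 4), (9 16), (25 36).
E = A * A

# Matrix product: (2x3) %*% (3x2) gives a 2x2 result: (35 44), (44 56).
F = t(A) %*% A

print(toString(E))
print(toString(F))
```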
Matrix Indexing
The elements in a matrix can be accessed by their row and column indices. In the example below, we have a 3x3 matrix A. First, we access the element at the third row and third column. Next, we obtain a row slice (vector) of the matrix by specifying the row and leaving the column blank. We obtain a column slice (vector) by leaving the row blank and specifying the column. After that, we obtain a submatrix via range indexing, where we specify a row range and a column range, each written as a start and an end index separated by a colon. Note that PyDML indices are 0-based and the upper bound of a PyDML range is exclusive, so A[1:3,0:2] in PyDML selects the same submatrix as A[2:3,1:2] in DML.
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
print(toString(A))
B = A[3,3]
print(toString(B))
C = A[2,]
print(toString(C))
D = A[,3]
print(toString(D))
E = A[2:3,1:2]
print(toString(E))
A = full("1 2 3 4 5 6 7 8 9", rows=3, cols=3)
print(toString(A))
B = A[2,2]
print(toString(B))
C = A[1,]
print(toString(C))
D = A[,2]
print(toString(D))
E = A[1:3,0:2]
print(toString(E))
1.000 2.000 3.000
4.000 5.000 6.000
7.000 8.000 9.000
9.000
4.000 5.000 6.000
3.000
6.000
9.000
4.000 5.000
7.000 8.000
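Indices do not have to be literals. As a brief sketch, nrow() and ncol() can be used to address the last row, the last column, and the bottom-right element of the same matrix without hard-coding its size. The variable names here are illustrative only.

```dml
A = matrix("1 2 3 4 5 6 7 8 9", rows=3, cols=3)

lastRow = A[nrow(A),]                    # 1x3 row vector: 7 8 9
lastCol = A[,ncol(A)]                    # 3x1 column vector: 3 6 9
corner = as.scalar(A[nrow(A),ncol(A)])   # single element cast to a scalar: 9

print(toString(lastRow))
print(toString(lastCol))
print('corner = ' + corner)
```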
Control Statements
DML and PyDML both feature if, if-else, and if-else-if conditional statements.
DML and PyDML feature three loop statements: while, for, and parfor (parallel for). In the example, note that the print statements within the parfor loop can occur in any order since the iterations occur in parallel rather than sequentially as in a regular for loop. The parfor statement can include several optional parameters, as described in the Language Reference (ParFor Statement).
i = 1
while (i <= 3) {
    if (i == 1) {
        print('hello')
    } else if (i == 2) {
        print('world')
    } else {
        print('!!!')
    }
    i = i + 1
}
A = matrix("1 2 3 4 5 6", rows=3, cols=2)
for (i in 1:nrow(A)) {
    print("for A[" + i + ",1]:" + as.scalar(A[i,1]))
}
parfor(i in 1:nrow(A)) {
    print("parfor A[" + i + ",1]:" + as.scalar(A[i,1]))
}
i = 1
while (i <= 3):
    if (i == 1):
        print('hello')
    elif (i == 2):
        print('world')
    else:
        print('!!!')
    i = i + 1
A = full("1 2 3 4 5 6", rows=3, cols=2)
for (i in 0:nrow(A)-1):
    print("for A[" + i + ",0]:" + scalar(A[i,0]))
parfor(i in 0:nrow(A)-1):
    print("parfor A[" + i + ",0]:" + scalar(A[i,0]))
hello
world
!!!
for A[1,1]:1.0
for A[2,1]:3.0
for A[3,1]:5.0
parfor A[2,1]:3.0
parfor A[1,1]:1.0
parfor A[3,1]:5.0
hello
world
!!!
for A[0,0]:1.0
for A[1,0]:3.0
for A[2,0]:5.0
parfor A[0,0]:1.0
parfor A[2,0]:5.0
parfor A[1,0]:3.0
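Because parfor iterations may run in any order, it is often more useful to collect results in a matrix than to print them. The following is a minimal sketch of that pattern in DML, assuming that each iteration writes only to its own row of a result matrix. The name R is hypothetical.

```dml
A = matrix("1 2 3 4 5 6", rows=3, cols=2)

# One result row per iteration; each iteration touches only row i of R.
R = matrix(0, rows=nrow(A), cols=1)
parfor (i in 1:nrow(A)) {
    R[i,1] = sum(A[i,])
}

# The row sums of A are 3, 7, and 11, regardless of execution order.
print(toString(R))
```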
User-Defined Functions
Functions encapsulate useful functionality in SystemML. In addition to built-in functions, users can define their own functions. Functions take 0 or more parameters and return 0 or more values.
doSomething = function(matrix[double] mat) return (matrix[double] ret) {
    additionalCol = matrix(1, rows=nrow(mat), cols=1) # 3x1 matrix with 1 values
    ret = cbind(mat, additionalCol) # concatenate column to matrix
    ret = cbind(ret, seq(0, 2, 1))  # concatenate column (0,1,2) to matrix
    ret = cbind(ret, rowMaxs(ret))  # concatenate column of max row values to matrix
    ret = cbind(ret, rowSums(ret))  # concatenate column of row sums to matrix
}
A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
write(A, "A.csv", format="csv")
write(B, "B.csv", format="csv")
def doSomething(mat: matrix[float]) -> (ret: matrix[float]):
    additionalCol = full(1, rows=nrow(mat), cols=1) # 3x1 matrix with 1 values
    ret = cbind(mat, additionalCol) # concatenate column to matrix
    ret = cbind(ret, seq(0, 2, 1))  # concatenate column (0,1,2) to matrix
    ret = cbind(ret, rowMaxs(ret))  # concatenate column of max row values to matrix
    ret = cbind(ret, rowSums(ret))  # concatenate column of row sums to matrix
A = rand(rows=3, cols=2, min=0, max=2) # random 3x2 matrix with values 0 to 2
B = doSomething(A)
save(A, "A.csv", format="csv")
save(B, "B.csv", format="csv")
In the above example, a 3x2 matrix of random doubles between 0 and 2 is created using the rand() function. Additional parameters can be passed to rand() to control sparsity and other matrix characteristics.
Matrix A is passed to the doSomething function. A column of 1 values is concatenated to the matrix. A column consisting of the values (0, 1, 2) is concatenated to the matrix. Next, a column consisting of the maximum row values is concatenated to the matrix. A column consisting of the row sums is concatenated to the matrix, and this resulting matrix is returned to variable B. Matrix A is output to the A.csv file and matrix B is saved as the B.csv file.
1.6091961493071,0.7088614208099939
0.5984862383600267,1.5732118950764993
0.2947607068519842,1.9081406573366781
1.6091961493071,0.7088614208099939,1.0,0,1.6091961493071,4.927253719424194
0.5984862383600267,1.5732118950764993,1.0,1.0,1.5732118950764993,5.744910028513026
0.2947607068519842,1.9081406573366781,1.0,2.0,2.0,7.202901364188662
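The doSomething function returns a single matrix. Since functions can return more than one value, here is a short sketch of a DML function with two scalar return values, which are received with the bracketed assignment form. The function name minMax and the variable names lo and hi are hypothetical.

```dml
# Returns the smallest and largest element of a matrix as two scalars.
minMax = function(matrix[double] mat) return (double lo, double hi) {
    lo = min(mat)
    hi = max(mat)
}

A = matrix("1 2 3 4 5 6", rows=3, cols=2)
[lo, hi] = minMax(A)   # lo is 1 and hi is 6 for this matrix
print('min = ' + lo + ', max = ' + hi)
```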
Command-Line Arguments and Default Values
Command-line arguments can be passed to DML and PyDML scripts either as named arguments or as positional arguments. Named arguments are the preferred technique. Named arguments can be passed utilizing the -nvargs switch, and positional arguments can be passed using the -args switch.
Default values can be set using the ifdef() function.
In the example below, a matrix is read from the file system using named argument M. The number of rows to print is specified using the rowsToPrint argument, which defaults to 2 if no argument is supplied. Likewise, the number of columns is specified using colsToPrint with a default value of 2.
fileM = $M
numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2
m = read(fileM)
for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}
fileM = $M
numRowsToPrint = ifdef($rowsToPrint, 2) # default to 2
numColsToPrint = ifdef($colsToPrint, 2) # default to 2
m = load(fileM)
for (i in 0:numRowsToPrint-1):
    for (j in 0:numColsToPrint-1):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))
Example #1 Arguments:
-f ex.dml -nvargs M=m.csv rowsToPrint=1 colsToPrint=3
Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0
Example #2 Arguments:
-f ex.dml -nvargs M=m.csv
Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0
Example #1 Arguments:
-f ex.pydml -nvargs M=m.csv rowsToPrint=1 colsToPrint=3
Example #1 Results:
[0,0]:1.0
[0,1]:2.0
[0,2]:3.0
Example #2 Arguments:
-f ex.pydml -nvargs M=m.csv
Example #2 Results:
[0,0]:1.0
[0,1]:2.0
[1,0]:0.0
[1,1]:0.0
Here, we see identical functionality but with positional arguments.
fileM = $1
numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2
m = read(fileM)
for (i in 1:numRowsToPrint) {
    for (j in 1:numColsToPrint) {
        print('[' + i + ',' + j + ']:' + as.scalar(m[i,j]))
    }
}
fileM = $1
numRowsToPrint = ifdef($2, 2) # default to 2
numColsToPrint = ifdef($3, 2) # default to 2
m = load(fileM)
for (i in 0:numRowsToPrint-1):
    for (j in 0:numColsToPrint-1):
        print('[' + i + ',' + j + ']:' + scalar(m[i,j]))
Example #1 Arguments:
-f ex.dml -args m.csv 1 3
Example #1 Results:
[1,1]:1.0
[1,2]:2.0
[1,3]:3.0
Example #2 Arguments:
-f ex.dml -args m.csv
Example #2 Results:
[1,1]:1.0
[1,2]:2.0
[2,1]:0.0
[2,2]:0.0
Example #1 Arguments:
-f ex.pydml -args m.csv 1 3
Example #1 Results:
[0,0]:1.0
[0,1]:2.0
[0,2]:3.0
Example #2 Arguments:
-f ex.pydml -args m.csv
Example #2 Results:
[0,0]:1.0
[0,1]:2.0
[1,0]:0.0
[1,1]:0.0
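The ifdef() defaults shown in this section are numeric, but the same approach can be applied to string values such as file names. As a minimal sketch, the script below assumes a hypothetical file ex_out.dml and a hypothetical named argument out that falls back to colSums.csv when it is not supplied.

```dml
# ex_out.dml (hypothetical file name)
# Possible arguments:
#   -f ex_out.dml -nvargs M=m.csv
#   -f ex_out.dml -nvargs M=m.csv out=result.csv
fileM = $M
fileOut = ifdef($out, "colSums.csv")   # string default for the output path

m = read(fileM)
write(colSums(m), fileOut, format="csv")
```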
Additional Information
The Language Reference contains highly detailed information regarding DML.
In addition, many excellent examples can be found in the scripts directory.