Algorithms

SystemDS supports several machine learning algorithms out of the box.

As an example, the lm algorithm can be used as follows:

# Import numpy and SystemDS matrix
import numpy as np
from systemds.context import SystemDSContext
from systemds.matrix import Matrix
from systemds.operator.algorithm import lm

# Set a seed
np.random.seed(0)
# Generate matrix of feature vectors
features = np.random.rand(10, 15)
# Generate a 1-column matrix of response values
y = np.random.rand(10, 1)

# compute the weights
with SystemDSContext() as sds:
  weights = lm(Matrix(sds, features), Matrix(sds, y)).compute()
  print(weights)

The output should be similar to:

[[-0.11538199]
 [-0.20386541]
 [-0.39956035]
 [ 1.04078623]
 [ 0.4327084 ]
 [ 0.18954599]
 [ 0.49858968]
 [-0.26812763]
 [ 0.09961844]
 [-0.57000751]
 [-0.43386048]
 [ 0.55358873]
 [-0.54638565]
 [ 0.2205885 ]
 [ 0.37957689]]

systemds.operator.algorithm.kmeans(x: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs K-Means on the matrix input.

Parameters

x – Input dataset to perform K-Means on.
k – The number of centroids to use for the algorithm.
runs – The number of concurrent instances of K-Means to run (with different initial centroids).
max_iter – The maximum number of iterations to run the K-Means algorithm for.
eps – Tolerance for the algorithm to declare convergence using WCSS change ratio.
is_verbose – Boolean flag if the algorithm should be run in a verbose manner.
avg_sample_size_per_centroid – The average number of records per centroid in the data samples.

Returns

OperationNode List containing two outputs: 1. the clusters, 2. the cluster ID associated with each row in x.
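To make the roles of k, max_iter, and eps more concrete, here is a minimal NumPy sketch of Lloyd's algorithm that stops when the relative change in the within-cluster sum of squares (WCSS) drops to eps or below. It is not the SystemDS implementation: the function name kmeans_sketch, the random-row initialization, the seed parameter, and the 0-based cluster IDs are assumptions made for illustration, and the runs and avg_sample_size_per_centroid parameters are not modeled.

```python
import numpy as np


def kmeans_sketch(x, k, max_iter=1000, eps=1e-6, seed=0):
    """Lloyd's algorithm that stops once the relative WCSS change is <= eps."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct rows of x chosen at random.
    centroids = x[rng.choice(x.shape[0], size=k, replace=False)].copy()
    prev_wcss = np.inf
    for _ in range(max_iter):
        # Squared distance from every row to every centroid, shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        wcss = dist.min(axis=1).sum()
        # Converged when the WCSS change ratio falls to eps or below.
        if np.isfinite(prev_wcss) and abs(prev_wcss - wcss) <= eps * prev_wcss:
            break
        prev_wcss = wcss
        # Move each non-empty cluster's centroid to the mean of its rows.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    # Final assignment against the centroids that are returned.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, dist.argmin(axis=1)


# Two well-separated groups of rows; cluster IDs here are 0-based.
data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
                 [100.0, 100.0], [101.0, 101.0], [102.0, 102.0]])
centroids, ids = kmeans_sketch(data, k=2)
print(centroids)
print(ids)
```

The two return values mirror the two outputs described above: the cluster centroids and the cluster ID for each row of the input.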

systemds.operator.algorithm.l2svm(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs L2SVM on a matrix with the given labels.

Parameters

x – Input dataset
y – Input labels as a single-column matrix
kwargs – Dictionary of extra arguments

Returns

OperationNode containing the model fit.
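As a rough illustration of what an L2-SVM fit minimizes, the sketch below evaluates the commonly used objective of an L2 penalty plus squared hinge loss in plain NumPy. This is an assumption about the formulation, not taken from the SystemDS source: the function name l2svm_objective, the reg parameter, and the -1/+1 label encoding are hypothetical. It also shows the one-column shape expected for y.

```python
import numpy as np


def l2svm_objective(w, x, y, reg=1.0):
    """L2 penalty plus squared hinge loss for a weight column w."""
    # Margin violation per row; rows classified with margin >= 1 contribute 0.
    hinge = np.maximum(1.0 - y * (x @ w), 0.0)
    return float(0.5 * reg * np.sum(w ** 2) + np.sum(hinge ** 2))


x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0], [-1.0]])  # labels as a single column (assumed -1/+1)
w = np.array([[1.0], [-1.0]])
print(l2svm_objective(w, x, y))
```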

systemds.operator.algorithm.lm(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs linear regression (LM) on a matrix with the given labels.

Parameters

x – Input dataset
y – Input labels as a single-column matrix
kwargs – Dictionary of extra arguments

Returns

OperationNode containing the model fit.
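For intuition about the returned model fit, a linear model can be fitted by solving the regularized normal equations directly. The NumPy sketch below does that; it is not the SystemDS implementation, and the function name lm_direct_solve, the reg value, and the absence of an intercept are assumptions made for illustration.

```python
import numpy as np


def lm_direct_solve(x, y, reg=1e-7):
    """Solve (X^T X + reg * I) w = X^T y for the weight column w."""
    gram = x.T @ x + reg * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


rng = np.random.default_rng(0)
x = rng.random((50, 3))
w_true = np.array([[1.0], [-2.0], [0.5]])
y = x @ w_true  # noise-free responses, one column
weights = lm_direct_solve(x, y)
print(weights)  # one weight per feature column, close to w_true
```

The result has one row per feature column of x, which matches the shape of the weights printed in the lm example at the top of this page.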

systemds.operator.algorithm.multiLogReg(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs multiclass logistic regression on the matrix input using the trust region method.

See: Trust Region Newton Method for Logistic Regression, Lin, Weng and Keerthi, JMLR 9 (2008) 627-650.

Parameters

x – Input dataset to perform logistic regression on
y – Labels row-aligned with the input dataset
icpt – Intercept presence and shifting/rescaling of X columns, default 2: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
tol – Float tolerance for the algorithm.
reg – Regularization parameter (lambda = 1/C); intercept settings are not regularized.
maxi – Maximum outer iterations of the algorithm
maxii – Maximum inner iterations of the algorithm

Returns

OperationNode of a matrix containing the regression parameters trained.
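The three icpt settings can be pictured as a preprocessing step on X. The NumPy sketch below is only an illustration of the description above, not the SystemDS code: the function name apply_icpt, appending the intercept as the last column, using the sample variance (ddof=1), and leaving constant columns unscaled are all assumptions.

```python
import numpy as np


def apply_icpt(x, icpt=2):
    """Prepare X according to the icpt setting (0, 1, or 2)."""
    if icpt not in (0, 1, 2):
        raise ValueError("icpt must be 0, 1, or 2")
    x = np.asarray(x, dtype=float)
    if icpt == 0:
        # No intercept, no shifting, no rescaling.
        return x
    if icpt == 2:
        # Shift and rescale every column to mean 0 and variance 1.
        mean = x.mean(axis=0)
        std = x.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # assumption: leave constant columns unscaled
        x = (x - mean) / std
    # icpt 1 and 2 both add an intercept column of ones (assumed to go last).
    return np.hstack([x, np.ones((x.shape[0], 1))])


x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
for setting in (0, 1, 2):
    print(setting, apply_icpt(x, setting).shape)
```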

systemds.operator.algorithm.multiLogRegPredict(x: systemds.operator.operation_node.OperationNode, b: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs prediction on input data x using the trained model b.

Parameters

x – The data to perform classification on.
b – The regression parameters trained from multiLogReg.
y – The labels expected for the x dataset, used to calculate accuracy.
verbose – Boolean specifying if the prediction should be verbose.

Returns

OperationNode List containing three outputs: 1. the predicted means / probabilities, 2. the predicted response vector, 3. the scalar value of accuracy.
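The relationship between the three outputs can be sketched in NumPy, starting from a matrix of per-class scores. How SystemDS derives those scores from x and b is not shown here, and several details are assumptions for illustration: the function name predict_sketch, the softmax normalization, class labels numbered from 1, and accuracy reported as a fraction rather than a percentage.

```python
import numpy as np


def predict_sketch(scores, y):
    """Turn per-class scores into probabilities, predicted labels, and accuracy."""
    # Softmax over each row; subtracting the row max keeps exp() stable.
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Assumption: classes are numbered 1..k, so shift the argmax by one.
    predicted = probs.argmax(axis=1).reshape(-1, 1) + 1
    # Fraction of rows whose predicted label equals the expected label.
    accuracy = float((predicted == y).mean())
    return probs, predicted, accuracy


scores = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [5.0, 0.0, 0.0]])
y = np.array([[1], [2], [3], [2]])  # expected labels, one column
probs, predicted, accuracy = predict_sketch(scores, y)
print(predicted.ravel(), accuracy)  # [1 2 3 1] 0.75
```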

systemds.operator.algorithm.pca(x: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs PCA on the matrix input.

Parameters

x – Input dataset to perform Principal Component Analysis (PCA) on.
K – The number of reduced dimensions.
center – Boolean specifying if the input values should be centered.
scale – Boolean specifying if the input values should be scaled.

Returns

OperationNode List containing two outputs: 1. the dimensionality-reduced X input, 2. a matrix to reduce dimensionality similarly on unseen data.
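To show how K, center, and scale interact and what the two outputs look like, here is a minimal NumPy sketch of PCA through an eigendecomposition. It is not the SystemDS implementation: the function name pca_sketch, the use of the sample standard deviation (ddof=1), and the handling of constant columns are assumptions made for illustration.

```python
import numpy as np


def pca_sketch(x, K=2, center=True, scale=True):
    """Return the reduced data and the projection matrix with K columns."""
    x = np.asarray(x, dtype=float)
    if center:
        x = x - x.mean(axis=0)
    if scale:
        std = x.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # assumption: leave constant columns unscaled
        x = x / std
    # Second-moment matrix; equals the covariance matrix when x is centered.
    cov = (x.T @ x) / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the K eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:K]
    components = eigvecs[:, order]
    return x @ components, components


rng = np.random.default_rng(0)
data = rng.random((20, 4))
reduced, components = pca_sketch(data, K=2)
print(reduced.shape, components.shape)  # (20, 2) (4, 2)
```

Unseen rows can be reduced in the same way by applying the same centering and scaling and then multiplying by components, which is the role of the second output described above.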