Algorithms

SystemDS supports a number of machine learning algorithms out of the box.

As an example, the lm algorithm can be used as follows:

# Import numpy and SystemDS matrix
import numpy as np
from systemds.context import SystemDSContext
from systemds.matrix import Matrix
from systemds.operator.algorithm import lm

# Set a seed
np.random.seed(0)
# Generate matrix of feature vectors
features = np.random.rand(10, 15)
# Generate a 1-column matrix of response values
y = np.random.rand(10, 1)

# Compute the weights
with SystemDSContext() as sds:
    weights = lm(Matrix(sds, features), Matrix(sds, y)).compute()
    print(weights)

The output should be similar to:

[[-0.11538199]
 [-0.20386541]
 [-0.39956035]
 [ 1.04078623]
 [ 0.4327084 ]
 [ 0.18954599]
 [ 0.49858968]
 [-0.26812763]
 [ 0.09961844]
 [-0.57000751]
 [-0.43386048]
 [ 0.55358873]
 [-0.54638565]
 [ 0.2205885 ]
 [ 0.37957689]]

systemds.operator.algorithm.kmeans(x: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs K-Means clustering on the input matrix.

Parameters
  • x – Input dataset to perform K-Means on.

  • k – The number of centroids to use for the algorithm.

  • runs – The number of concurrent instances of K-Means to run (with different initial centroids).

  • max_iter – The maximum number of iterations to run the K-Means algorithm for.

  • eps – Tolerance for the algorithm to declare convergence using WCSS change ratio.

  • is_verbose – Boolean flag indicating whether the algorithm should run in verbose mode.

  • avg_sample_size_per_centroid – The average number of records per centroid in the data samples.

Returns

OperationNode list containing two outputs: 1. the clusters, 2. the cluster ID associated with each row in x.
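
To make the roles of k, max_iter and eps more concrete, the following is a minimal NumPy sketch of plain Lloyd's algorithm with a stopping rule based on the relative change in WCSS. It is not the SystemDS implementation: the function name kmeans_sketch, the seed argument, the default values and the initialization strategy are all assumptions made for illustration, and the runs and avg_sample_size_per_centroid parameters are not modelled.

```python
import numpy as np


def kmeans_sketch(x, k, max_iter=100, eps=1e-6, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper, not the SystemDS code)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: use k distinct input rows as the initial centroids.
    centroids = x[rng.choice(x.shape[0], size=k, replace=False)]
    prev_wcss = np.inf
    for _ in range(max_iter):
        # Squared distance from every row to every centroid, shape (n, k).
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Within-cluster sum of squares (WCSS) for the current assignment.
        wcss = dist[np.arange(x.shape[0]), labels].sum()
        # Declare convergence once the relative WCSS change is at most eps.
        if np.isfinite(prev_wcss) and abs(prev_wcss - wcss) <= eps * prev_wcss:
            break
        prev_wcss = wcss
        # Move each centroid to the mean of its rows; empty clusters stay put.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

The two return values mirror the two documented outputs (the clusters and one cluster ID per row). The IDs in this sketch are 0-based, which is a choice of the sketch rather than a statement about SystemDS.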

systemds.operator.algorithm.l2svm(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs L2SVM on the input matrix with the given labels.

Parameters
  • x – Input dataset

  • y – Input labels as a single-column matrix

  • kwargs – Dictionary of extra arguments

Returns

OperationNode containing the model fit.
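
As a rough illustration of what a fitted L2SVM model optimizes, the sketch below evaluates one common formulation of the objective: a binary SVM with squared hinge loss and an L2 penalty on the weights. This formulation, the labels in {-1, +1}, the name l2svm_objective_sketch, the reg parameter and the scaling constants are assumptions for illustration; none of them is taken from the SystemDS source.

```python
import numpy as np


def l2svm_objective_sketch(x, y, w, reg=1.0):
    """Squared-hinge SVM objective (hypothetical helper, illustrative only).

    x: (n, d) features, y: (n, 1) labels in {-1, +1}, w: (d, 1) weights.
    """
    # Margin violation per row: positive when a row is inside the margin.
    slack = np.maximum(1.0 - y * (x @ w), 0.0)
    # L2 penalty on the weights plus the squared slack terms.
    return 0.5 * reg * float((w ** 2).sum()) + float((slack ** 2).sum())
```

Under this reading, the model fit is a weight vector with one row per feature column of x.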

systemds.operator.algorithm.lm(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs linear regression (LM) on the input matrix with the given labels.

Parameters
  • x – Input dataset

  • y – Input labels as a single-column matrix

  • kwargs – Dictionary of extra arguments

Returns

OperationNode containing the model fit.
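
For intuition about the shape of the model fit, here is a minimal NumPy sketch of ordinary least squares. It only illustrates the underlying idea; the name least_squares_sketch is hypothetical, and the sketch ignores any extra arguments (such as regularization or intercept handling) that lm may accept through kwargs, so its numbers are not expected to match SystemDS output.

```python
import numpy as np


def least_squares_sketch(x, y):
    """Return w minimizing ||x @ w - y||^2 (hypothetical helper)."""
    # lstsq returns the minimum-norm solution when x is rank-deficient.
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w


# Usage: recover known weights from noise-free data.
rng = np.random.default_rng(0)
x_demo = rng.random((50, 3))
true_w = np.array([[1.0], [-2.0], [0.5]])
w_demo = least_squares_sketch(x_demo, x_demo @ true_w)
print(w_demo.shape)
```

The result has one row per feature column, which is why the lm example at the top of this page prints 15 values for a 10 x 15 feature matrix.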

systemds.operator.algorithm.multiLogReg(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs multiclass logistic regression on the input matrix using a trust region method.

See: Trust Region Newton Method for Logistic Regression, Lin, Weng and Keerthi, JMLR 9 (2008) 627-650.

Parameters
  • x – Input dataset to perform logistic regression on

  • y – Labels row-aligned with the input dataset

  • icpt – Intercept presence and the shifting and rescaling of the X columns (default 2): 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift and rescale X columns to mean = 0, variance = 1

  • tol – Tolerance (float) for the algorithm.

  • reg – Regularization parameter (lambda = 1/C); intercept settings are not regularized.

  • maxi – Maximum outer iterations of the algorithm

  • maxii – Maximum inner iterations of the algorithm

Returns

OperationNode of a matrix containing the trained regression parameters.
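
The three icpt settings can be pictured as a preprocessing step on X. The NumPy sketch below shows one way to read them; it is not the SystemDS code. The name apply_icpt_sketch, the placement of the intercept as a trailing column of ones, and the use of the sample variance (ddof=1) are assumptions.

```python
import numpy as np


def apply_icpt_sketch(x, icpt=2):
    """Preprocess x according to icpt (hypothetical helper, illustrative only)."""
    x = np.asarray(x, dtype=float)
    if icpt not in (0, 1, 2):
        raise ValueError("icpt must be 0, 1 or 2")
    if icpt == 0:
        # No intercept, no shifting, no rescaling.
        return x
    if icpt == 2:
        # Shift and rescale every column to mean 0 and variance 1.
        # Assumption: sample variance; constant columns are not handled here.
        x = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # icpt 1 and 2: add an intercept column (assumed to be the last column).
    return np.hstack([x, np.ones((x.shape[0], 1))])
```

With icpt = 1 or 2 the model gains one extra parameter for the intercept, which according to the reg description is not regularized.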

systemds.operator.algorithm.multiLogRegPredict(x: systemds.operator.operation_node.OperationNode, b: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs prediction on input data x using the trained model b.

Parameters
  • x – The data to perform classification on.

  • b – The regression parameters trained from multiLogReg.

  • y – The labels expected for the rows of x, used to calculate accuracy.

  • verbose – Boolean specifying if the prediction should be verbose.

Returns

OperationNode list containing three outputs: 1. the predicted means / probabilities, 2. the predicted response vector, 3. the scalar value of accuracy.
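
The three outputs can be illustrated with a small NumPy softmax sketch. This is a simplified stand-in, not the SystemDS routine: the name predict_sketch, the layout of b as one column per class, the 1-based class labels and accuracy reported as a fraction between 0 and 1 are all assumptions.

```python
import numpy as np


def predict_sketch(x, b, y):
    """Softmax prediction (hypothetical helper, illustrative only).

    x: (n, d) data, b: (d, K) parameters, y: (n, 1) labels in 1..K.
    """
    scores = x @ b
    # Subtract the row maximum for numerical stability before exponentiating.
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    # 1. Class probabilities; each row sums to 1.
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    # 2. Predicted response vector with 1-based class labels.
    predicted = probs.argmax(axis=1).reshape(-1, 1) + 1
    # 3. Fraction of rows whose prediction matches the expected label.
    accuracy = float((predicted == np.asarray(y).reshape(-1, 1)).mean())
    return probs, predicted, accuracy
```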

systemds.operator.algorithm.pca(x: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode

Performs PCA on the input matrix.

Parameters
  • x – Input dataset to perform Principal Component Analysis (PCA) on.

  • K – The number of reduced dimensions.

  • center – Boolean specifying if the input values should be centered.

  • scale – Boolean specifying if the input values should be scaled.

Returns

OperationNode list containing two outputs: 1. the dimensionality-reduced X input, 2. a matrix to reduce dimensionality similarly on unseen data.
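
To show how K, center and scale relate to the two outputs, here is a minimal NumPy sketch of PCA via an eigendecomposition of the covariance matrix. It is not the SystemDS implementation: the name pca_sketch, the default values, and the use of the sample standard deviation for scaling are assumptions.

```python
import numpy as np


def pca_sketch(x, K=2, center=True, scale=True):
    """PCA via eigendecomposition (hypothetical helper, illustrative only)."""
    x = np.asarray(x, dtype=float)
    if center:
        # Shift every column to mean 0.
        x = x - x.mean(axis=0)
    if scale:
        # Assumption: divide by the sample standard deviation of each column.
        x = x / x.std(axis=0, ddof=1)
    # Second-moment matrix; a true covariance matrix only when center=True.
    cov = (x.T @ x) / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the K eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:K]
    components = eigvecs[:, order]
    # Reduced data, plus the projection matrix reusable on unseen data.
    return x @ components, components
```

Unseen data would need the same centering and scaling before being multiplied by the returned projection matrix; the sketch leaves that bookkeeping out for brevity.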