Algorithms
SystemDS supports several machine learning algorithms out of the box.
As an example, the lm algorithm can be used as follows:
# Import numpy and SystemDS matrix
import numpy as np
from systemds.context import SystemDSContext
from systemds.matrix import Matrix
from systemds.operator.algorithm import lm
# Set a seed
np.random.seed(0)
# Generate matrix of feature vectors
features = np.random.rand(10, 15)
# Generate a 1-column matrix of response values
y = np.random.rand(10, 1)
# compute the weights
with SystemDSContext() as sds:
    weights = lm(Matrix(sds, features), Matrix(sds, y)).compute()
    print(weights)
The output should be similar to:
[[-0.11538199]
[-0.20386541]
[-0.39956035]
[ 1.04078623]
[ 0.4327084 ]
[ 0.18954599]
[ 0.49858968]
[-0.26812763]
[ 0.09961844]
[-0.57000751]
[-0.43386048]
[ 0.55358873]
[-0.54638565]
[ 0.2205885 ]
[ 0.37957689]]
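To make the meaning of the returned weights more concrete, here is a minimal NumPy-only sketch of an ordinary least-squares fit. It is not the SystemDS implementation: `fit_least_squares` is a hypothetical helper, and it ignores any intercept or regularization options that lm may accept through kwargs.

```python
import numpy as np

def fit_least_squares(features, y):
    """Return weights w minimizing ||features @ w - y||^2 (hypothetical helper)."""
    w, _residuals, _rank, _singular_values = np.linalg.lstsq(features, y, rcond=None)
    return w

# Illustrative data: 50 rows, 3 features, noiseless responses.
rng = np.random.default_rng(0)
features = rng.random((50, 3))
true_w = np.array([[2.0], [-1.0], [0.5]])
y = features @ true_w

weights = fit_least_squares(features, y)
print(weights.shape)  # one weight per feature column
```

With more rows than features and noiseless responses, the fitted weights match the weights used to generate y. The example at the top of this page has fewer rows than features, so its solution depends on how the solver handles the underdetermined case.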
-
systemds.operator.algorithm.kmeans(x: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode
Performs K-Means clustering on the input matrix.
- Parameters
x – Input dataset to perform K-Means on.
k – The number of centroids to use for the algorithm.
runs – The number of concurrent instances of K-Means to run (with different initial centroids).
max_iter – The maximum number of iterations to run the K-Means algorithm for.
eps – Tolerance for the algorithm to declare convergence using WCSS change ratio.
is_verbose – Boolean flag if the algorithm should be run in a verbose manner.
avg_sample_size_per_centroid – The average number of records per centroid in the data samples.
- Returns
OperationNode List containing two outputs: 1. the clusters, 2. the cluster ID associated with each row in x.
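The roles of k, max_iter and eps can be illustrated with a plain NumPy sketch of Lloyd's algorithm that stops once the relative change in the within-cluster sum of squares (WCSS) falls below eps. This is a sketch under stated assumptions, not the SystemDS code: `kmeans_sketch` is a hypothetical helper, the initialization (first k rows), the exact form of the convergence ratio, and the 1-based cluster IDs are all assumptions, and the runs and avg_sample_size_per_centroid parameters are not modelled.

```python
import numpy as np

def kmeans_sketch(x, k, max_iter=100, eps=1e-6):
    """Plain Lloyd's algorithm; returns (centroids, cluster IDs, WCSS history)."""
    centroids = x[:k].copy()  # assumption: first k rows as initial centroids
    history = []
    for _ in range(max_iter):
        # Squared distance from every row to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        ids = dist.argmin(axis=1)
        wcss = dist[np.arange(len(x)), ids].sum()
        history.append(wcss)
        # Stop once the relative WCSS improvement drops below eps.
        if len(history) > 1 and history[-2] - wcss <= eps * history[-2]:
            break
        for c in range(k):
            members = x[ids == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, ids + 1, history  # assumption: 1-based cluster IDs

# Two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centroids, ids, history = kmeans_sketch(x, k=2)
print(centroids.shape, ids.shape)
```

The two return values mirror the documented outputs: one row per centroid, and one cluster ID per input row.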
-
systemds.operator.algorithm.l2svm(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode
Performs L2SVM on the input matrix with the given labels.
- Parameters
x – Input dataset
y – Input labels as a single-column matrix
kwargs – Dictionary of extra arguments
- Returns
OperationNode containing the model fit.
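As background, an L2-regularized SVM with squared hinge loss minimizes an objective of the following general shape. The sketch below only evaluates that objective for a given weight vector. `l2svm_objective` and its `reg` argument are hypothetical names, and the label encoding (-1/+1) and the scaling constants are assumptions rather than a description of the SystemDS builtin.

```python
import numpy as np

def l2svm_objective(x, y, w, reg=1.0):
    """Squared-hinge objective for labels y in {-1, +1} (hypothetical helper)."""
    margins = 1.0 - y * (x @ w)
    hinge = np.maximum(margins, 0.0)  # only margin violations contribute
    return 0.5 * np.sum(hinge ** 2) + 0.5 * reg * np.sum(w ** 2)

x = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([[1.0], [-1.0]])

# Both points are classified with margin 2, so only the regularizer remains.
loss_separating = l2svm_objective(x, y, np.array([[1.0], [0.0]]))
# With zero weights every point violates the margin by exactly 1.
loss_zero = l2svm_objective(x, y, np.zeros((2, 1)))
print(loss_separating, loss_zero)
```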
-
systemds.operator.algorithm.lm(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode
Performs LM (linear regression) on the input matrix with the given labels.
- Parameters
x – Input dataset
y – Input labels as a single-column matrix
kwargs – Dictionary of extra arguments
- Returns
OperationNode containing the model fit.
-
systemds.operator.algorithm.multiLogReg(x: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode
Performs multiclass logistic regression on the input matrix using a trust region method.
See: Trust Region Newton Method for Logistic Regression, Lin, Weng and Keerthi, JMLR 9 (2008) 627-650.
- Parameters
x – Input dataset to perform logistic regression on
y – Labels, row-aligned with the input dataset
icpt – Intercept presence and shifting/rescaling of the X columns (default 2): 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift and rescale X columns to mean = 0, variance = 1
tol – float tolerance for the algorithm.
reg – Regularization parameter (lambda = 1/C); intercept settings are not regularized.
maxi – Maximum outer iterations of the algorithm
maxii – Maximum inner iterations of the algorithm
- Returns
OperationNode of a matrix containing the trained regression parameters.
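The three icpt modes can be illustrated on a small feature matrix. The following NumPy sketch is not the SystemDS code: `prepare_features` is a hypothetical helper, and both the position of the intercept column (appended last) and the use of the sample variance are assumptions made for illustration.

```python
import numpy as np

def prepare_features(x, icpt):
    """Mimic the three icpt modes on a feature matrix (hypothetical helper)."""
    if icpt not in (0, 1, 2):
        raise ValueError("icpt must be 0, 1 or 2")
    x = np.asarray(x, dtype=float)
    if icpt == 0:
        return x  # no intercept, no shifting, no rescaling
    if icpt == 2:
        mean = x.mean(axis=0)
        std = x.std(axis=0, ddof=1)  # assumption: sample variance
        std[std == 0.0] = 1.0        # leave constant columns unscaled
        x = (x - mean) / std
    ones = np.ones((x.shape[0], 1))
    return np.hstack([x, ones])      # assumption: intercept column goes last

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
out0 = prepare_features(x, 0)
out1 = prepare_features(x, 1)
out2 = prepare_features(x, 2)
print(out0.shape, out1.shape, out2.shape)
```

In mode 2 the original columns end up with mean 0 and variance 1, while mode 1 leaves them unchanged and only adds the column of ones.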
-
systemds.operator.algorithm.multiLogRegPredict(x: systemds.operator.operation_node.OperationNode, b: systemds.operator.operation_node.OperationNode, y: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode
Performs prediction on input data x using the trained model b.
- Parameters
x – The data to perform classification on.
b – The regression parameters trained from multiLogReg.
y – The expected labels for the rows of x, used to calculate accuracy.
verbose – Boolean specifying if the prediction should be verbose.
- Returns
OperationNode List containing three outputs: 1. the predicted means / probabilities, 2. the predicted response vector, 3. the scalar value of accuracy.
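The relationship between the three outputs can be sketched in NumPy as follows. This is an illustration under simplifying assumptions, not the SystemDS implementation: `predict_sketch` is a hypothetical helper, b is assumed to hold one column of coefficients per class with no intercept row, labels are assumed to be 1-based, and accuracy is reported as a fraction in [0, 1].

```python
import numpy as np

def predict_sketch(x, b, y):
    """Return (probabilities, predicted 1-based labels, accuracy in [0, 1])."""
    scores = x @ b                                        # one score column per class
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)             # softmax over classes
    predicted = probs.argmax(axis=1).reshape(-1, 1) + 1   # most probable class per row
    accuracy = float((predicted == y).mean())
    return probs, predicted, accuracy

x = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
b = np.eye(2)                       # illustrative coefficients, not a trained model
y = np.array([[1], [2], [2]])       # the last row is deliberately mislabeled
probs, predicted, accuracy = predict_sketch(x, b, y)
print(predicted.ravel(), accuracy)
```

The predicted response is the most probable class per row, and the accuracy compares it with y, which is why y is required even though it is not needed for the prediction itself.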
-
systemds.operator.algorithm.pca(x: systemds.operator.operation_node.OperationNode, **kwargs: Dict[str, Union[DAGNode, str, int, float, bool]]) → systemds.operator.operation_node.OperationNode
Performs PCA on the input matrix.
- Parameters
x – Input dataset to perform Principal Component Analysis (PCA) on.
K – The number of reduced dimensions.
center – Boolean specifying if the input values should be centered.
scale – Boolean specifying if the input values should be scaled.
- Returns
OperationNode List containing two outputs: 1. the dimensionality-reduced X input, 2. a matrix to reduce dimensionality similarly on unseen data.
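How K, center and scale interact, and why the second output is useful for unseen data, can be shown with a NumPy sketch based on an eigen-decomposition of the covariance matrix. `pca_sketch` is a hypothetical helper written for illustration; the choice of decomposition and of the sample standard deviation for scaling are assumptions and may differ from the SystemDS builtin.

```python
import numpy as np

def pca_sketch(x, k, center=True, scale=False):
    """Return (reduced data, projection matrix); hypothetical helper."""
    x = np.asarray(x, dtype=float)
    if center:
        x = x - x.mean(axis=0)
    if scale:
        std = x.std(axis=0, ddof=1)
        std[std == 0.0] = 1.0                # leave constant columns unscaled
        x = x / std
    cov = (x.T @ x) / (x.shape[0] - 1)       # covariance matrix when centered
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    components = eigvecs[:, order]
    return x @ components, components

rng = np.random.default_rng(0)
data = rng.random((30, 4))
reduced, components = pca_sketch(data, k=2)
print(reduced.shape, components.shape)
```

Unseen rows can be reduced in the same way by applying the same centering and scaling and then multiplying by the returned projection matrix.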