Algorithms

SystemDS supports a variety of machine learning algorithms out of the box.

As an example, the lm (linear regression) algorithm can be used as follows:

# Import numpy and SystemDS
import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import lm

# Set a seed
np.random.seed(0)
# Generate matrix of feature vectors
features = np.random.rand(10, 15)
# Generate a 1-column matrix of response values
y = np.random.rand(10, 1)

# compute the weights
with SystemDSContext() as sds:
  weights = lm(sds.from_numpy(features), sds.from_numpy(y)).compute()
  print(weights)

The output should be similar to

[[-0.11538199]
[-0.20386541]
[-0.39956035]
[ 1.04078623]
[ 0.4327084 ]
[ 0.18954599]
[ 0.49858968]
[-0.26812763]
[ 0.09961844]
[-0.57000751]
[-0.43386048]
[ 0.55358873]
[-0.54638565]
[ 0.2205885 ]
[ 0.37957689]]
systemds.operator.algorithm.WoE(X: Matrix, Y: Matrix, mask: Matrix)

Computes weight of evidence / information gain.

Parameters:
  • X

  • Y

  • mask

Returns:

Weighted X matrix where the entropy mask is applied

Returns:

A entropy matrix to apply to data

systemds.operator.algorithm.WoEApply(X: Matrix, Y: Matrix, entropyMatrix: Matrix)

Applies weight of evidence / information gain to new data.

Parameters:
  • X

  • Y

  • entropyMatrix

Returns:

Weighted X matrix where the entropy mask is applied

systemds.operator.algorithm.abstain(X: Matrix, Y: Matrix, threshold: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function calls the multiLogReg-function, which solves multinomial logistic regression using the trust region method.

Parameters:
  • X – matrix of feature vectors

  • Y – matrix with category labels

  • threshold – threshold to clear otherwise return X and Y unmodified

  • verbose – flag specifying if logging information should be printed

Returns:

abstained output X

Returns:

abstained output Y

systemds.operator.algorithm.als(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script computes an approximate factorization of a low-rank matrix X into two matrices U and V using different implementations of the Alternating-Least-Squares (ALS) algorithm. Matrices U and V are computed by minimizing a loss function (with regularization).

Parameters:
  • X – Location to read the input matrix X to be factorized

  • rank – Rank of the factorization

  • regType – Regularization: “L2” = L2 regularization, f(U, V) = 0.5 * sum(W * (U %*% V - X)^2) + 0.5 * reg * (sum(U^2) + sum(V^2)); “wL2” = weighted L2 regularization, f(U, V) = 0.5 * sum(W * (U %*% V - X)^2) + 0.5 * reg * (sum(U^2 * row_nonzeros) + sum(V^2 * col_nonzeros))

  • reg – Regularization parameter, no regularization if 0.0

  • maxi – Maximum number of iterations

  • check – Check for convergence after every iteration, i.e., updating U and V once

  • thr – Assuming check is set to TRUE, the algorithm stops and convergence is declared if the decrease in loss in any two consecutive iterations falls below this threshold; if check is FALSE thr is ignored

  • seed – The seed for the randomized parts of the algorithm

  • verbose – If the algorithm should run verbosely

Returns:

An m x r matrix where r is the factorization rank

Returns:

An m x r matrix where r is the factorization rank
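
For illustration, a minimal sketch of a factorization run, following the lm example above; the 10 x 15 random matrix, rank=2, and maxi=20 are arbitrary demo choices, and it is assumed here that the two factor outputs can be unpacked from the result of compute():

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import als

np.random.seed(0)
# arbitrary dense demo matrix to factorize
X = np.random.rand(10, 15)

with SystemDSContext() as sds:
  # rank and maxi are the kwargs documented above
  U, V = als(sds.from_numpy(X), rank=2, maxi=20).compute()
  print(U.shape, V.shape)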

systemds.operator.algorithm.alsCG(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script computes an approximate factorization of a low-rank matrix X into two matrices U and V using the Alternating-Least-Squares (ALS) algorithm with conjugate gradient. Matrices U and V are computed by minimizing a loss function (with regularization).

Parameters:
  • X – Location to read the input matrix X to be factorized

  • rank – Rank of the factorization

  • regType – Regularization: “L2” = L2 regularization, f(U, V) = 0.5 * sum(W * (U %*% V - X)^2) + 0.5 * reg * (sum(U^2) + sum(V^2)); “wL2” = weighted L2 regularization, f(U, V) = 0.5 * sum(W * (U %*% V - X)^2) + 0.5 * reg * (sum(U^2 * row_nonzeros) + sum(V^2 * col_nonzeros))

  • reg – Regularization parameter, no regularization if 0.0

  • maxi – Maximum number of iterations

  • check – Check for convergence after every iteration, i.e., updating U and V once

  • thr – Assuming check is set to TRUE, the algorithm stops and convergence is declared if the decrease in loss in any two consecutive iterations falls below this threshold; if check is FALSE thr is ignored

  • seed – The seed for the randomized parts of the algorithm

  • verbose – If the algorithm should run verbosely

Returns:

An m x r matrix where r is the factorization rank

Returns:

An m x r matrix where r is the factorization rank

systemds.operator.algorithm.alsDS(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Alternating-Least-Squares (ALS) algorithm using a direct solve method for individual least squares problems (reg=”L2”). This script computes an approximate factorization of a low-rank matrix V into two matrices L and R. Matrices L and R are computed by minimizing a loss function (with regularization).

Parameters:
  • X – Location to read the input matrix V to be factorized

  • rank – Rank of the factorization

  • reg – Regularization parameter, no regularization if 0.0

  • maxi – Maximum number of iterations

  • check – Check for convergence after every iteration, i.e., updating L and R once

  • thr – Assuming check is set to TRUE, the algorithm stops and convergence is declared if the decrease in loss in any two consecutive iterations falls below this threshold; if check is FALSE thr is ignored

  • seed – The seed for the randomized parts of the algorithm

  • verbose – If the algorithm should run verbosely

Returns:

An m x r matrix where r is the factorization rank

Returns:

An m x r matrix where r is the factorization rank

systemds.operator.algorithm.alsPredict(userIDs: Matrix, I: Matrix, L: Matrix, R: Matrix)

This script computes the ratings/scores for a given list of userIDs using 2 factor matrices L and R. We assume that all users have rated at least once and all items have been rated at least once.

Parameters:
  • userIDs – Column vector of user-ids (n x 1)

  • I – Indicator matrix user-id x user-id to exclude from scoring

  • L – The factor matrix L: user-id x feature-id

  • R – The factor matrix R: feature-id x item-id

Returns:

The output user-id/item-id/score matrix

systemds.operator.algorithm.alsTopkPredict(userIDs: Matrix, I: Matrix, L: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script computes the top-K ratings/scores for a given list of userIDs using 2 factor matrices L and R. We assume that all users have rated at least once and all items have been rated at least once.

Parameters:
  • userIDs – Column vector of user-ids (n x 1)

  • I – Indicator matrix user-id x user-id to exclude from scoring

  • L – The factor matrix L: user-id x feature-id

  • R – The factor matrix R: feature-id x item-id

  • K – The number of top-K items

Returns:

A matrix containing the top-K item-ids with highest predicted ratings for the specified users (rows)

Returns:

A matrix containing the top-K predicted ratings for the specified users (rows)

systemds.operator.algorithm.apply_pipeline(testData: Frame, pip: Frame, applyFunc: Frame, hp: Matrix, exState: List, iState: List, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script reads the dirty and clean data, applies the best pipeline to the dirty data, classifies both the cleaned and the original dataset, and checks whether the cleaned dataset performs the same as the original dataset in terms of classification accuracy.

Parameters:
  • trainData

  • testData

  • metaData

  • lp

  • pip

  • hp

  • evaluationFunc

  • evalFunHp

  • isLastLabel

  • correctTypos

Returns:

systemds.operator.algorithm.arima(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function that implements ARIMA

Parameters:
  • X – The input Matrix to apply Arima on.

  • max_func_invoc

  • p – non-seasonal AR order

  • d – non-seasonal differencing order

  • q – non-seasonal MA order

  • P – seasonal AR order

  • D – seasonal differencing order

  • Q – seasonal MA order

  • s – period in terms of number of time-steps

  • include_mean – center to mean 0, and include in result

  • solver – solver, is either “cg” or “jacobi”

Returns:

The calculated coefficients

systemds.operator.algorithm.autoencoder_2layer(X: Matrix, num_hidden1: int, num_hidden2: int, max_epochs: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Trains a 2-layer autoencoder with minibatch SGD and step-size decay. If invoked with H1 > H2, it becomes a ‘bowtie’ structured autoencoder. Weights are initialized using Glorot & Bengio (2010) AISTATS initialization. The script standardizes the input before training (can be turned off). Also, it randomly reshuffles rows before training. Currently, tanh is set to be the activation function; by re-implementing the ‘func’ DML-bodied function, one can change the activation.

Parameters:
  • X – Filename where the input is stored

  • num_hidden1 – Number of neurons in the 1st hidden layer

  • num_hidden2 – Number of neurons in the 2nd hidden layer

  • max_epochs – Number of epochs to train for

  • full_obj – If TRUE, computes the objective function value (squared loss) at the end of each epoch. Note that computing the full objective can take a lot of time.

  • batch_size – Mini-batch size (training parameter)

  • step – Initial step size (training parameter)

  • decay – Decays step size after each epoch (training parameter)

  • mu – Momentum parameter (training parameter)

  • W1_rand – Weights might be initialized via input matrices

  • W2_rand

  • W3_rand

  • W4_rand

Returns:

Matrix storing weights between input layer and 1st hidden layer

Returns:

Matrix storing bias between input layer and 1st hidden layer

Returns:

Matrix storing weights between 1st hidden layer and 2nd hidden layer

Returns:

Matrix storing bias between 1st hidden layer and 2nd hidden layer

Returns:

Matrix storing weights between 2nd hidden layer and 3rd hidden layer

Returns:

Matrix storing bias between 2nd hidden layer and 3rd hidden layer

Returns:

Matrix storing weights between 3rd hidden layer and output layer

Returns:

Matrix storing bias between 3rd hidden layer and output layer

Returns:

Matrix storing the hidden (2nd) layer representation if needed

systemds.operator.algorithm.bandit(X_train: Matrix, Y_train: Matrix, X_test: Matrix, Y_test: Matrix, metaList: List, evaluationFunc: str, evalFunHp: Matrix, lp: Frame, lpHp: Matrix, primitives: Frame, param: Frame, baseLineScore: float, cv: bool, **kwargs: Dict[str, DAGNode | str | int | float | bool])

In the bandit function, the objective is to find an arm that optimizes a known functional of the unknown arm-reward distributions.

Parameters:
  • X_train

  • Y_train

  • X_test

  • Y_test

  • metaList

  • evaluationFunc

  • evalFunHp

  • lp

  • primitives

  • params

  • K

  • R

  • baseLineScore

  • cv

  • cvk

  • verbose

  • output

Returns:

systemds.operator.algorithm.bivar(X: Matrix, S1: Matrix, S2: Matrix, T1: Matrix, T2: Matrix, verbose: bool)

For a given pair of attribute sets, compute bivariate statistics between all attribute pairs. Given index1 = {A_11, A_12, …, A_1m} and index2 = {A_21, A_22, …, A_2n}, compute bivariate stats for the m*n pairs (A_1i, A_2j), 1 <= i <= m and 1 <= j <= n.

Parameters:
  • X – Input matrix

  • S1 – First attribute set {A_11, A_12, … A_1m}

  • S2 – Second attribute set {A_21, A_22, … A_2n}

  • T1 – Kind for attributes in S1 (kind=1 for scale, kind=2 for nominal, kind=3 for ordinal)

  • T2 – Kind for attributes in S2 (kind=1 for scale, kind=2 for nominal, kind=3 for ordinal)

  • verbose – Print bivar stats

Returns:

basestats_scale_scale as output with bivar stats

Returns:

basestats_nominal_scale as output with bivar stats

Returns:

basestats_nominal_nominal as output with bivar stats

Returns:

basestats_ordinal_ordinal as output with bivar stats

systemds.operator.algorithm.components(G: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Computes the connected components of a graph and returns a vector indicating the assignment of vertices to components, where each component is identified by the maximum vertex ID (i.e., row/column position of the input graph)

Parameters:
  • G – Adjacency matrix of the graph (row/column positions correspond to vertex IDs)

  • maxi – Maximum number of iterations (0 = until convergence)

  • verbose – Flag specifying if logging information should be printed

Returns:

Vector indicating the assignment of vertices to connected components
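
As a hedged sketch, the builtin could be called on a small adjacency matrix like this; the 4-vertex demo graph is an arbitrary choice:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import components

# adjacency matrix of a 4-vertex graph with two components: {1,2} and {3,4}
G = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=np.float64)

with SystemDSContext() as sds:
  C = components(sds.from_numpy(G)).compute()
  print(C)  # each vertex labeled with the maximum vertex ID of its component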

systemds.operator.algorithm.confusionMatrix(P: Matrix, Y: Matrix)

Accepts a vector of predictions and a one-hot-encoded matrix of labels. It computes the max value of each vector and compares them; afterwards, it calculates and returns the sum of classifications and the average of each true class.

                True Labels
                  1    2
              1   TP | FP
Predictions      ----+----
              2   FN | TN
Parameters:
  • P – vector of Predictions

  • Y – vector of Golden standard One Hot Encoded; the one hot encoded vector of actual labels

Returns:

The Confusion Matrix Sums of classifications

Returns:

The Confusion Matrix averages of each true class
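
A small sketch of how this might be called; the toy predictions and one-hot labels are arbitrary, and it is assumed that the two documented outputs can be unpacked from compute():

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import confusionMatrix

# predicted class ids (1-based) and one-hot encoded true labels
P = np.array([[1], [2], [2], [1]], dtype=np.float64)
Y = np.array([[1, 0], [0, 1], [1, 0], [1, 0]], dtype=np.float64)

with SystemDSContext() as sds:
  sums, avgs = confusionMatrix(sds.from_numpy(P), sds.from_numpy(Y)).compute()
  print(sums)  # counts of (prediction, true label) combinations
  print(avgs)  # averages per true class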

systemds.operator.algorithm.cor(X: Matrix)

This function computes the correlation matrix of the input matrix.

Parameters:

X – A Matrix Input to compute the correlation on

Returns:

Correlation matrix of the input matrix
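
Following the pattern of the lm example above, a minimal sketch (random 50 x 4 demo data):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import cor

np.random.seed(0)
X = np.random.rand(50, 4)  # 50 samples, 4 variables

with SystemDSContext() as sds:
  C = cor(sds.from_numpy(X)).compute()
  print(C)  # 4 x 4 correlation matrix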

systemds.operator.algorithm.correctTypos(strings: Frame, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Corrects corrupted frames of strings. This algorithm operates on the assumption that most strings are correct, and simply swaps strings that do not occur often with similar strings that occur more often.

References:
Fred J. Damerau. 1964. 
  A technique for computer detection and correction of spelling errors. 
  Commun. ACM 7, 3 (March 1964), 171–176. 
  DOI:https://doi.org/10.1145/363958.363994
Parameters:
  • strings – The nx1 input frame of corrupted strings

  • frequency_threshold – Strings that occur above this frequency level will not be corrected

  • distance_threshold – Max distance at which strings are considered similar

  • is_verbose – Print debug information

Returns:

Corrected nx1 output frame

systemds.operator.algorithm.correctTyposApply(strings: Frame, distance_matrix: Matrix, dict: Frame, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Corrects corrupted frames of strings. This algorithm operates on the assumption that most strings are correct, and simply swaps strings that do not occur often with similar strings that occur more often.

References:
Fred J. Damerau. 1964. 
  A technique for computer detection and correction of spelling errors. 
  Commun. ACM 7, 3 (March 1964), 171–176. 
  DOI:https://doi.org/10.1145/363958.363994

TODO: future: add parameter for list of words that are sure to be correct

Parameters:
  • strings – The nx1 input frame of corrupted strings

  • nullMask

  • frequency_threshold – Strings that occur above this frequency level will not be corrected

  • distance_threshold – Max distance at which strings are considered similar

  • distance_matrix

  • dict

Returns:

Corrected nx1 output frame

systemds.operator.algorithm.cox(X: Matrix, TE: Matrix, F: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script fits a Cox proportional hazard regression model. The Breslow method is used for handling ties, and the regression parameters are computed using a trust region Newton method with conjugate gradient.

Parameters:
  • X – Location to read the input matrix X containing the survival data, with the following information: 1: timestamps; 2: whether an event occurred (1) or data is censored (0); 3: feature vectors

  • TE – Column indices of X as a column vector which contain timestamp (first row) and event information (second row)

  • F – Column indices of X as a column vector which are to be used for fitting the Cox model

  • R – If factors (categorical variables) are available in the input matrix X, location to read matrix R containing the start and end indices of the factors in X (R[,1]: start indices, R[,2]: end indices). Alternatively, the user can specify the indices of the baseline level of each factor which needs to be removed from X; in this case the start and end indices corresponding to the baseline level need to be the same. If R is not provided, by default all variables are considered to be continuous.

  • alpha – Parameter to compute a 100*(1-alpha)% confidence interval for the betas

  • tol – Tolerance (“epsilon”)

  • moi – Max. number of outer (Newton) iterations

  • mii – Max. number of inner (conjugate gradient) iterations, 0 = no max

Returns:

A D x 7 matrix M, where D denotes the number of covariates, with the following schema: M[,1]: betas; M[,2]: exp(betas); M[,3]: standard error of betas; M[,4]: Z; M[,5]: P-value; M[,6]: lower 100*(1-alpha)% confidence interval of betas; M[,7]: upper 100*(1-alpha)% confidence interval of betas

Returns:

Two matrices containing a summary of some statistics of the fitted model: 1) file S with the following format: row 1: no. of observations; row 2: no. of events; row 3: log-likelihood; row 4: AIC; row 5: Rsquare (Cox & Snell); row 6: max possible Rsquare. 2) file T with the following format: row 1: likelihood ratio test statistic, degrees of freedom, P-value; row 2: Wald test statistic, degrees of freedom, P-value; row 3: Score (log-rank) test statistic, degrees of freedom, P-value

Returns:

Additionally, the following matrices are stored (needed for prediction): 1) a column matrix RT that contains the order-preserving recoded timestamps from X; 2) matrix XO, which is matrix X with sorted timestamps; 3) the variance-covariance matrix of the betas COV; 4) a column matrix MF that contains the column indices of X with the baseline factors removed (if available)

systemds.operator.algorithm.cspline(X: Matrix, Y: Matrix, inp_x: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Solves Cubic Spline Interpolation

Algorithm: implements https://en.wikipedia.org/wiki/Spline_interpolation#Algorithm_to_find_the_interpolating_cubic_spline. It uses a natural spline with q1’’(x0) == qn’’(xn) == 0.0.

Parameters:
  • X – 1-column matrix of x-value knots. It is assumed that the x values are monotonically increasing and that there are no duplicate points in X

  • Y – 1-column matrix of corresponding y values knots

  • inp_x – the given input x, for which the cspline will find predicted y

  • mode – Specifies the method for cspline (DS - Direct Solve, CG - Conjugate Gradient)

  • tol – Tolerance (epsilon); the conjugate gradient procedure terminates early if the L2 norm of the beta-residual is less than tolerance * its initial norm

  • maxi – Maximum number of conjugate gradient iterations, 0 = no maximum

Returns:

Predicted value

Returns:

Matrix of k parameters

systemds.operator.algorithm.csplineCG(X: Matrix, Y: Matrix, inp_x: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin that solves cubic spline interpolation using conjugate gradient algorithm

Parameters:
  • X – 1-column matrix of x-value knots. It is assumed that the x values are monotonically increasing and that there are no duplicate points in X

  • Y – 1-column matrix of corresponding y values knots

  • inp_x – the given input x, for which the cspline will find predicted y.

  • tol – Tolerance (epsilon); the conjugate gradient procedure terminates early if the L2 norm of the beta-residual is less than tolerance * its initial norm

  • maxi – Maximum number of conjugate gradient iterations, 0 = no maximum

Returns:

Predicted value

Returns:

Matrix of k parameters

systemds.operator.algorithm.csplineDS(X: Matrix, Y: Matrix, inp_x: float)

Builtin that solves cubic spline interpolation using a direct solver.

Parameters:
  • X – 1-column matrix of x-value knots. It is assumed that the x values are monotonically increasing and that there are no duplicate points in X

  • Y – 1-column matrix of corresponding y values knots

  • inp_x – the given input x, for which the cspline will find predicted y.

Returns:

Predicted value

Returns:

Matrix of k parameters
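
A minimal sketch of an interpolation call; the knots below are an arbitrary toy example, and the two documented outputs are assumed to be unpackable from compute():

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import csplineDS

# monotonically increasing x knots (no duplicates) and their y values
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([[1.0], [4.0], [9.0], [16.0]])

with SystemDSContext() as sds:
  pred_y, K = csplineDS(sds.from_numpy(X), sds.from_numpy(Y), inp_x=2.5).compute()
  print(pred_y)  # interpolated y value at x = 2.5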

systemds.operator.algorithm.cvlm(X: Matrix, y: Matrix, k: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The cvlm-function is used for cross-validation of the provided data model. This function follows a non-exhaustive cross validation method. It uses lm and lmPredict functions to solve the linear regression and to predict the class of a feature vector with no intercept, shifting, and rescaling.

Parameters:
  • X – Recorded Data set into matrix

  • y – 1-column matrix of response values.

  • k – Number of subsets (folds); it should always be more than 1 and less than nrow(X)

  • icpt – Intercept presence, shifting and rescaling the columns of X

  • reg – Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent, sparse, or numerous features

Returns:

Response values

Returns:

Validated data set
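
A hedged sketch of a 3-fold cross-validation run on random demo data; the two documented outputs are assumed to be unpackable from compute():

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import cvlm

np.random.seed(0)
X = np.random.rand(30, 5)
y = np.random.rand(30, 1)

with SystemDSContext() as sds:
  # k must be more than 1 and less than nrow(X)
  y_pred, betas = cvlm(sds.from_numpy(X), sds.from_numpy(y), k=3).compute()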

systemds.operator.algorithm.dbscan(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Implements the DBSCAN clustering algorithm using a Euclidean distance matrix.

Parameters:
  • X – The input Matrix to do DBSCAN on.

  • eps – Maximum distance between two points for one to be considered reachable from the other.

  • minPts – Number of points in a neighborhood for a point to be considered as a core point (includes the point itself).

Returns:

clustering Matrix
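
A minimal sketch on synthetic data; the two well-separated blobs, eps=1.0, and minPts=5 are arbitrary demo choices:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import dbscan

np.random.seed(0)
# two well-separated 2D blobs
X = np.vstack([np.random.rand(20, 2), np.random.rand(20, 2) + 10])

with SystemDSContext() as sds:
  clusters = dbscan(sds.from_numpy(X), eps=1.0, minPts=5).compute()
  print(clusters)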

systemds.operator.algorithm.dbscanApply(X: Matrix, clusterModel: Matrix, eps: float)

Implements the outlier detection/prediction algorithm using a DBScan model

Parameters:
  • X – The input Matrix to do outlier detection on.

  • clusterModel – Model of clusters to predict outliers against.

  • eps – Maximum distance between two points for one to be considered reachable from the other.

Returns:

Predicted outliers

systemds.operator.algorithm.decisionTree(X: Matrix, Y: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin script implementing classification trees with scale and categorical features

Parameters:
  • X – Feature matrix X; note that X needs to be both recoded and dummy coded

  • Y – Label matrix Y; note that Y needs to be both recoded and dummy coded

  • R – Matrix R which for each feature in X contains the following information - R[1,]: Row Vector which indicates if feature vector is scalar or categorical. 1 indicates a scalar feature vector, other positive Integers indicate the number of categories If R is not provided by default all variables are assumed to be scale

  • bins – Number of equiheight bins per scale feature to choose thresholds

  • depth – Maximum depth of the learned tree

  • verbose – boolean specifying if the algorithm should print information while executing

Returns:

Matrix M where each column corresponds to a node in the learned tree and each row contains the following information:
M[1,j]: id of node j (in a complete binary tree)
M[2,j]: Offset (no. of columns) to the left child of j if j is an internal node, otherwise 0
M[3,j]: Feature index of the feature (scale feature id if the feature is scale, or categorical feature id if the feature is categorical) that node j looks at if j is an internal node, otherwise 0
M[4,j]: Type of the feature that node j looks at if j is an internal node (holds the same information as the R input vector)
M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale, otherwise the size of the subset of values stored in rows 6,7,… if j is categorical. If j is a leaf node: number of misclassified samples reaching node j
M[6:,j]: If j is an internal node: the threshold the example’s feature value is compared to is stored at M[6,j] if the feature chosen for j is scale; otherwise, if the feature chosen for j is categorical, rows 6,7,… depict the value subset chosen for j. If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0

systemds.operator.algorithm.decisionTreePredict(M: Matrix, X: Matrix, strategy: str)

Builtin script implementing prediction based on classification trees with scale features using prediction methods of the Hummingbird paper (https://www.usenix.org/system/files/osdi20-nakandala.pdf).

Parameters:
  • M – Decision tree matrix M, as generated by scripts/builtin/decisionTree.dml, where each column corresponds to a node in the learned tree and each row contains the following information:
    M[1,j]: id of node j (in a complete binary tree)
    M[2,j]: Offset (no. of columns) to the left child of j if j is an internal node, otherwise 0
    M[3,j]: Feature index of the feature (scale feature id if the feature is scale, or categorical feature id if the feature is categorical) that node j looks at if j is an internal node, otherwise 0
    M[4,j]: Type of the feature that node j looks at if j is an internal node (holds the same information as the R input vector)
    M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale, otherwise the size of the subset of values stored in rows 6,7,… if j is categorical. If j is a leaf node: number of misclassified samples reaching node j
    M[6:,j]: If j is an internal node: the threshold the example’s feature value is compared to is stored at M[6,j] if the feature chosen for j is scale; otherwise, if the feature chosen for j is categorical, rows 6,7,… depict the value subset chosen for j. If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0

  • X – Feature matrix X

  • strategy – Prediction strategy, can be one of [“GEMM”, “TT”, “PTT”], referring to “Generic matrix multiplication”, “Tree traversal”, and “Perfect tree traversal”, respectively

Returns:

Matrix containing the predicted labels for X

systemds.operator.algorithm.deepWalk(Graph: Matrix, w: int, d: int, gamma: int, t: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script performs DeepWalk on a given graph (https://arxiv.org/pdf/1403.6652.pdf)

Parameters:
  • Graph – adjacency matrix of a graph (n x n)

  • w – window size

  • d – embedding size

  • gamma – walks per vertex

  • t – walk length

  • alpha – learning rate

  • beta – factor for decreasing learning rate

Returns:

matrix of vertex/word representation (n x d)

systemds.operator.algorithm.denialConstraints(dataFrame: Frame, constraintsFrame: Frame)

This function considers some constraints indicating statements that can NOT happen in the data (denial constraints).

EXAMPLE:
dataFrame:

     rank       discipline   yrs.since.phd   yrs.service   sex      salary
1    Prof       B            19              18            Male     139750
2    Prof       B            20              16            Male     173200
3    AsstProf   B            3               3             Male     79750.56
4    Prof       B            45              39            Male     115000
5    Prof       B            40              40            Male     141500
6    AssocProf  B            6               6             Male     97000
7    Prof       B            30              23            Male     175000
8    Prof       B            45              45            Male     147765
9    Prof       B            21              20            Male     119250
10   Prof       B            18              18            Female   129000
11   AssocProf  B            12              8             Male     119800
12   AsstProf   B            7               2             Male     79800
13   AsstProf   B            1               1             Male     77700

constraintsFrame:

idx   constraint.type   group.by   group.variable      group.option   variable1      relation   variable2
1     variableCompare   FALSE                                         yrs.since.phd  <          yrs.service
2     instanceCompare   TRUE       rank                Prof           yrs.service    ><         salary
3     valueCompare      FALSE                                         salary         =          78182
4     variableCompare   TRUE       discipline          B              yrs.service    >          yrs.since.phd

Example, explanation of constraint 2: it can’t happen that one professor of rank Prof has more years of service than another, but a lower salary.

Parameters:
  • dataFrame – frame whose columns represent the variables of the data and whose rows correspond to different tuples or instances. It is recommended to have a column indexing the instances from 1 to N (N = number of instances).

  • constraintsFrame – frame with fixed columns and each row representing one constraint:
    1. idx: (double) index of the constraint, from 1 to M (number of constraints)
    2. constraint.type: (string) the constraints can be of 3 different kinds:
       - variableCompare: for each instance, it will compare the values of two variables (with a relation <, > or =)
       - valueCompare: for each instance, it will compare a fixed value and a variable value (with a relation <, > or =)
       - instanceCompare: for every couple of instances, it will compare the relation between two variables, i.e. if the value of variable 1 in instance 1 is lower/higher than the value of variable 1 in instance 2, then the value of variable 2 in instance 1 can’t be lower/higher than the value of variable 2 in instance 2
    3. group.by: (boolean) if TRUE only one group of data (defined by a variable option) will be considered for the constraint
    4. group.variable: (string, only if group.by is TRUE) name of the variable (column in dataFrame) that will divide the data into groups
    5. group.option: (only if group.by is TRUE) option of the group.variable that defines the group to consider
    6. variable1: (string) first variable to compare (name of a column in dataFrame)
    7. relation: (string) can be <, > or = in the case of variableCompare and valueCompare, and <>, <<, >< or >> in the case of instanceCompare
    8. variable2: (string) second variable to compare (name of a column in dataFrame), or a fixed value in the case of valueCompare

Returns:

Matrix of 2 columns: the first column shows the indexes of dataFrame that are wrong; the second column shows the index of the denial constraint that is fulfilled. If there are no wrong instances to show (0 constraints fulfilled), then WrongInstances = matrix(0,1,2).

systemds.operator.algorithm.discoverFD(X: Matrix, Mask: Matrix, threshold: float)

Implements builtin for finding functional dependencies

Parameters:
  • X – Input Matrix X, encoded Matrix if data is categorical

  • Mask – A row vector for interested features i.e. Mask =[1, 0, 1] will exclude the second column from processing

  • threshold – threshold value in interval [0, 1] for robust FDs

Returns:

matrix of functional dependencies

systemds.operator.algorithm.dist(X: Matrix)

Returns Euclidean distance matrix (distances between N n-dimensional points)

Parameters:

X – Matrix to calculate the distance inside

Returns:

Euclidean distance matrix
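
A minimal sketch, following the conventions of the lm example above (random 5 x 3 demo points):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import dist

np.random.seed(0)
X = np.random.rand(5, 3)  # 5 points in 3 dimensions

with SystemDSContext() as sds:
  D = dist(sds.from_numpy(X)).compute()
  print(D)  # 5 x 5 matrix of pairwise Euclidean distances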

systemds.operator.algorithm.dmv(X: Frame, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The dmv-function is used to find disguised missing values utilising syntactical pattern recognition.

Parameters:
  • X – Input Frame

  • threshold – Threshold value in interval [0, 1] for dominant pattern per column (e.g., 0.8 means that 80% of the entries per column must adhere this pattern to be dominant)

  • replace – The string disguised missing values are replaced with

Returns:

Frame X including detected disguised missing values

systemds.operator.algorithm.ema(X: Frame, search_iterations: int, mode: str, freq: int, alpha: float, beta: float, gamma: float)

This function imputes values with exponential moving average (single, double or triple).

Parameters:
  • X – Frame that contains time series data that needs to be imputed

  • search_iterations – Budget iterations for parameter optimization, used if parameters weren’t set

  • mode – Type of EMA method. Either “single”, “double” or “triple”

  • freq – Seasonality when using triple EMA.

  • alpha – alpha- value for EMA

  • beta – beta- value for EMA

  • gamma – gamma- value for EMA

Returns:

Frame with EMA results

systemds.operator.algorithm.executePipeline(pipeline: Frame, Xtrain: Matrix, Ytrain: Matrix, Xtest: Matrix, Ytest: Matrix, metaList: List, hyperParameters: Matrix, flagsCount: int, verbose: bool, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function executes a pipeline.

Parameters:
  • logical

  • pipeline

  • X

  • Y

  • Xtest

  • Ytest

  • metaList

  • hyperParameters

  • hpForPruning

  • changesByOp

  • flagsCount

  • test

  • verbose

Returns:

Returns:

Returns:

Returns:

Returns:

Returns:

Returns:

systemds.operator.algorithm.ffPredict(model: List, X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function makes predictions given data and a trained feed-forward neural network model.

Parameters:
  • model – Trained feed-forward neural network model

  • X – Data used for making predictions

  • batch_size – Batch size

Returns:

Predicted value

systemds.operator.algorithm.ffTrain(X: Matrix, Y: Matrix, out_activation: str, loss_fcn: str, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function trains a simple feed-forward neural network. The architecture of the network is: affine1 -> relu -> dropout -> affine2 -> configurable output activation function. The hidden layer has 128 neurons. The dropout rate is 0.35. Input and output sizes are inferred from X and Y.

Parameters:
  • X – Training data

  • Y – Labels/Target values

  • batch_size – Batch size

  • epochs – Number of epochs

  • learning_rate – Learning rate

  • out_activation – User specified output activation function. Possible values: “sigmoid”, “relu”, “lrelu”, “tanh”, “softmax”, “logits” (no activation).

  • loss_fcn – User specified loss function. Possible values: “l1”, “l2”, “log_loss”, “logcosh_loss”, “cel” (cross-entropy loss).

  • shuffle – Flag which indicates if dataset should be shuffled or not

  • validation_split – Fraction of training set used as validation set

  • seed – Seed for model initialization

  • verbose – Flag which indicates if function should print to stdout

Returns:

Trained model which can be used in ffPredict
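
A hedged end-to-end sketch that trains a network with ffTrain and feeds the resulting model node directly into ffPredict (documented above); the random data, sigmoid output, l2 loss, and epoch count are arbitrary demo choices:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import ffTrain, ffPredict

np.random.seed(0)
X = np.random.rand(100, 10)
Y = np.random.rand(100, 1)  # targets in [0, 1] to match the sigmoid output

with SystemDSContext() as sds:
  model = ffTrain(sds.from_numpy(X), sds.from_numpy(Y),
                  out_activation="sigmoid", loss_fcn="l2",
                  epochs=5, verbose=False)
  preds = ffPredict(model, sds.from_numpy(X)).compute()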

systemds.operator.algorithm.fit_pipeline(trainData: Frame, testData: Frame, pip: Frame, applyFunc: Frame, hp: Matrix, evaluationFunc: str, evalFunHp: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script reads the dirty and clean data, applies the best pipeline to the dirty data, classifies both the cleaned and the original dataset, and checks whether the cleaned dataset performs the same as the original dataset in terms of classification accuracy.

Parameters:
  • trainData

  • testData

  • metaData

  • lp

  • pip

  • hp

  • evaluationFunc

  • evalFunHp

  • isLastLabel

  • correctTypos

Returns:

systemds.operator.algorithm.fixInvalidLengths(F1: Frame, mask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Fix invalid lengths

Parameters:
  • F1

  • mask

  • ql

  • qu

Returns:

Returns:

systemds.operator.algorithm.fixInvalidLengthsApply(X: Frame, mask: Matrix, qLow: Matrix, qUp: Matrix)

Fix invalid lengths

Parameters:
  • X

  • mask

  • ql

  • qu

Returns:

Returns:

systemds.operator.algorithm.frameSort(F: Frame, mask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Related to [SYSTEMDS-2662] (dependency function for cleaning pipelines): built-in for sorting frames.

Parameters:
  • F – Data frame of string values

  • mask – matrix for identifying string columns

Returns:

sorted dataset by column 1 in decreasing order

systemds.operator.algorithm.frequencyEncode(X: Matrix, mask: Matrix)

Computes a frequency encoding: categorical columns are converted to their frequency counts.

Parameters:
  • X – dataset x

  • mask – mask of the columns for frequency conversion

Returns:

categorical columns are replaced with their frequencies

Returns:

the frequency counts for the different categoricals

systemds.operator.algorithm.frequencyEncodeApply(X: Matrix, freqCount: Matrix)

Applies a frequency encoding to new data, given previously computed frequency counts.

Parameters:
  • X – dataset x

  • freqCount – the frequency counts for the different categoricals

Returns:

categorical columns are replaced with their frequencies given
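
A small sketch of frequencyEncode; the toy data and mask are arbitrary, and the two documented outputs are assumed to be unpackable from compute():

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import frequencyEncode

# first column is categorical (values 1..3), second column is numeric
X = np.array([[1, 0.5], [1, 0.1], [2, 0.7], [3, 0.3]])
mask = np.array([[1, 0]], dtype=np.float64)  # encode only the first column

with SystemDSContext() as sds:
  X_enc, counts = frequencyEncode(sds.from_numpy(X), sds.from_numpy(mask)).compute()
  print(X_enc)  # categorical values replaced by their frequencies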

systemds.operator.algorithm.garch(X: Matrix, kmax: int, momentum: float, start_stepsize: float, end_stepsize: float, start_vicinity: float, end_vicinity: float, sim_seed: int, verbose: bool)

This is a builtin function that implements GARCH(1,1), a statistical model used in analyzing time-series data where the variance error is believed to be serially autocorrelated

COMMENTS: This has some drawbacks, namely slow convergence of the optimization (a sort of simulated annealing / gradient descent). TODO: use BFGS or BHHH if available (these are the go-to methods). TODO: (only then) extend to garch(p,q); otherwise the search space is way too big for the current method.

Parameters:
  • X – The input Matrix to fit the GARCH(1,1) model on.

  • kmax – Number of iterations

  • momentum – Momentum for momentum-gradient descent (set to 0 to deactivate)

  • start_stepsize – Initial gradient-descent stepsize

  • end_stepsize – gradient-descent stepsize at end (linear descent)

  • start_vicinity – proportion of randomness of restart-location for gradient descent at beginning

  • end_vicinity – same at end (linear decay)

  • sim_seed – seed for simulation of process on fitted coefficients

  • verbose – verbosity, comments during fitting

Returns:

simulated garch(1,1) process on fitted coefficients

Returns:

variances of simulated fitted process

Returns:

Constant term of fitted process

Returns:

1-st arch-coefficient of fitted process

Returns:

1-st garch-coefficient of fitted process

systemds.operator.algorithm.gaussianClassifier(D: Matrix, C: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Computes the parameters needed for Gaussian Classification. Thus it computes the following per class: the prior probability, the inverse covariance matrix, the mean per feature and the determinant of the covariance matrix. Furthermore (if not explicitly defined), it adds some small smoothing value along the variances, to prevent numerical errors / instabilities.

Parameters:
  • D – Input matrix (training set)

  • C – Target vector

  • varSmoothing – Smoothing factor for variances

  • verbose – Print accuracy of the training set

Returns:

Vector storing the class prior probabilities

Returns:

Matrix storing the means of the classes

Returns:

List of inverse covariance matrices

Returns:

Vector storing the determinants of the classes

systemds.operator.algorithm.getAccuracy(y: Matrix, yhat: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function computes the weighted and simple accuracy for the given predictions.

Parameters:
  • y – Ground truth (Actual Labels)

  • yhat – Predictions (Predicted labels)

  • isWeighted – Flag for weighted or non-weighted accuracy calculation

Returns:

accuracy of the predicted labels
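
A minimal sketch with hand-crafted labels (3 of 4 predictions correct):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import getAccuracy

y = np.array([[1], [2], [1], [2]], dtype=np.float64)     # actual labels
yhat = np.array([[1], [2], [2], [2]], dtype=np.float64)  # predicted labels

with SystemDSContext() as sds:
  acc = getAccuracy(sds.from_numpy(y), sds.from_numpy(yhat)).compute()
  print(acc)  # accuracy of the predictions (3 of 4 correct here)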

systemds.operator.algorithm.glm(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script solves GLM regression using Newton/Fisher scoring with trust regions. The glm-function is a flexible generalization of ordinary linear regression that allows for response variables with error distributions other than a normal distribution.

In addition, some GLM statistics are provided as console output by setting verbose=TRUE, one comma-separated name-value pair per each line, as follows:

--------------------------------------------------------------------------------------------
TERMINATION_CODE      A positive integer indicating success/failure as follows:
                      1 = Converged successfully; 2 = Maximum number of iterations reached; 
                      3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
BETA_MIN              Smallest beta value (regression coefficient), excluding the intercept
BETA_MIN_INDEX        Column index for the smallest beta value
BETA_MAX              Largest beta value (regression coefficient), excluding the intercept
BETA_MAX_INDEX        Column index for the largest beta value
INTERCEPT             Intercept value, or NaN if there is no intercept (if icpt=0)
DISPERSION            Dispersion used to scale deviance, provided as "disp" input parameter
                      or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
DISPERSION_EST        Dispersion estimated from the dataset
DEVIANCE_UNSCALED     Deviance from the saturated model, assuming dispersion == 1.0
DEVIANCE_SCALED       Deviance from the saturated model, scaled by the DISPERSION value
--------------------------------------------------------------------------------------------

The Log file, when requested, contains the following per-iteration variables in CSV format,
each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:

--------------------------------------------------------------------------------------------
NUM_CG_ITERS          Number of inner (Conj.Gradient) iterations in this outer iteration
IS_TRUST_REACHED      1 = trust region boundary was reached, 0 = otherwise
POINT_STEP_NORM       L2-norm of iteration step from old point (i.e. "beta") to new point
OBJECTIVE             The loss function we minimize (i.e. negative partial log-likelihood)
OBJ_DROP_REAL         Reduction in the objective during this iteration, actual value
OBJ_DROP_PRED         Reduction in the objective predicted by a quadratic approximation
OBJ_DROP_RATIO        Actual-to-predicted reduction ratio, used to update the trust region
GRADIENT_NORM         L2-norm of the loss function gradient (NOTE: sometimes omitted)
LINEAR_TERM_MIN       The minimum value of X %*% beta, used to check for overflows
LINEAR_TERM_MAX       The maximum value of X %*% beta, used to check for overflows
IS_POINT_UPDATED      1 = new point accepted; 0 = new point rejected, old point restored
TRUST_DELTA           Updated trust region size, the "delta"
--------------------------------------------------------------------------------------------

SOME OF THE SUPPORTED GLM DISTRIBUTION FAMILIES AND LINK FUNCTIONS:

dfam vpow link lpow  Distribution.link   Canonical?
---------------------------------------------------
 1   0.0   1  -1.0   Gaussian.inverse
 1   0.0   1   0.0   Gaussian.log
 1   0.0   1   1.0   Gaussian.id          Yes
 1   1.0   1   0.0   Poisson.log          Yes
 1   1.0   1   0.5   Poisson.sqrt
 1   1.0   1   1.0   Poisson.id
 1   2.0   1  -1.0   Gamma.inverse        Yes
 1   2.0   1   0.0   Gamma.log
 1   2.0   1   1.0   Gamma.id
 1   3.0   1  -2.0   InvGaussian.1/mu^2   Yes
 1   3.0   1  -1.0   InvGaussian.inverse
 1   3.0   1   0.0   InvGaussian.log
 1   3.0   1   1.0   InvGaussian.id
 1    *    1    *    AnyVariance.AnyLink
---------------------------------------------------
 2    *    1   0.0   Binomial.log
 2    *    1   0.5   Binomial.sqrt
 2    *    2    *    Binomial.logit       Yes
 2    *    3    *    Binomial.probit
 2    *    4    *    Binomial.cloglog
 2    *    5    *    Binomial.cauchit
---------------------------------------------------
Parameters:
  • X – matrix X of feature vectors

  • Y – matrix Y with either 1 or 2 columns: if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)

  • dfam – Distribution family code: 1 = Power, 2 = Binomial

  • vpow – Power for Variance defined as (mean)^power (ignored if dfam != 1): 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian

  • link – Link function code: 0 = canonical (depends on distribution), 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit

  • lpow – Power for Link function defined as (mean)^power (ignored if link != 1): -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity

  • yneg – Response value for Bernoulli “No” label, usually 0.0 or -1.0

  • icpt – Intercept presence, X columns shifting and rescaling: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1

  • reg – Regularization parameter (lambda) for L2 regularization

  • tol – Tolerance (epsilon)

  • disp – (Over-)dispersion value, or 0.0 to estimate it from data

  • moi – Maximum number of outer (Newton / Fisher Scoring) iterations

  • mii – Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum

  • verbose – if the Algorithm should be verbose

Returns:

Matrix beta, whose size depends on icpt: icpt=0: ncol(X) x 1; icpt=1: (ncol(X) + 1) x 1; icpt=2: (ncol(X) + 1) x 2
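
A hedged sketch fitting a Gaussian GLM with identity link, using the codes from the parameter table above; the random demo data is arbitrary:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import glm

np.random.seed(0)
X = np.random.rand(100, 5)
Y = np.random.rand(100, 1)  # continuous response for the Gaussian family

with SystemDSContext() as sds:
  # dfam=1 (Power), vpow=0.0 (Gaussian), link=1 (Power), lpow=1.0 (identity)
  betas = glm(sds.from_numpy(X), sds.from_numpy(Y),
              dfam=1, vpow=0.0, link=1, lpow=1.0).compute()
  print(betas)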

systemds.operator.algorithm.glmPredict(X: Matrix, B: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Applies the estimated parameters of a GLM type regression to a new dataset

Additional statistics are printed one per each line, in the following

CSV format: NAME,[COLUMN],[SCALED],VALUE
---
NAME   is the string identifier for the statistic, see the table below.
COLUMN is an optional integer value that specifies the Y-column for per-column statistics;
       note that a Binomial/Multinomial one-column Y input is converted into multi-column.
SCALED is an optional Boolean value (TRUE or FALSE) that tells us whether or not the input
         dispersion parameter (disp) scaling has been applied to this statistic.
VALUE  is the value of the statistic.
---
NAME                  COLUMN  SCALED  MEANING
---------------------------------------------------------------------------------------------
LOGLHOOD_Z                      +     Log-Likelihood Z-score (in st.dev's from mean)
LOGLHOOD_Z_PVAL                 +     Log-Likelihood Z-score p-value
PEARSON_X2                      +     Pearson residual X^2 statistic
PEARSON_X2_BY_DF                +     Pearson X^2 divided by degrees of freedom
PEARSON_X2_PVAL                 +     Pearson X^2 p-value
DEVIANCE_G2                     +     Deviance from saturated model G^2 statistic
DEVIANCE_G2_BY_DF               +     Deviance G^2 divided by degrees of freedom
DEVIANCE_G2_PVAL                +     Deviance G^2 p-value
AVG_TOT_Y               +             Average of Y column for a single response value
STDEV_TOT_Y             +             St.Dev. of Y column for a single response value
AVG_RES_Y               +             Average of column residual, i.e. of Y - mean(Y|X)
STDEV_RES_Y             +             St.Dev. of column residual, i.e. of Y - mean(Y|X)
PRED_STDEV_RES          +       +     Model-predicted St.Dev. of column residual
R2                      +             R^2 of Y column residual with bias included
ADJUSTED_R2             +             Adjusted R^2 of Y column residual with bias included
R2_NOBIAS               +             R^2 of Y column residual with bias subtracted
ADJUSTED_R2_NOBIAS      +             Adjusted R^2 of Y column residual with bias subtracted
---------------------------------------------------------------------------------------------
Parameters:
  • X – Matrix X of records (feature vectors)

  • B – GLM regression parameters (the betas), with dimensions ncol(X) x k: do not add intercept ncol(X)+1 x k: add intercept as given by the last B-row if k > 1, use only B[, 1] unless it is Multinomial Logit (dfam=3)

  • ytest – Response matrix Y, with the following dimensions: nrow(X) x 1 : for all distributions (dfam=1 or 2 or 3) nrow(X) x 2 : for Binomial (dfam=2) given by (#pos, #neg) counts nrow(X) x k+1: for Multinomial (dfam=3) given by category counts

  • dfam – GLM distribution family: 1 = Power, 2 = Binomial, 3 = Multinomial Logit

  • vpow – Power for Variance defined as (mean)^power (ignored if dfam != 1): 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian

  • link – Link function code: 0 = canonical (depends on distribution), 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit; ignored if Multinomial

  • lpow – Power for Link function defined as (mean)^power (ignored if link != 1): -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity

  • disp – Dispersion value, when available

  • verbose – Print statistics to stdout

Returns:

Matrix M of predicted means/probabilities: nrow(X) x 1 : for Power-type distributions (dfam=1) nrow(X) x 2 : for Binomial distribution (dfam=2), column 2 is “No” nrow(X) x k+1: for Multinomial Logit (dfam=3), col# k+1 is baseline

systemds.operator.algorithm.gmm(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Gaussian Mixture Model training algorithm. There are four different types of covariance matrices i.e., VVV, EEE, VVI, VII and two initialization methods namely “kmeans” and “random”.

Parameters:
  • X – Dataset input to fit the GMM model

  • n_components – Number of components to use in the Gaussian mixture model

  • model – “VVV”: unequal variance (full), each component has its own general covariance matrix; “EEE”: equal variance (tied), all components share the same general covariance matrix; “VVI”: spherical, unequal volume (diag), each component has its own diagonal covariance matrix; “VII”: spherical, equal volume (spherical), each component has its own single variance

  • init_param – Initialization algorithm to use to initialize the gaussian weights, valid inputs are: “kmeans” or “random”

  • iterations – Number of iterations

  • reg_covar – Regularization parameter for covariance matrix

  • tol – Tolerance value for convergence

  • seed – The seed value to initialize the values for fitting the GMM.

Returns:

The predictions made by the gaussian model on the X input dataset

Returns:

Probability of the predictions given the X input dataset

Returns:

Number of estimated parameters

Returns:

Bayesian information criterion for best iteration

Returns:

Fitted clusters mean

Returns:

Fitted precision matrix for each mixture

Returns:

The weight matrix: A matrix whose [i,k]th entry is the probability that observation i in the test data belongs to the kth class

systemds.operator.algorithm.gmmPredict(X: Matrix, weight: Matrix, mu: Matrix, precisions_cholesky: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Prediction function for a Gaussian Mixture Model (gmm). Computes posterior probabilities for new instances given the variance and mean of the fitted data.

Parameters:
  • X – Dataset input to predict the labels from

  • weight – Weight of learned model: A matrix whose [i,k]th entry is the probability that observation i in the test data belongs to the kth class

  • mu – Fitted clusters mean

  • precisions_cholesky – Fitted precision matrix for each mixture

  • model – “VVV”: unequal variance (full), each component has its own general covariance matrix; “EEE”: equal variance (tied), all components share the same general covariance matrix; “VVI”: spherical, unequal volume (diag), each component has its own diagonal covariance matrix; “VII”: spherical, equal volume (spherical), each component has its own single variance

Returns:

The predictions made by the gaussian model on the X input dataset

Returns:

Probability of the predictions given the X input dataset

systemds.operator.algorithm.gnmf(X: Matrix, rnk: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The gnmf-function does Gaussian Non-Negative Matrix Factorization. In this, a matrix X is factorized into two matrices W and H, such that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect.

References: [Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang: Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. WWW 2010: 681-690]

Parameters:
  • X – Matrix of feature vectors.

  • rnk – Number of components into which matrix X is to be factored

  • eps – Tolerance

  • maxi – Maximum number of conjugate gradient iterations

Returns:

List of pattern matrices, one for each repetition

Returns:

List of amplitude matrices, one for each repetition

systemds.operator.algorithm.gridSearch(X: Matrix, y: Matrix, train: str, predict: str, params: List, paramValues: List, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The gridSearch-function is used to find the optimal hyper-parameters of a model which results in the most accurate predictions. This function takes train and eval functions by name.

Parameters:
  • X – Input feature matrix

  • y – Input Matrix of vectors.

  • train – Name ft of the train function to call via ft(trainArgs)

  • predict – Name fp of the loss function to call via fp((predictArgs,B))

  • numB – Maximum number of parameters in model B (pass the max because the size may vary with parameters like icpt or multi-class classification)

  • params – List of varied hyper-parameter names

  • dataArgs – List of data parameters (to identify data parameters by name i.e. list(“X”, “Y”))

  • paramValues – List of matrices providing the parameter values as column vectors for position-aligned hyper-parameters in ‘params’

  • trainArgs – named List of arguments to pass to the ‘train’ function, where gridSearch replaces enumerated hyper-parameter by name, if not provided or an empty list, the lm parameters are used

  • predictArgs – List of arguments to pass to the ‘predict’ function, where gridSearch appends the trained models at the end, if not provided or an empty list, list(X, y) is used instead

  • cv – flag enabling k-fold cross validation, otherwise training loss

  • cvk – If cv=TRUE, specifies the number of folds, otherwise ignored

  • verbose – flag for verbose debug output

Returns:

Matrix[Double]: the trained model with minimal loss (by the ‘predict’ function); multi-column models are returned as a column-major linearized column vector

Returns:

one-row frame w/ optimal hyper-parameters (by ‘params’ position)

systemds.operator.algorithm.hospitalResidencyMatch(R: Matrix, H: Matrix, capacity: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script computes a solution for the hospital residency match problem.

Residents.mtx:
2.0,1.0,3.0
1.0,2.0,3.0
1.0,2.0,0.0

Since it is an ORDERED matrix, this means that Resident 1 (row 1) likes hospital 2 the most, followed by hospital 1 and hospital 3. If it was UNORDERED, this would mean that resident 1 (row 1) likes hospital 3 the most (since the value at [1,3] is the row max), followed by hospital 1 (2.0 preference value) and hospital 2 (1.0 preference value).

Hospitals.mtx:
2.0,1.0,0.0
0.0,1.0,2.0
1.0,2.0,0.0

Since it is an UNORDERED matrix this means that Hospital 1 (row 1) likes Resident 1 the most (since the value at [1,1] is the row max).

capacity.mtx:
1.0
1.0
1.0

residencyMatch.mtx:
2.0,0.0,0.0
1.0,0.0,0.0
0.0,2.0,0.0

hospitalMatch.mtx:
0.0,1.0,0.0
0.0,0.0,2.0
1.0,0.0,0.0

Resident 1 has matched with Hospital 3 (since [1,3] is non-zero) at a preference level of 2.0. Resident 2 has matched with Hospital 1 (since [2,1] is non-zero) at a preference level of 1.0. Resident 3 has matched with Hospital 2 (since [3,2] is non-zero) at a preference level of 2.0.

Parameters:
  • R – Residents matrix R. It must be an ORDERED matrix.

  • H – Hospitals matrix H. It must be an UNORDERED matrix.

  • capacity – Capacity of hospitals, matrix C. It must be an [n x 1] matrix with non-zero values.

  • verbose – If the operation is verbose

Returns:

Result Matrix If cell [i,j] is non-zero, it means that Resident i has matched with Hospital j. Further, if cell [i,j] is non-zero, it holds the preference value that led to the match.

Returns:

Result Matrix If cell [i,j] is non-zero, it means that Resident i has matched with Hospital j. Further, if cell [i,j] is non-zero, it holds the preference value that led to the match.

systemds.operator.algorithm.hyperband(X_train: Matrix, y_train: Matrix, X_val: Matrix, y_val: Matrix, params: List, paramRanges: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The hyperband-function is used for hyper-parameter optimization and is based on multi-armed bandits and early elimination. Through multiple parallel brackets and consecutive trials, it will return the hyper-parameter combination which performed best on a validation dataset. A set of hyper-parameter combinations is drawn from uniform distributions with given ranges; those make up the candidates for hyperband. Notes: hyperband is hard-coded for lmCG and uses lmPredict for validation; hyperband is hard-coded to use the number of iterations as a resource; hyperband can only optimize continuous hyperparameters.

Parameters:
  • X_train – Input Matrix of training vectors

  • y_train – Labels for training vectors

  • X_val – Input Matrix of validation vectors

  • y_val – Labels for validation vectors

  • params – List of parameters to optimize

  • paramRanges – The min and max values for the uniform distributions to draw from. One row per hyper parameter, first column specifies min, second column max value.

  • R – Controls number of candidates evaluated

  • eta – Determines fraction of candidates to keep after each trial

  • verbose – If TRUE print messages are activated

Returns:

1-column matrix of weights of best performing candidate

Returns:

hyper parameters of best performing candidate

systemds.operator.algorithm.img_brightness(img_in: Matrix, value: float, channel_max: int)

The img_brightness-function is an image data augmentation function. It changes the brightness of the image.

Parameters:
  • img_in – Input matrix/image

  • value – The amount of brightness to be changed for the image

  • channel_max – Maximum value of the brightness of the image

Returns:

Output matrix/image
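
A minimal sketch on a random grayscale image (the image size and value range here are arbitrary assumptions):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import img_brightness

np.random.seed(0)
# a random 32 x 32 grayscale image with pixel values in [0, 255]
img = np.random.rand(32, 32) * 255

with SystemDSContext() as sds:
  # increase brightness by 64, capping pixel values at 255
  brighter = img_brightness(sds.from_numpy(img), 64.0, 255).compute()
  print(brighter.max())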

systemds.operator.algorithm.img_crop(img_in: Matrix, w: int, h: int, x_offset: int, y_offset: int)

The img_crop-function is an image data augmentation function. It cuts out a subregion of an image.

Parameters:
  • img_in – Input matrix/image

  • w – The width of the subregion required

  • h – The height of the subregion required

  • x_offset – The horizontal coordinate in the image to begin the crop operation

  • y_offset – The vertical coordinate in the image to begin the crop operation

Returns:

Cropped matrix/image

systemds.operator.algorithm.img_cutout(img_in: Matrix, x: int, y: int, width: int, height: int, fill_value: float)

Image Cutout function replaces a rectangular section of an image with a constant value.

Parameters:
  • img_in – Input image as 2D matrix with top left corner at [1, 1]

  • x – Column index of the top left corner of the rectangle (starting at 1)

  • y – Row index of the top left corner of the rectangle (starting at 1)

  • width – Width of the rectangle (must be positive)

  • height – Height of the rectangle (must be positive)

  • fill_value – The value to set for the rectangle

Returns:

Output image as 2D matrix with top left corner at [1, 1]

systemds.operator.algorithm.img_invert(img_in: Matrix, max_value: float)

This is an image data augmentation function. It inverts an image.

Parameters:
  • img_in – Input image

  • max_value – The maximum value pixels can have

Returns:

Output image

systemds.operator.algorithm.img_mirror(img_in: Matrix, horizontal_axis: bool)

This function is an image data augmentation function. It flips an image on the X (horizontal) or Y (vertical) axis.

Parameters:
  • img_in – Input matrix/image

  • horizontal_axis – If TRUE, the image is flipped over the horizontal (X) axis, otherwise over the vertical (Y) axis

Returns:

Flipped matrix/image

systemds.operator.algorithm.img_posterize(img_in: Matrix, bits: int)

The Image Posterize function limits pixel values to 2^bits different values in the range [0, 255]. Assumes the input image can attain values in the range [0, 255].

Parameters:
  • img_in – Input image

  • bits – The number of bits to keep per value: 1 means black and white, 8 means every integer between 0 and 255.

Returns:

Output image

systemds.operator.algorithm.img_rotate(img_in: Matrix, radians: float, fill_value: float)

The Image Rotate function rotates the input image counter-clockwise around the center. Uses nearest neighbor sampling.

Parameters:
  • img_in – Input image as 2D matrix with top left corner at [1, 1]

  • radians – The angle to rotate by, in radians.

  • fill_value – The background color revealed by the rotation

Returns:

Output image as 2D matrix with top left corner at [1, 1]

systemds.operator.algorithm.img_sample_pairing(img_in1: Matrix, img_in2: Matrix, weight: float)

The image sample pairing function blends two images together.

Parameters:
  • img_in1 – First input image

  • img_in2 – Second input image

  • weight – The weight given to the second image. 0 means only img_in1, 1 means only img_in2 will be visible

Returns:

Output image

systemds.operator.algorithm.img_shear(img_in: Matrix, shear_x: float, shear_y: float, fill_value: float)

This function applies a shearing transformation to an image. Uses nearest neighbor sampling.

Parameters:
  • img_in – Input image as 2D matrix with top left corner at [1, 1]

  • shear_x – Shearing factor for horizontal shearing

  • shear_y – Shearing factor for vertical shearing

  • fill_value – The background color revealed by the shearing

Returns:

Output image as 2D matrix with top left corner at [1, 1]

systemds.operator.algorithm.img_transform(img_in: Matrix, out_w: int, out_h: int, a: float, b: float, c: float, d: float, e: float, f: float, fill_value: float)

The Image Transform function applies an affine transformation to an image. Optionally resizes the image (without scaling). Uses nearest neighbor sampling.

Parameters:
  • img_in – Input image as 2D matrix with top left corner at [1, 1]

  • out_w – Width of the output image

  • out_h – Height of the output image

  • a,b,c,d,e,f – The first two rows of the affine matrix in row-major order

  • fill_value – The background of the image

Returns:

Output image as 2D matrix with top left corner at [1, 1]

systemds.operator.algorithm.img_translate(img_in: Matrix, offset_x: float, offset_y: float, out_w: int, out_h: int, fill_value: float)

The Image Translate function translates the image. Optionally resizes the image (without scaling). Uses nearest neighbor sampling.

Parameters:
  • img_in – Input image as 2D matrix with top left corner at [1, 1]

  • offset_x – The distance to move the image in x direction

  • offset_y – The distance to move the image in y direction

  • out_w – Width of the output image

  • out_h – Height of the output image

  • fill_value – The background of the image

Returns:

Output image as 2D matrix with top left corner at [1, 1]

systemds.operator.algorithm.impurityMeasures(X: Matrix, Y: Matrix, R: Matrix, method: str, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function computes the measure of impurity for the given dataset based on the passed method (gini or entropy). The current version expects the target vector to contain only 0 or 1 values.

Parameters:
  • X – Feature matrix.

  • Y – Target vector containing 0 and 1 values.

  • R – Vector indicating whether a feature is categorical or continuous. 1 denotes a continuous feature, 2 denotes a categorical feature.

  • n_bins – Number of bins for binning in case of scale features.

  • method – String indicating the method to use; either “entropy” or “gini”.

Returns:

(1 x ncol(X)) row vector containing information/gini gain for each feature of the dataset. In case of gini, the values denote the gini gains, i.e. how much impurity was removed with the respective split. The higher the value, the better the split. In case of entropy, the values denote the information gain, i.e. how much entropy was removed. The higher the information gain, the better the split.

systemds.operator.algorithm.imputeByFD(X: Matrix, Y: Matrix, threshold: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Implements builtin for imputing missing values from observed values (if they exist) using robust functional dependencies

Parameters:
  • X – Vector X, source attribute of functional dependency

  • Y – Vector Y, target attribute of functional dependency and imputation

  • threshold – threshold value in interval [0, 1] for robust FDs

  • verbose – flag for printing verbose debug output

Returns:

Vector Y, with missing values mapped to a new max value

Returns:

Vector Y, with imputed missing values

systemds.operator.algorithm.imputeByFDApply(X: Matrix, Y_imp: Matrix)

Implements builtin for imputing missing values from observed values (if they exist) using robust functional dependencies

Parameters:
  • X – Matrix X

  • Y_imp – Matrix of imputed values, as computed by imputeByFD for the target attribute

Returns:

Matrix with possible imputations

systemds.operator.algorithm.imputeByMean(X: Matrix, mask: Matrix)

Imputes the data by mean value, and if the feature is categorical, by mode value. Related to [SYSTEMDS-2662], a dependency function for cleaning pipelines.

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • mask – A 0/1 row vector for identifying numeric (0) and categorical features (1)

Returns:

imputed dataset
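
A minimal sketch, assuming missing values are encoded as NaN and, per the single documented output, that the imputed dataset is returned directly:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import imputeByMean

# two numeric columns with missing entries (NaN assumed to encode missing)
X = np.array([[1.0, 4.0], [np.nan, 2.0], [3.0, np.nan]])
mask = np.array([[0.0, 0.0]])  # both columns numeric

with SystemDSContext() as sds:
  imputed = imputeByMean(sds.from_numpy(X), sds.from_numpy(mask)).compute()
  print(imputed)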

systemds.operator.algorithm.imputeByMeanApply(X: Matrix, imputedVec: Matrix)

Imputes the data by mean value, and if the feature is categorical, by mode value. Related to [SYSTEMDS-2662], a dependency function for cleaning pipelines.

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • imputedVec – column mean vector

Returns:

imputed dataset

systemds.operator.algorithm.imputeByMedian(X: Matrix, mask: Matrix)

Imputes the data by median value, and if the feature is categorical, by mode value. Related to [SYSTEMDS-2662], a dependency function for cleaning pipelines.

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • mask – A 0/1 row vector for identifying numeric (0) and categorical features (1)

Returns:

imputed dataset

systemds.operator.algorithm.imputeByMedianApply(X: Matrix, imputedVec: Matrix)

Imputes the data by median value, and if the feature is categorical, by mode value. Related to [SYSTEMDS-2662], a dependency function for cleaning pipelines.

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • imputedVec – column median vector

Returns:

imputed dataset

systemds.operator.algorithm.imputeByMode(X: Matrix)

This function imputes the data by mode value. Related to [SYSTEMDS-2902], a dependency function for cleaning pipelines.

Parameters:

X – Data Matrix (Recoded Matrix for categorical features)

Returns:

imputed dataset

systemds.operator.algorithm.imputeByModeApply(X: Matrix, imputedVec: Matrix)

Imputes the data by the most frequent value (recoded data only). Related to [SYSTEMDS-2662], a dependency function for cleaning pipelines.

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • imputedVec – column mode vector

Returns:

imputed dataset

systemds.operator.algorithm.intersect(X: Matrix, Y: Matrix)

Implements set intersection for numeric data

Parameters:
  • X – matrix X, set A

  • Y – matrix Y, set B

Returns:

intersection matrix, set of intersecting items
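
A minimal sketch on two small hand-made column vectors:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import intersect

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([[3.0], [4.0], [5.0]])

with SystemDSContext() as sds:
  common = intersect(sds.from_numpy(X), sds.from_numpy(Y)).compute()
  print(common)  # expected: the values 3 and 4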

systemds.operator.algorithm.km(X: Matrix, TE: Matrix, GI: Matrix, SI: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function that implements the analysis of survival data with KAPLAN-MEIER estimates

Parameters:
  • X – Input matrix X containing the survival data: timestamps, whether event occurred (1) or data is censored (0), and a number of factors (categorical features) for grouping and/or stratifying

  • TE – Column indices of X which contain timestamps (first entry) and event information (second entry)

  • GI – Column indices of X corresponding to the factors to be used for grouping

  • SI – Column indices of X corresponding to the factors to be used for stratifying

  • alpha – Parameter to compute 100*(1-alpha)% confidence intervals for the survivor function and its median

  • err_type – Parameter to specify the error type: either “greenwood” (the default) or “peto”

  • conf_type – Parameter to modify the confidence interval; “plain” keeps the lower and upper bound of the confidence interval unmodified, “log” (the default) corresponds to logistic transformation and “log-log” corresponds to the complementary log-log transformation

  • test_type – If survival data for multiple groups is available, specifies which test to perform for comparing survival data across the groups: “none” (the default), “log-rank”, or “wilcoxon”

Returns:

Matrix KM whose dimension depends on the number of groups (denoted by g) and strata (denoted by s) in the data: each collection of 7 consecutive columns in KM corresponds to a unique combination of groups and strata in the data with the following schema:
  1. col: timestamp
  2. col: no. at risk
  3. col: no. of events
  4. col: Kaplan-Meier estimate of the survivor function surv
  5. col: standard error of surv
  6. col: lower 100*(1-alpha)% confidence interval for surv
  7. col: upper 100*(1-alpha)% confidence interval for surv

Returns:

Matrix M whose dimension depends on the number of groups (g) and strata (s) in the data (k denotes the number of factors used for grouping, i.e., ncol(GI), and l denotes the number of factors used for stratifying, i.e., ncol(SI)):
  M[,1:k]: unique combination of values in the k factors used for grouping
  M[,(k+1):(k+l)]: unique combination of values in the l factors used for stratifying
  M[,k+l+1]: total number of records
  M[,k+l+2]: total number of events
  M[,k+l+3]: median of surv
  M[,k+l+4]: lower 100*(1-alpha)% confidence interval of the median of surv
  M[,k+l+5]: upper 100*(1-alpha)% confidence interval of the median of surv
If the number of groups and strata is equal to 1, M will have 4 columns with
  M[,1]: total number of events
  M[,2]: median of surv
  M[,3]: lower 100*(1-alpha)% confidence interval of the median of surv
  M[,4]: upper 100*(1-alpha)% confidence interval of the median of surv

Returns:

If survival data from multiple groups is available and test_type=log-rank or wilcoxon, a 1 x 4 matrix T and a g x 5 matrix T_GROUPS_OE with
  T_GROUPS_OE[,1] = no. of events
  T_GROUPS_OE[,2] = observed value (O)
  T_GROUPS_OE[,3] = expected value (E)
  T_GROUPS_OE[,4] = (O-E)^2/E
  T_GROUPS_OE[,5] = (O-E)^2/V
  T[1,1] = no. of groups
  T[1,2] = degrees of freedom for the Chi-squared distributed test statistic
  T[1,3] = test statistic
  T[1,4] = P-value

systemds.operator.algorithm.kmeans(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function that implements the k-Means clustering algorithm

Parameters:
  • X – The input Matrix to do KMeans on.

  • k – Number of centroids

  • runs – Number of runs (with different initial centroids)

  • max_iter – Maximum number of iterations per run

  • eps – Tolerance (epsilon) for WCSS change ratio

  • is_verbose – Flag to print per-iteration stats

  • avg_sample_size_per_centroid – Average number of records per centroid in data samples

  • seed – The seed used for initial sampling. If set to -1 random seeds are selected.

Returns:

The mapping of records to centroids

Returns:

The output matrix with the centroids

systemds.operator.algorithm.kmeansPredict(X: Matrix, C: Matrix)

Builtin function that does predictions based on a set of centroids provided.

Parameters:
  • X – The input Matrix to do KMeans on.

  • C – The input Centroids to map X onto.

Returns:

The mapping of records to centroids
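
A minimal sketch of clustering and reassignment (random data; assuming the two kmeans outputs compute into a list in the documented order, mapping first and centroids second):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import kmeans, kmeansPredict

np.random.seed(0)
X = np.random.rand(100, 2)

with SystemDSContext() as sds:
  [mapping, centroids] = kmeans(sds.from_numpy(X), k=3, seed=42).compute()
  # map the same records onto the learned centroids
  labels = kmeansPredict(sds.from_numpy(X), sds.from_numpy(centroids)).compute()
  print(labels[:5])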

systemds.operator.algorithm.knn(Train: Matrix, Test: Matrix, CL: Matrix, START_SELECTED: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script implements the KNN (k-nearest neighbor) algorithm.

Parameters:
  • Train – The input matrix as features

  • Test – The input matrix for nearest neighbor search

  • CL – The input matrix as target

  • CL_T – The target type of matrix CL whether columns in CL are continuous ( =1 ) or categorical ( =2 ) or not specified ( =0 )

  • trans_continuous – Option flag for continuous feature transformed to [-1,1]: FALSE = do not transform continuous variable; TRUE = transform continuous variable;

  • k_value – k value for KNN, ignored if select_k is enabled

  • select_k – Use k selection algorithm to estimate k (TRUE means yes)

  • k_min – Minimum k value (available if select_k = 1)

  • k_max – Maximum k value (available if select_k = 1)

  • select_feature – Use feature selection algorithm to select feature (TRUE means yes)

  • feature_max – Max feature selection

  • interval – Interval value for k selection (available if select_k = 1)

  • feature_importance – Use feature importance algorithm to estimate each feature (TRUE means yes)

  • predict_con_tg – Continuous target predict function: mean(=0) or median(=1)

  • START_SELECTED – feature selection initial value

Returns:

Applied clusters to X

Returns:

Cluster matrix

Returns:

Feature importance value

systemds.operator.algorithm.knnGraph(X: Matrix, k: int)

Builtin for k nearest neighbor graph construction

Parameters:
  • X

  • k

Returns:

systemds.operator.algorithm.knnbf(X: Matrix, T: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script implements the KNN (k-nearest neighbor) algorithm.

Parameters:
  • X

  • T

  • k_value

Returns:

systemds.operator.algorithm.l2svm(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function implements a binary-class Support Vector Machine (SVM) with squared slack variables (L2 regularization).

Parameters:
  • X – Feature matrix X (shape: m x n)

  • Y – Label vector y of class labels (shape: m x 1), assumed binary in -1/+1 or 1/2 encoding.

  • intercept – Indicator if a bias column should be added to X and the model

  • epsilon – Tolerance for early termination if the reduction of objective function is less than epsilon times the initial objective

  • reg – Regularization parameter (lambda) for L2 regularization

  • maxIterations – Maximum number of conjugate gradient (outer) iterations

  • maxii – Maximum number of line search (inner) iterations

  • verbose – Indicator if training details should be printed

  • columnId – An optional class ID used in verbose print output, e.g., when L2SVM is called from MSVM.

Returns:

Trained model/weights (shape: n x 1, w/ intercept: n+1)

systemds.operator.algorithm.l2svmPredict(X: Matrix, W: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function that applies a trained binary-class SVM model (squared slack variables) to classify feature vectors.

Parameters:
  • X – matrix X of feature vectors to classify

  • W – matrix of the trained variables

  • verbose – Set to true if one wants print statements.

Returns:

Raw classification scores, i.e., not yet thresholded into clean labels of 1’s and -1’s

Returns:

Classification labels, thresholded to ones and zeros.
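
A minimal train-and-predict sketch (random data with -1/+1 labels; passing the uncomputed model node into l2svmPredict relies on the API's lazy evaluation, and the two predict outputs are assumed to compute into a list in the documented order):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import l2svm, l2svmPredict

np.random.seed(0)
X = np.random.rand(100, 5)
y = np.random.choice([-1.0, 1.0], size=(100, 1))  # -1/+1 label encoding

with SystemDSContext() as sds:
  # the model node can be passed on lazily without computing it first
  model = l2svm(sds.from_numpy(X), sds.from_numpy(y))
  [raw, labels] = l2svmPredict(sds.from_numpy(X), model).compute()
  print(labels[:5])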

systemds.operator.algorithm.lasso(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function for the SpaRSA algorithm to perform lasso regression (SpaRSA: Sparse Reconstruction by Separable Approximation)

Parameters:
  • X – input feature matrix

  • y – matrix Y of response values (columns of the design matrix)

  • tol – target convergence tolerance

  • M – history length

  • tau – regularization component

  • maxi – maximum number of iterations until convergence

  • verbose – if the builtin should be verbose

Returns:

model matrix

systemds.operator.algorithm.lenetPredict(model: List, X: Matrix, C: int, Hin: int, Win: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function makes prediction given data and trained LeNet model

Parameters:
  • model – Trained LeNet model

  • X – Input data matrix, of shape (N, C*Hin*Win)

  • C – Number of input channels

  • Hin – Input height

  • Win – Input width

  • batch_size – Batch size

Returns:

Predicted values

systemds.operator.algorithm.lenetTrain(X: Matrix, Y: Matrix, X_val: Matrix, Y_val: Matrix, C: int, Hin: int, Win: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function trains a LeNet CNN. The architecture of the network is: conv1 -> relu1 -> pool1 -> conv2 -> relu2 -> pool2 -> affine3 -> relu3 -> affine4 -> softmax

Parameters:
  • X – Input data matrix, of shape (N, C*Hin*Win)

  • Y – Target matrix, of shape (N, K)

  • X_val – Validation data matrix, of shape (N, C*Hin*Win)

  • Y_val – Validation target matrix, of shape (N, K)

  • C – Number of input channels (dimensionality of input depth)

  • Hin – Input height

  • Win – Input width

  • batch_size – Batch size

  • epochs – Number of epochs

  • lr – Learning rate

  • mu – Momentum value

  • decay – Learning rate decay

  • reg – Regularization strength

  • seed – Seed for model initialization

  • verbose – Flag indicates if function should print to stdout

Returns:

Trained model which can be used in lenetPredict

systemds.operator.algorithm.lm(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The lm-function solves linear regression using either the direct solve method or the conjugate gradient algorithm depending on the input size of the matrices (See lmDS-function and lmCG-function respectively).

Parameters:
  • X – Matrix of feature vectors.

  • y – 1-column matrix of response values.

  • icpt – Intercept presence, shifting and rescaling the columns of X

  • reg – Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent/sparse/numerous features

  • tol – Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm

  • maxi – Maximum number of conjugate gradient iterations. 0 = no maximum

  • verbose – If TRUE print messages are activated

Returns:

The model fit

systemds.operator.algorithm.lmCG(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The lmCG function solves linear regression using the conjugate gradient algorithm

Parameters:
  • X – Matrix of feature vectors.

  • y – 1-column matrix of response values.

  • icpt – Intercept presence, shifting and rescaling the columns of X

  • reg – Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent/sparse/numerous features

  • tol – Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm

  • maxi – Maximum number of conjugate gradient iterations. 0 = no maximum

  • verbose – If TRUE print messages are activated

Returns:

The model fit

systemds.operator.algorithm.lmDS(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The lmDS function solves linear regression using the direct solve method

Parameters:
  • X – Matrix of feature vectors.

  • y – 1-column matrix of response values.

  • icpt – Intercept presence, shifting and rescaling the columns of X

  • reg – Regularization constant (lambda) for L2-regularization; set to nonzero for highly dependent/sparse/numerous features

  • tol – Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm

  • maxi – Maximum number of conjugate gradient iterations. 0 = no maximum

  • verbose – If TRUE print messages are activated

Returns:

The model fit

systemds.operator.algorithm.lmPredict(X: Matrix, B: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The lmPredict-function predicts the response values for a matrix of feature vectors, given a trained regression model.

Parameters:
  • X – Matrix of feature vectors

  • B – 1-column matrix of weights.

  • ytest – test labels, used only for verbose output. Can be set to matrix(0,1,1) if verbose output is not wanted

  • icpt – Intercept presence, shifting and rescaling the columns of X

  • verbose – If TRUE print messages are activated

Returns:

1-column matrix of predicted values

systemds.operator.algorithm.logSumExp(M: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Built-in LOGSUMEXP

Parameters:
  • M – matrix to perform Log sum exp on.

  • margin – if the logsumexp of rows is required, set margin = “row”; if the logsumexp of columns is required, set margin = “col”; if set to “none”, a single scalar is returned computing the logsumexp of the whole matrix

Returns:

a 1x1 matrix, row vector, or column vector, depending on the margin value
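
A minimal sketch comparing the row-wise result against a direct numpy computation (random data; the shape of the returned vector may differ from numpy's 1-D result):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import logSumExp

np.random.seed(0)
M = np.random.rand(4, 3)

with SystemDSContext() as sds:
  row_lse = logSumExp(sds.from_numpy(M), margin="row").compute()
  print(row_lse)
  # sanity check against a direct numpy computation
  print(np.log(np.exp(M).sum(axis=1)))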

systemds.operator.algorithm.matrixProfile(ts: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function that computes the MatrixProfile of a time series efficiently using the SCRIMP++ algorithm.

References:
Yan Zhu et al.. 2018.
  Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds.
  2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 837-846.
  DOI: 10.1109/ICDM.2018.00099.
  https://www.cs.ucr.edu/~eamonn/SCRIMP_ICDM_camera_ready_updated.pdf
Parameters:
  • ts – Time series to profile

  • window_size – Sliding window size

  • sample_percent – Degree of approximation between zero and one (1 computes the exact solution)

  • is_verbose – Print debug information

Returns:

The computed matrix profile

Returns:

Indices of least distances

systemds.operator.algorithm.mcc(predictions: Matrix, labels: Matrix)

Built-in function mcc: Matthews’ Correlation Coefficient for binary classification evaluation

Parameters:
  • predictions – Vector of predicted 0/1 values. (requires setting ‘labels’ parameter)

  • labels – Vector of 0/1 labels.

Returns:

Matthews’ Correlation Coefficient
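
A minimal sketch on a small, hand-made set of 0/1 predictions and labels:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import mcc

predictions = np.array([[1.0], [0.0], [1.0], [1.0], [0.0]])
labels = np.array([[1.0], [0.0], [0.0], [1.0], [0.0]])

with SystemDSContext() as sds:
  score = mcc(sds.from_numpy(predictions), sds.from_numpy(labels)).compute()
  print(score)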

systemds.operator.algorithm.mdedup(X: Frame, LHSfeatures: Matrix, LHSthreshold: Matrix, RHSfeatures: Matrix, RHSthreshold: Matrix, verbose: bool)

Implements builtin for deduplication using matching dependencies (e.g. Street 0.95, City 0.90 -> ZIP 1.0) and Jaccard distance.

Parameters:
  • X – Input Frame X

  • LHSfeatures – A matrix 1xd with numbers of columns for MDs (e.g. Street 0.95, City 0.90 -> ZIP 1.0)

  • LHSthreshold – A matrix 1xd with threshold values in interval [0, 1] for MDs

  • RHSfeatures – A matrix 1xd with numbers of columns for MDs

  • RHSthreshold – A matrix 1xd with threshold values in interval [0, 1] for MDs

  • verbose – To print the output

Returns:

Matrix nx1 of duplicates

systemds.operator.algorithm.mice(X: Matrix, cMask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This Builtin function implements multiple imputation using Chained Equations (MICE)

Assumptions: missing values are represented as empty strings, i.e., “,,” in a CSV file; variables with suffix n store continuous/numeric data and variables with suffix c store categorical data

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • cMask – A 0/1 row vector for identifying numeric (0) and categorical features (1)

  • iter – Number of iteration for multiple imputations

  • threshold – confidence value [0, 1] for robust imputation, values will only be imputed if the predicted value has probability greater than threshold, only applicable for categorical data

  • verbose – Boolean value.

Returns:

imputed dataset

systemds.operator.algorithm.miceApply(X: Matrix, meta: Matrix, threshold: float, dM: Frame, betaList: List)

This Builtin function implements multiple imputation using Chained Equations (MICE)

Assumptions: missing values are represented as empty strings, i.e., “,,” in a CSV file; variables with suffix n store continuous/numeric data and variables with suffix c store categorical data

Parameters:
  • X – Data Matrix (Recoded Matrix for categorical features)

  • meta – A meta matrix where each row stores: 1) the mask of the original matrix, 2) information about columns with missing values in the original data (0 for no missing value in the column, 1 otherwise), 3) the distinct values of each column in the original data (1 for continuous columns, colMax for categorical)

  • threshold – confidence value [0, 1] for robust imputation, values will only be imputed if the predicted value has probability greater than threshold, only applicable for categorical data

  • dM – meta frame from OHE on original data

  • betaList – List of machine learning models trained for each column imputation

  • verbose – Boolean value.

Returns:

imputed dataset

systemds.operator.algorithm.msvm(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function implements a multi-class Support Vector Machine (SVM) with squared slack variables. The trained model comprises #classes one-against-the-rest binary-class l2svm classification models.

Parameters:
  • X – Feature matrix X (shape: m x n)

  • Y – Label vector y of class labels (shape: m x 1), where max(Y) is assumed to be the number of classes

  • intercept – Indicator if a bias column should be added to X and the model

  • epsilon – Tolerance for early termination if the reduction of objective function is less than epsilon times the initial objective

  • reg – Regularization parameter (lambda) for L2 regularization

  • maxIterations – Maximum number of conjugate gradient (outer l2svm) iterations

  • verbose – Indicator if training details should be printed

Returns:

Trained model/weights (shape: n x max(Y), w/ intercept: n+1)

systemds.operator.algorithm.msvmPredict(X: Matrix, W: Matrix)

This script helps in applying a trained MSVM model

Parameters:
  • X – matrix X of feature vectors to classify

  • W – matrix of the trained variables

Returns:

Raw classification scores, i.e., not yet thresholded into clean labels of 1’s and -1’s

Returns:

Classification labels, thresholded to ones and zeros.

systemds.operator.algorithm.multiLogReg(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Solves Multinomial Logistic Regression using Trust Region method. (See: Trust Region Newton Method for Logistic Regression, Lin, Weng and Keerthi, JMLR 9 (2008) 627-650) The largest label represents the baseline category; if label -1 or 0 is present, then it is the baseline label (and it is converted to the largest label).

Parameters:
  • X – Location to read the matrix of feature vectors

  • Y – Location to read the matrix with category labels

  • icpt – Intercept presence, shifting and rescaling X columns: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1

  • tol – tolerance (“epsilon”)

  • reg – regularization parameter (lambda = 1/C); intercept is not regularized

  • maxi – max. number of outer (Newton) iterations

  • maxii – max. number of inner (conjugate gradient) iterations, 0 = no max

  • verbose – flag specifying if logging information should be printed

Returns:

regression betas as output for prediction

systemds.operator.algorithm.multiLogRegPredict(X: Matrix, B: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script applies the estimated parameters of a multinomial logistic regression to a new (test) dataset, producing a matrix M of predicted means/probabilities and some statistics in CSV format (see below).

Parameters:
  • X – Data Matrix X

  • B – Regression parameters betas

  • Y – Response vector Y

  • verbose – flag specifying if logging information should be printed

Returns:

Matrix M of predicted means/probabilities

Returns:

Predicted response vector

Returns:

scalar value of accuracy
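
A minimal end-to-end sketch (random data with labels 1/2; the three prediction outputs are assumed to compute into a list in the documented order):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import multiLogReg, multiLogRegPredict

np.random.seed(0)
X = np.random.rand(100, 10)
y = np.random.randint(1, 3, size=(100, 1)).astype(float)  # labels 1/2

with SystemDSContext() as sds:
  X_ds, y_ds = sds.from_numpy(X), sds.from_numpy(y)
  betas = multiLogReg(X_ds, y_ds)
  # three outputs: probabilities, predicted responses, accuracy
  [prob, y_pred, acc] = multiLogRegPredict(X_ds, betas, y_ds).compute()
  print(acc)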

systemds.operator.algorithm.na_locf(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function for imputing missing values using forward fill and backward fill techniques

Parameters:
  • X – Matrix X

  • option – String: “locf” (last observation carried forward) to do forward fill; “nocb” (next observation carried backward) to do backward fill

  • verbose – to print output on screen

Returns:

Matrix with no missing values
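
A minimal sketch, assuming missing values are encoded as NaN in the input matrix:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import na_locf

# a small matrix with missing entries (NaN assumed to encode missing)
X = np.array([[1.0, np.nan], [np.nan, 2.0], [3.0, np.nan]])

with SystemDSContext() as sds:
  filled = na_locf(sds.from_numpy(X), option="locf").compute()
  print(filled)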

systemds.operator.algorithm.naiveBayes(D: Matrix, C: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The naiveBayes-function computes the class conditional probabilities and class priors.

Parameters:
  • D – Data matrix of training instances, with N rows.

  • C – One dimensional column matrix with N rows.

  • laplace – Any Double value.

  • verbose – Boolean value.

Returns:

Class priors, a column matrix with one row per class.

Returns:

Class conditional probabilities, a matrix with one row per class.

systemds.operator.algorithm.naiveBayesPredict(X: Matrix, P: Matrix, C: Matrix)

The naiveBayesPredict-function predicts the scoring with a naive Bayes model.

Parameters:
  • X – Matrix of test data with N rows.

  • P – Class priors, a column matrix with one row per class.

  • C – Class conditional probabilities, a matrix with one row per class

Returns:

A matrix containing the raw class scores for each test instance.

Returns:

A matrix containing the predicted class labels.

systemds.operator.algorithm.normalize(X: Matrix)

Min-max normalization (a.k.a. min-max scaling) to range [0,1]. For matrices of positive values, this normalization preserves the input sparsity.

Parameters:

X – Input feature matrix of shape n-by-m

Returns:

Modified output feature matrix of shape n-by-m

Returns:

Column minima of shape 1-by-m

Returns:

Column maxima of shape 1-by-m

systemds.operator.algorithm.normalizeApply(X: Matrix, cmin: Matrix, cmax: Matrix)

Min-max normalization (a.k.a. min-max scaling) to range [0,1], given existing min-max ranges. For matrices of positive values, this normalization preserves the input sparsity. The validity of the provided min-max range and post-processing is under control of the caller.

Parameters:
  • X – Input feature matrix of shape n-by-m

  • cmin – Column minima of shape 1-by-m

  • cmax – Column maxima of shape 1-by-m

Returns:

Modified output feature matrix of shape n-by-m
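
A minimal sketch of the usual fit/apply pattern, assuming the three outputs of normalize compute into a list in the documented order (data, column minima, column maxima):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import normalize, normalizeApply

np.random.seed(0)
X_train = np.random.rand(80, 4) * 10
X_test = np.random.rand(20, 4) * 10

with SystemDSContext() as sds:
  [Xn, cmin, cmax] = normalize(sds.from_numpy(X_train)).compute()
  # re-apply the training ranges to unseen data
  Xt = normalizeApply(sds.from_numpy(X_test),
                      sds.from_numpy(cmin), sds.from_numpy(cmax)).compute()
  print(Xt.min(), Xt.max())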

systemds.operator.algorithm.outlier(X: Matrix, opposite: bool)

This outlier-function takes a matrix dataset as input and determines which point(s) deviate the most from the mean.

Parameters:
  • X – Matrix of Recoded dataset for outlier evaluation

  • opposite – (1)TRUE for evaluating outlier from upper quartile range, (0)FALSE for evaluating outlier from lower quartile range

Returns:

matrix indicating outlier values

systemds.operator.algorithm.outlierByArima(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Built-in function for detecting and repairing outliers in a time series, by training an ARIMA model and classifying values that are more than k standard deviations away from the predicted values as outliers.

Parameters:
  • X – Matrix X

  • k – threshold values 1, 2, 3 for 68%, 95%, 99.7% respectively (3-sigma rule)

  • repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values

  • p – non-seasonal AR order

  • d – non-seasonal differencing order

  • q – non-seasonal MA order

  • P – seasonal AR order

  • D – seasonal differencing order

  • Q – seasonal MA order

  • s – period in terms of number of time-steps

  • include_mean – If the mean should be included

  • solver – solver, is either “cg” or “jacobi”

Returns:

Matrix X with no outliers

systemds.operator.algorithm.outlierByIQR(X: Matrix, k: float, max_iterations: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function for detecting and repairing outliers using the interquartile range (IQR)

Parameters:
  • X – Matrix X

  • k – a constant used to discern outliers k*IQR

  • isIterative – iterative repair or single repair

  • repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values

  • max_iterations – values: 0 = iterate an arbitrary number of times until all outliers are removed, n = any constant defined by the user

  • verbose – flag specifying if logging information should be printed

Returns:

Matrix X with no outliers

systemds.operator.algorithm.outlierByIQRApply(X: Matrix, Q1: Matrix, Q3: Matrix, IQR: Matrix, k: float, repairMethod: int)

Builtin function for repairing outliers by IQR

Parameters:
  • X – Matrix X

  • Q1 – first quartile

  • Q3 – third quartile

  • IQR – Inter-quartile range

  • k – a constant used to discern outliers k*IQR

  • repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values

Returns:

Matrix X with no outliers

systemds.operator.algorithm.outlierBySd(X: Matrix, max_iterations: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function for detecting and repairing outliers using standard deviation

Parameters:
  • X – Matrix X

  • k – threshold values 1, 2, 3 for 68%, 95%, 99.7% respectively (3-sigma rule)

  • repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values

  • max_iterations – values: 0 = iterate an arbitrary number of times until all outliers are removed, n = any constant defined by the user

Returns:

Matrix X with no outliers

systemds.operator.algorithm.outlierBySdApply(X: Matrix, colMean: Matrix, colSD: Matrix, k: float, repairMethod: int)

Builtin function for repairing outliers by standard deviation

Parameters:
  • X – Matrix X

  • colMean – vector of column means

  • colSD – vector of column standard deviations

  • k – a constant used to discern outliers, k * standard deviations

  • isIterative – iterative repair or single repair

  • repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values

  • max_iterations – values: 0 = iterate an arbitrary number of times until all outliers are removed, n = any constant defined by the user

  • verbose – flag specifying if logging information should be printed

Returns:

Matrix X with no outliers

systemds.operator.algorithm.pca(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The function Principal Component Analysis (PCA) is used for dimensionality reduction

Parameters:
  • X – Input feature matrix

  • K – Number of reduced dimensions (i.e., columns)

  • Center – Indicates whether or not to center the feature matrix

  • Scale – Indicates whether or not to scale the feature matrix

Returns:

Output feature matrix with K columns

Returns:

Output dominant eigen vectors (can be used for projections)

Returns:

The column means of the input, subtracted to construct the PCA

Returns:

The Scaling of the values, to make each dimension same size.

systemds.operator.algorithm.pcaInverse(Y: Matrix, Clusters: Matrix, Centering: Matrix, ScaleFactor: Matrix)

Principal Component Analysis (PCA) for reconstructing an approximation of the original data. This method reconstructs an approximation of the original matrix and is useful for calculating how much information is lost by the PCA.

Parameters:
  • Y – Input features that have PCA applied to them

  • Clusters – The previously computed principal components

  • Centering – The column means of the PCA model, subtracted to construct the PCA

  • ScaleFactor – The scaling of each dimension in the PCA model

Returns:

Output feature matrix reconstructing and approximation of the original matrix

systemds.operator.algorithm.pcaTransform(X: Matrix, Clusters: Matrix, Centering: Matrix, ScaleFactor: Matrix)

Principal Component Analysis (PCA) for dimensionality-reduction prediction. This method is used to transform data which the PCA model was not trained on, e.g., to validate how good the PCA is, or to apply it in production.

Parameters:
  • X – Input feature matrix

  • Clusters – The previously computed principal components

  • Centering – The column means of the PCA model, subtracted to construct the PCA

  • ScaleFactor – The scaling of each dimension in the PCA model

Returns:

Output feature matrix dimensionally reduced by PCA
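
A minimal sketch of fitting PCA and projecting unseen data, assuming the four outputs of pca compute into a list in the documented order (reduced data, components, column means, scale factors):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import pca, pcaTransform

np.random.seed(0)
X_train = np.random.rand(100, 6)
X_new = np.random.rand(10, 6)

with SystemDSContext() as sds:
  [Xr, comps, centering, scale_f] = pca(sds.from_numpy(X_train), K=2).compute()
  # project unseen data with the trained model
  Xp = pcaTransform(sds.from_numpy(X_new), sds.from_numpy(comps),
                    sds.from_numpy(centering), sds.from_numpy(scale_f)).compute()
  print(Xp.shape)  # expected: (10, 2)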

systemds.operator.algorithm.pnmf(X: Matrix, rnk: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The pnmf-function implements Poisson Non-negative Matrix Factorization (PNMF). Matrix X is factorized into two non-negative matrices, W and H, based on a Poisson probabilistic assumption. This non-negativity makes the resulting matrices easier to inspect.

[Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang: Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. WWW 2010: 681-690]

Parameters:
  • X – Matrix of feature vectors.

  • rnk – Number of components into which matrix X is to be factored.

  • eps – Tolerance

  • maxi – Maximum number of conjugate gradient iterations.

  • verbose – If TRUE, ‘iter’ and ‘obj’ are printed.

Returns:

List of pattern matrices, one for each repetition.

Returns:

List of amplitude matrices, one for each repetition.

systemds.operator.algorithm.ppca(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script performs Probabilistic Principal Component Analysis (PPCA) on the given input data. It is based on the paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms, Tarek Elgamal et al.

Parameters:
  • X – n x m input feature matrix

  • k – indicates dimension of the new vector space constructed from eigen vectors

  • maxi – maximum number of iterations until convergence

  • tolobj – objective function tolerance value to stop ppca algorithm

  • tolrecerr – reconstruction error tolerance value to stop the algorithm

  • verbose – verbose debug output

Returns:

Output feature matrix with K columns

Returns:

Output dominant eigen vectors (can be used for projections)

systemds.operator.algorithm.randomForest(X: Matrix, Y: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script implements a classification random forest with both scale and categorical features.

Parameters:
  • X – Feature matrix X; note that X needs to be both recoded and dummy coded

  • Y – Label matrix Y; note that Y needs to be both recoded and dummy coded

  • R – Matrix which, for each feature in X, contains the following information: R[,1]: column ids; R[,2]: start indices; R[,3]: end indices. If R is not provided, by default all variables are assumed to be scale.

  • bins – Number of equiheight bins per scale feature to choose thresholds

  • depth – Maximum depth of the learned tree

  • num_leaf – Number of samples when splitting stops and a leaf node is added

  • num_samples – Number of samples at which point we switch to in-memory subtree building

  • num_trees – Number of trees to be learned in the random forest model

  • subsamp_rate – Parameter controlling the size of each tree in the forest; samples are selected from a Poisson distribution with parameter subsamp_rate (the default value is 1.0)

  • feature_subset – Parameter that controls the number of feature used as candidates for splitting at each tree node as a power of number of features in the dataset; by default square root of features (i.e., feature_subset = 0.5) are used at each tree node

  • impurity – Impurity measure: entropy or Gini (the default)

Returns:

Matrix M containing the learned trees, where each column corresponds to a node in a learned tree and each row contains the following information:
  M[1,j]: id of node j (in a complete binary tree)
  M[2,j]: tree id to which node j belongs
  M[3,j]: Offset (no. of columns) to the left child of j
  M[4,j]: Feature index of the feature that node j looks at if j is an internal node, otherwise 0
  M[5,j]: Type of the feature that node j looks at if j is an internal node: 1 for scale and 2 for categorical features; otherwise the label that leaf node j is supposed to predict
  M[6,j]: 1 if j is an internal node and the feature chosen for j is scale, otherwise the size of the subset of values stored in rows 7,8,… if j is categorical
  M[7:,j]: Only applicable for internal nodes. The threshold the example’s feature value is compared to is stored at M[7,j] if the feature chosen for j is scale; if the feature chosen for j is categorical, rows 7,8,… depict the value subset chosen for j

Returns:

Matrix C containing the number of times samples are chosen in each tree of the random forest

Returns:

Mappings from scale feature ids to global feature ids

Returns:

Mappings from categorical feature ids to global feature ids

systemds.operator.algorithm.scale(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function centers and scales the individual features in the input matrix (column-wise), using the z-score.

Parameters:
  • X – Input feature matrix

  • center – Indicates whether or not to center the feature matrix

  • scale – Indicates whether or not to scale the feature matrix

Returns:

Output feature matrix of the same shape as the input

Returns:

The column means of the input, subtracted if Center was TRUE

Returns:

The Scaling of the values, to make each dimension have similar value ranges

systemds.operator.algorithm.scaleApply(X: Matrix, Centering: Matrix, ScaleFactor: Matrix)

This function centers and scales the individual features in the input matrix (column-wise), using the provided centering and scaling matrices.

Parameters:
  • X – Input feature matrix

  • Centering – The column means to subtract from X (not done if empty)

  • ScaleFactor – The column scaling to multiply with X (not done if empty)

Returns:

Output feature matrix of the same shape as the input

systemds.operator.algorithm.scaleMinMax(X: Matrix)

This function performs min-max normalization (rescaling to [0,1]).

Parameters:

X – Input feature matrix

Returns:

Scaled output matrix

systemds.operator.algorithm.selectByVarThresh(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function drops features with variance <= thresh (by default dropping constant features).

Parameters:
  • X – Matrix of feature vectors.

  • thresh – The variance threshold at or below which features are dropped

Returns:

Matrix of feature vectors with the low-variance features removed.

systemds.operator.algorithm.setdiff(X: Matrix, Y: Matrix)

Builtin function that implements difference operation on vectors

Parameters:
  • X – input vector

  • Y – input vector

Returns:

vector with all elements that are present in X but not in Y

systemds.operator.algorithm.sherlock(X_train: Matrix, y_train: Matrix)

This function implements training phase of Sherlock: A Deep Learning Approach to Semantic Data Type Detection

[Hulsebos, Madelon, et al. “Sherlock: A deep learning approach to semantic data type detection.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.]

Split the feature matrix into four different feature categories and train neural networks on the respective single features. Then combine all trained features to train the final neural network.

Parameters:
  • X_train – matrix of feature vectors

  • y_train – matrix Y of class labels of semantic data type

Returns:

weights (parameters) matrices for character distributions

Returns:

biases vectors for character distributions

Returns:

weights (parameters) matrices for word embeddings

Returns:

biases vectors for word embeddings

Returns:

weights (parameters) matrices for paragraph vectors

Returns:

biases vectors for paragraph vectors

Returns:

weights (parameters) matrices for global statistics

Returns:

biases vectors for global statistics

Returns:

weights (parameters) matrices for combining all trained features (final)

Returns:

biases vectors for combining all trained features (final)

systemds.operator.algorithm.sherlockPredict(X: Matrix, cW1: Matrix, cb1: Matrix, cW2: Matrix, cb2: Matrix, cW3: Matrix, cb3: Matrix, wW1: Matrix, wb1: Matrix, wW2: Matrix, wb2: Matrix, wW3: Matrix, wb3: Matrix, pW1: Matrix, pb1: Matrix, pW2: Matrix, pb2: Matrix, pW3: Matrix, pb3: Matrix, sW1: Matrix, sb1: Matrix, sW2: Matrix, sb2: Matrix, sW3: Matrix, sb3: Matrix, fW1: Matrix, fb1: Matrix, fW2: Matrix, fb2: Matrix, fW3: Matrix, fb3: Matrix)

This function implements the prediction and evaluation phase of Sherlock, a deep learning approach to semantic data type detection: it splits the feature matrix into four different feature categories, predicts the class probability on the respective features, and then combines all predictions into the final predicted probabilities. [Hulsebos, Madelon, et al. “Sherlock: A deep learning approach to semantic data type detection.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.]

Parameters:
  • X – matrix of values which are to be classified

  • cW – weights (parameters) matrices for character distributions

  • cb – biases vectors for character distributions

  • wW – weights (parameters) matrices for word embeddings

  • wb – biases vectors for word embeddings

  • pW – weights (parameters) matrices for paragraph vectors

  • pb – biases vectors for paragraph vectors

  • sW – weights (parameters) matrices for global statistics

  • sb – biases vectors for global statistics

  • fW – weights (parameters) matrices for combining all trained features (final)

  • fb – biases vectors for combining all trained features (final)

Returns:

class probabilities of shape (N, K)

systemds.operator.algorithm.shortestPath(G: Matrix, sourceNode: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Computes the minimum distances (shortest-path) between a single source vertex and every other vertex in the graph.

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser and Grzegorz Czajkowski: Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010

Parameters:
  • G – adjacency matrix of the labeled graph: Such a graph can be directed (G is not symmetric) or undirected (G is symmetric). The values of G can be 0/1 (just specifying whether the nodes are connected or not) or integer values (representing the weight of the edges or the distances between nodes, 0 if not connected).

  • maxi – Integer max number of iterations accepted (0 for FALSE, i.e. max number of iterations not defined)

  • sourceNode – node index to calculate the shortest paths to all other nodes.

  • verbose – flag for verbose debug output

Returns:

Output matrix (double) of minimum distances (shortest-path) between vertices: The value of the ith row and the jth column of the output matrix is the minimum distance shortest-path from vertex i to vertex j. When the value of the minimum distance is infinity, the two nodes are not connected.

systemds.operator.algorithm.sigmoid(X: Matrix)

The sigmoid function is a type of activation function, also described as a squashing function, which limits the output to a range between 0 and 1; this makes it useful for predicting probabilities.

Parameters:

X – Matrix of feature vectors.

Returns:

Matrix of the same shape as X with the sigmoid applied element-wise.

systemds.operator.algorithm.slicefinder(X: Matrix, e: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This builtin function implements SliceLine, a linear-algebra-based ML model debugging technique for finding the top-k data slices where a trained model performs significantly worse than on the overall dataset. For a detailed description and experimental results, see: Svetlana Sagadeeva, Matthias Boehm: SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging (SIGMOD 2021).

Parameters:
  • X – Recoded dataset into Matrix

  • e – Trained model

  • k – Number of subsets required

  • maxL – maximum level L (conjunctions of L predicates), 0 unlimited

  • minSup – minimum support (min number of rows per slice)

  • alpha – weight [0,1]: 0 only size, 1 only error

  • tpEval – flag for task-parallel slice evaluation, otherwise data-parallel

  • tpBlksz – block size for task-parallel execution (num slices)

  • selFeat – flag for removing one-hot-encoded features that don’t satisfy the initial minimum-support constraint and/or have zero error

  • verbose – flag for verbose debug output

Returns:

top-k slices (k x ncol(X) if successful)

Returns:

score, size, error of slices (k x 3)

Returns:

debug matrix, populated with enumeration stats if verbose

systemds.operator.algorithm.smote(X: Matrix, mask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

Builtin function for handling class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE), by Nitesh V. Chawla et al., Journal of Artificial Intelligence Research 16 (2002), 321–357

Parameters:
  • X – Matrix of minority class samples

  • mask – 0/1 mask vector where 0 represent numeric value and 1 represent categorical value

  • s – Amount of SMOTE (percentage of oversampling), integral multiple of 100

  • k – Number of nearest neighbors

  • verbose – if the algorithm should be verbose

Returns:

Matrix of (N/100)-1 * nrow(X) synthetic minority class samples

systemds.operator.algorithm.softmax(S: Matrix)

Performs softmax on the given input matrix.

Parameters:

S – Inputs of shape (N, D).

Returns:

Outputs of shape (N, D).

systemds.operator.algorithm.split(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function splits the input data X and Y into contiguous or sampled train/test sets

Parameters:
  • X – Input feature matrix

  • Y – Input Labels

  • f – Train set fraction [0,1]

  • cont – contiguous splits, otherwise sampled

  • seed – The seed to randomly select rows in sampled mode

Returns:

Train split of feature matrix

Returns:

Test split of feature matrix

Returns:

Train split of label matrix

Returns:

Test split of label matrix
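
A minimal sketch, assuming the four outputs compute into a list in the documented order (train/test features, then train/test labels):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import split

np.random.seed(0)
X = np.random.rand(100, 5)
y = np.random.rand(100, 1)

with SystemDSContext() as sds:
  # sampled (non-contiguous) 80/20 split with a fixed seed
  [X_train, X_test, y_train, y_test] = split(
    sds.from_numpy(X), sds.from_numpy(y), f=0.8, cont=False, seed=7).compute()
  print(X_train.shape, X_test.shape)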

systemds.operator.algorithm.splitBalanced(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function splits the input data X and Y into contiguous train/test sets with a balanced class ratio. Related to [SYSTEMDS-2902], a dependency function for cleaning pipelines.

Parameters:
  • X – Input feature matrix

  • Y – Input Labels

  • f – Train set fraction [0,1]

  • verbose – flag for verbose output

Returns:

Train split of feature matrix

Returns:

Test split of feature matrix

Returns:

Train split of label matrix

Returns:

Test split of label matrix

systemds.operator.algorithm.stableMarriage(P: Matrix, A: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This script computes a solution for the stable marriage problem.

result description:

If cell [i,j] is non-zero, it means that acceptor i has matched with proposer j. Further, if cell [i,j] is non-zero, it holds the preference value that led to the match.

Proposers.mtx:
2.0,1.0,3.0
1.0,2.0,3.0
1.0,3.0,2.0

Since ordered=TRUE, this means that proposer 1 (row 1) likes acceptor 2 the most, followed by acceptor 1 and acceptor 3. If ordered=FALSE, this would mean that proposer 1 (row 1) likes acceptor 3 the most (since the value at [1,3] is the row max), followed by acceptor 1 (2.0 preference value) and acceptor 2 (1.0 preference value).

Acceptors.mtx:
3.0,1.0,2.0
2.0,1.0,3.0
3.0,2.0,1.0

Since ordered=TRUE, this means that acceptor 1 (row 1) likes proposer 3 the most, followed by proposer 1 and proposer 2. If ordered=FALSE, this would mean that acceptor 1 (row 1) likes proposer 1 the most (since the value at [1,1] is the row max), followed by proposer 3 (2.0 preference value) and proposer 2 (1.0 preference value).

Output.mtx (assuming ordered=TRUE):
0.0,0.0,3.0
0.0,3.0,0.0
1.0,0.0,0.0

Acceptor 1 has matched with proposer 3 (since [1,3] is non-zero) at a preference level of 3.0. Acceptor 2 has matched with proposer 2 (since [2,2] is non-zero) at a preference level of 3.0. Acceptor 3 has matched with proposer 1 (since [3,1] is non-zero) at a preference level of 1.0.

Parameters:
  • P – proposer matrix P. It must be a square matrix with no zeros.

  • A – acceptor matrix A. It must be a square matrix with no zeros.

  • ordered – If true, P and A are assumed to be ordered, i.e., the leftmost value in a row is the most preferred partner’s index; if unordered, the leftmost value in a row of P is the preference value for the acceptor with index 1, and vice-versa (higher is better).

  • verbose – if the algorithm should print verbosely

Returns:

Result Matrix

systemds.operator.algorithm.statsNA(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The statsNA-function prints summary stats about the distribution of missing values in a univariate time series.

result matrix contains the following:
  1. Length of time series (including NAs)

  2. Number of Missing Values (NAs)

  3. Percentage of Missing Values (#2/#1)

  4. Number of Gaps (consisting of one or more consecutive NAs)

  5. Average Gap Size - Average size of consecutive NAs for the NA gaps

  6. Longest NA gap - Longest series of consecutive missing values

  7. Most frequent gap size - Most frequently occurring gap size

  8. Gap size accounting for most NAs

Parameters:
  • X – Numeric Vector (‘vector’) object containing NAs

  • bins – Split number for bin stats. Number of bins the time series gets divided into. For each bin information about amount/percentage of missing values is printed.

  • verbose – Print detailed information. For print_only = TRUE, the missing value stats are printed with more information (“Stats for Bins” and “overview NA series”).

Returns:

Column vector where each row correspond to described values

systemds.operator.algorithm.steplm(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The steplm-function (stepwise linear regression) implements a classical forward feature selection method. This method iteratively runs what-if scenarios and greedily selects the next best feature until the Akaike information criterion (AIC) does not improve anymore. Each configuration trains a regression model via lm, which in turn calls either the closed-form lmDS or the iterative lmCG.

return: Matrix of regression parameters (the betas); its size depends on the icpt input value:
        OUTPUT SIZE:    OUTPUT CONTENTS:                 HOW TO PREDICT Y FROM X AND B:
icpt=0: ncol(X)   x 1   Betas for X only                 Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
icpt=1: ncol(X)+1 x 1   Betas for X and intercept        Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
icpt=2: ncol(X)+1 x 2   Col.1: betas for X & intercept   Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
                        Col.2: betas for shifted/rescaled X and intercept

In addition, the last run of linear regression provides some statistics in CSV format, one comma-separated name-value pair per line.

Parameters:
  • X – Matrix X of feature vectors

  • Y – 1-column matrix Y of response values

  • icpt – Intercept presence, shifting and rescaling the columns of X: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1

  • reg – Regularization constant (lambda) for the underlying lm calls

  • tol – Tolerance threshold; training continues until it is achieved

  • maxi – Maximum number of iterations; 0 means iterate until the tolerance is reached

  • verbose – If the algorithm should be verbose

Returns:

Matrix of regression parameters (the betas); its size depends on the icpt input value.

Returns:

Matrix of selected features ordered as computed by the algorithm.
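
A minimal sketch on random data; unpacking the two outputs (betas and selected features) from a single compute() call is an assumption about the multi-return API:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import steplm

np.random.seed(0)
X = np.random.rand(100, 10)
y = np.random.rand(100, 1)

with SystemDSContext() as sds:
  # icpt=1: add an intercept without shifting or rescaling X
  betas, selected = steplm(sds.from_numpy(X), sds.from_numpy(y), icpt=1).compute()
  print(betas)
  print(selected)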

systemds.operator.algorithm.stratstats(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The stratstats.dml script computes common bivariate statistics, such as correlation, slope, and their p-value, in parallel for many pairs of input variables in the presence of a confounding categorical variable.

Output contains (1st covariate, 2nd covariate) 40 columns with the following information:

Col 01: 1st covariate X-column number
Col 02: 1st covariate global presence count
Col 03: 1st covariate global mean
Col 04: 1st covariate global standard deviation
Col 05: 1st covariate stratified standard deviation
Col 06: R-squared, 1st covariate vs. strata
Col 07: adjusted R-squared, 1st covariate vs. strata
Col 08: P-value, 1st covariate vs. strata
Col 09-10: Reserved
Col 11: 2nd covariate Y-column number
Col 12: 2nd covariate global presence count
Col 13: 2nd covariate global mean
Col 14: 2nd covariate global standard deviation
Col 15: 2nd covariate stratified standard deviation
Col 16: R-squared, 2nd covariate vs. strata
Col 17: adjusted R-squared, 2nd covariate vs. strata
Col 18: P-value, 2nd covariate vs. strata
Col 19-20: Reserved
Col 21: Global 1st & 2nd covariate presence count
Col 22: Global regression slope (2nd vs. 1st covariate)
Col 23: Global regression slope standard deviation
Col 24: Global correlation = +/- sqrt(R-squared)
Col 25: Global residual standard deviation
Col 26: Global R-squared
Col 27: Global adjusted R-squared
Col 28: Global P-value for hypothesis “slope = 0”
Col 29-30: Reserved
Col 31: Stratified 1st & 2nd covariate presence count
Col 32: Stratified regression slope (2nd vs. 1st covariate)
Col 33: Stratified regression slope standard deviation
Col 34: Stratified correlation = +/- sqrt(R-squared)
Col 35: Stratified residual standard deviation
Col 36: Stratified R-squared
Col 37: Stratified adjusted R-squared
Col 38: Stratified P-value for hypothesis “slope = 0”
Col 39: Number of strata with at least two counted points
Col 40: Reserved

Parameters:
  • X – Matrix X that has all 1st covariates

  • Y – Matrix Y that has all 2nd covariates; the default value (empty) means “use X in place of Y”

  • S – Matrix S that has the stratum column; the default value (empty) means “use X in place of S”

  • Xcid – 1st covariate X-column indices; the default value (empty) means “use columns 1 : ncol(X)”

  • Ycid – 2nd covariate Y-column indices; the default value (empty) means “use columns 1 : ncol(Y)”

  • Scid – Column index of the stratum column in S

Returns:

Output matrix, one row per distinct pair
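
A sketch with the stratum stored as an extra column of X, so that the empty defaults for Y and S apply; that Scid accepts a plain integer column index is an assumption:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import stratstats

np.random.seed(0)
# Two covariate columns plus a stratum column (values 1 or 2) as column 3
data = np.hstack([np.random.rand(50, 2), np.random.randint(1, 3, (50, 1))])

with SystemDSContext() as sds:
  # Y and S default to X, so the stratum is read from column 3 of X
  out = stratstats(sds.from_numpy(data), Scid=3).compute()
  print(out.shape)  # one row per covariate pair, 40 columns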

systemds.operator.algorithm.symmetricDifference(X: Matrix, Y: Matrix)

Builtin function that implements the symmetric-difference set operation on vectors

Parameters:
  • X – input vector

  • Y – input vector

Returns:

vector with all elements that are in X or Y but not in both
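
A small sketch; with the inputs below the expected result contains the values 1 and 4, which appear in only one of the two vectors:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import symmetricDifference

X = np.array([[1.0], [2.0], [3.0]])
Y = np.array([[2.0], [3.0], [4.0]])

with SystemDSContext() as sds:
  diff = symmetricDifference(sds.from_numpy(X), sds.from_numpy(Y)).compute()
  print(diff)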

systemds.operator.algorithm.tSNE(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function performs dimensionality reduction using the t-SNE algorithm, based on the paper “Visualizing Data using t-SNE” by van der Maaten et al.

Parameters:
  • X – Data Matrix of shape (number of data points, input dimensionality)

  • reduced_dims – Output dimensionality

  • perplexity – Perplexity Parameter

  • lr – Learning rate

  • momentum – Momentum Parameter

  • max_iter – Number of iterations

  • seed – The seed used for initial values. If set to -1 random seeds are selected.

  • is_verbose – Print debug information

Returns:

Data Matrix of shape (number of data points, reduced_dims)
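
A minimal sketch that reduces 10-dimensional points to 2 dimensions; the keyword values are illustrative, not tuned settings:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import tSNE

np.random.seed(0)
X = np.random.rand(100, 10)

with SystemDSContext() as sds:
  embedding = tSNE(sds.from_numpy(X), reduced_dims=2, perplexity=30.0, seed=42).compute()
  print(embedding.shape)  # (100, 2)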

systemds.operator.algorithm.toOneHot(X: Matrix, numClasses: int)

The toOneHot-function encodes an unordered categorical vector into multiple binary vectors (one-hot encoding).

Parameters:
  • X – Vector with N integer entries between 1 and numClasses

  • numClasses – Number of columns; must be greater than or equal to the largest value in X

Returns:

One-hot-encoded matrix with shape (N, numClasses)
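
A small sketch encoding four labels from the range 1..3; numClasses is passed positionally as in the signature:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import toOneHot

# Categorical labels with integer values between 1 and 3
labels = np.array([[1.0], [3.0], [2.0], [3.0]])

with SystemDSContext() as sds:
  one_hot = toOneHot(sds.from_numpy(labels), 3).compute()
  print(one_hot)  # shape (4, 3), one binary column per class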

systemds.operator.algorithm.tomekLink(X: Matrix, y: Matrix)

The tomekLink-function performs undersampling by removing Tomek links for imbalanced multi-class problems. It computes Tomek links and drops them from the data matrix and label vector, dropping only the majority label and the corresponding point of each Tomek link.

Parameters:
  • X – Data Matrix (nxm)

  • y – Label Matrix (nx1), greater than zero

Returns:

Data Matrix without Tomek links

Returns:

Labels corresponding to under sampled data

Returns:

Indices of dropped rows/labels with respect to the input
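
A sketch on an imbalanced toy set; unpacking the three outputs from one compute() call is an assumption about the multi-return API:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import tomekLink

np.random.seed(0)
X = np.random.rand(40, 3)
y = np.vstack([np.ones((30, 1)), 2 * np.ones((10, 1))])  # imbalanced labels > 0

with SystemDSContext() as sds:
  # Cleaned data, cleaned labels, and indices of the dropped rows
  X_clean, y_clean, dropped = tomekLink(sds.from_numpy(X), sds.from_numpy(y)).compute()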

systemds.operator.algorithm.topk_cleaning(dataTrain: Frame, primitives: Frame, parameters: Frame, evaluationFunc: str, evalFunHp: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

This function cleans the top-K items (where K is given as input) for a given list of users. metaData[3, ncol(X)]: metaData[1] stores the mask, metaData[2] stores the schema, metaData[3] stores the FD mask.

systemds.operator.algorithm.underSampling(X: Matrix, Y: Matrix, ratio: float)

Builtin to perform random undersampling on data.

Parameters:
  • X – X data to sample from

  • Y – Y data to sample from; the same rows are sampled from X

  • ratio – The ratio of rows to sample

Returns:

The under-sampled data X

Returns:

The under-sampled data Y
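
A minimal sketch; the ratio is passed positionally as in the signature, and unpacking both outputs from one compute() call is an assumption:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import underSampling

np.random.seed(0)
X = np.random.rand(20, 4)
Y = np.random.randint(1, 3, (20, 1)).astype(float)

with SystemDSContext() as sds:
  # Sample half of the rows; the same rows are kept in X and Y
  X_s, Y_s = underSampling(sds.from_numpy(X), sds.from_numpy(Y), 0.5).compute()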

systemds.operator.algorithm.union(X: Matrix, Y: Matrix)

Builtin function that implements the union operation on vectors

Parameters:
  • X – input vector

  • Y – input vector

Returns:

matrix with all unique rows from X and Y combined
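
A small sketch; with the inputs below the expected result is the unique values 1, 2, 3 and 4:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import union

X = np.array([[1.0], [2.0], [3.0]])
Y = np.array([[3.0], [4.0]])

with SystemDSContext() as sds:
  u = union(sds.from_numpy(X), sds.from_numpy(Y)).compute()
  print(u)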

systemds.operator.algorithm.univar(X: Matrix, types: Matrix)

Computes univariate statistics for all attributes in a given data set

Parameters:
  • X – Input matrix of the shape (N, D)

  • types – Matrix of the shape (1, D) with feature types: 1 for scale, 2 for nominal, 3 for ordinal

Returns:

univariate statistics for all attributes
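
A minimal sketch that treats all three columns as scale features (type 1):

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import univar

np.random.seed(0)
X = np.random.rand(50, 3)
types = np.array([[1.0, 1.0, 1.0]])  # 1 = scale, 2 = nominal, 3 = ordinal

with SystemDSContext() as sds:
  stats = univar(sds.from_numpy(X), sds.from_numpy(types)).compute()
  print(stats)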

systemds.operator.algorithm.vectorToCsv(mask: Matrix)

This builtin function converts a vector into a CSV string of its non-zero indexes, e.g. [1 0 0 1 1 0 1] = “1,4,5,7”. Related to [SYSTEMDS-2662]; a dependency function for cleaning pipelines.

Parameters:

mask – Data vector (having 0 for excluded indexes)

Returns:

Indexes of non-zero entries, as a comma-separated string
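
A sketch using the mask from the example above; that the resulting string can be fetched directly via compute() is an assumption:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import vectorToCsv

mask = np.array([[1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]])

with SystemDSContext() as sds:
  # Expected: "1,4,5,7" (the indexes of the non-zero entries)
  csv = vectorToCsv(sds.from_numpy(mask)).compute()
  print(csv)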

systemds.operator.algorithm.winsorize(X: Matrix, verbose: bool, **kwargs: Dict[str, DAGNode | str | int | float | bool])

The winsorize-function removes outliers from the data. It does so by computing the upper and lower quantile bounds of the given data and then replacing any value that falls outside this range (less than the lower bound or greater than the upper bound).

Parameters:
  • X – Input feature matrix

  • verbose – To print output on screen

Returns:

Matrix without outlier values
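
A minimal sketch with one injected outlier; verbose is the required second argument from the signature:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import winsorize

np.random.seed(0)
X = np.random.rand(100, 2)
X[0, 0] = 100.0  # inject an outlier

with SystemDSContext() as sds:
  # Values outside the quantile range are replaced by the bound values
  X_w = winsorize(sds.from_numpy(X), False).compute()
  print(X_w[0, 0])  # should no longer be 100.0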

systemds.operator.algorithm.winsorizeApply(X: Matrix, qLower: Matrix, qUpper: Matrix)

winsorizeApply takes the upper and lower quantile values per column and removes outliers by replacing them with these upper and lower bound values.

Parameters:
  • X – Input feature matrix

  • qLower – row vector of lower bounds per column

  • qUpper – row vector of upper bounds per column

Returns:

Matrix without outlier values
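
A sketch where the per-column bounds are derived with numpy quantiles on training data; the 5%/95% cut-offs are an illustrative choice:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import winsorizeApply

np.random.seed(0)
X_train = np.random.rand(100, 2)
X_new = np.random.rand(10, 2)
# Per-column lower/upper bounds, here the 5% and 95% quantiles of the training data
qLower = np.quantile(X_train, 0.05, axis=0).reshape(1, -1)
qUpper = np.quantile(X_train, 0.95, axis=0).reshape(1, -1)

with SystemDSContext() as sds:
  X_capped = winsorizeApply(sds.from_numpy(X_new), sds.from_numpy(qLower),
                            sds.from_numpy(qUpper)).compute()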

systemds.operator.algorithm.xdummy1(X: Matrix)

This builtin function is here for debugging purposes

Parameters:

X – test input

Returns:

test result

systemds.operator.algorithm.xdummy2(X: Matrix)

This builtin function is here for debugging purposes

Parameters:

X – Debug input

Returns:

Debug output

Returns:

Debug output

systemds.operator.algorithm.xgboost(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses gradient boosting. This xgboost implementation supports classification and regression and is capable of working with categorical and scalar features.

Output explained: the first node is the init prediction, and each row contains the following information:

M[1,j]: id of node j (in a complete binary tree)
M[2,j]: tree id to which node j belongs
M[3,j]: Offset (no. of columns) to the left child of j if j is an internal node, otherwise 0
M[4,j]: Feature index of the feature that node j looks at if j is an internal node (scale feature id if the feature is scale, or categorical feature id if the feature is categorical), otherwise 0
M[5,j]: Type of the feature that node j looks at if j is an internal node: 0 = leaf, 1 = scalar, 2 = categorical
M[6:,j]: If j is an internal node: the threshold the example’s feature value is compared to is stored at M[6,j] if the feature chosen for j is scale; otherwise, if the feature chosen for j is categorical, rows 6,7,… depict the value subset chosen for j. If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0

Parameters:
  • X – Feature matrix X; note that X needs to be both recoded and dummy coded

  • y – Label matrix y; note that y needs to be both recoded and dummy coded

  • R – Matrix R; a 1 x n vector which, for each feature in X, contains its type: 1 (scalar feature) or 2 (categorical feature). For example, R = [1, 2] means feature 1 is scalar and feature 2 is categorical. If R is not provided, all variables are assumed to be scale (1) by default

  • sml_type – Supervised machine learning type: 1 = Regression (default), 2 = Classification

  • num_trees – Number of trees to be created in the xgboost model

  • learning_rate – Alias: eta. After each boosting step, the learning rate controls the weights of the new predictions

  • max_depth – Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit

  • lambda – L2 regularization term on weights. Increasing this value will make the model more conservative and reduce the number of leaves of a tree

Returns:

Matrix M where each column corresponds to a node in the learned tree
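
A minimal training sketch on random scale features (so the default R applies); the keyword values are illustrative:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import xgboost

np.random.seed(0)
X = np.random.rand(100, 5)
y = np.random.rand(100, 1)

with SystemDSContext() as sds:
  # Train a regression model (sml_type=1) with 10 trees
  M = xgboost(sds.from_numpy(X), sds.from_numpy(y),
              sml_type=1, num_trees=10).compute()
  print(M.shape)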

systemds.operator.algorithm.xgboostPredictClassification(X: Matrix, M: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses gradient boosting. This xgboost implementation supports classification and is capable of working with categorical features.

Parameters:
  • X – Matrix of feature vectors we want to predict (X_test)

  • M – The model created at xgboost

  • learning_rate – The learning rate used in the model

Returns:

The predictions of the samples using the given xgboost model. (y_prediction)

systemds.operator.algorithm.xgboostPredictRegression(X: Matrix, M: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses gradient boosting. This xgboost implementation supports regression.

Parameters:
  • X – Matrix of feature vectors we want to predict (X_test)

  • M – The model created at xgboost

  • learning_rate – The learning rate used in the model

Returns:

The predictions of the samples using the given xgboost model. (y_prediction)
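
A sketch combining training and regression prediction; the classification variant above follows the same pattern. Reusing the un-computed model node M inside the same context is an assumption about the lazy API:

import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import xgboost, xgboostPredictRegression

np.random.seed(0)
X_train = np.random.rand(100, 5)
y_train = np.random.rand(100, 1)
X_test = np.random.rand(10, 5)

with SystemDSContext() as sds:
  # Train the model, then feed the (lazy) model node into the predictor
  M = xgboost(sds.from_numpy(X_train), sds.from_numpy(y_train), sml_type=1)
  y_pred = xgboostPredictRegression(sds.from_numpy(X_test), M).compute()
  print(y_pred)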