Algorithms
SystemDS supports a range of machine learning algorithms out of the box.
For example, the lm algorithm can be used as follows:
# Import numpy and SystemDS
import numpy as np
from systemds.context import SystemDSContext
from systemds.operator.algorithm import lm
# Set a seed
np.random.seed(0)
# Generate matrix of feature vectors
features = np.random.rand(10, 15)
# Generate a 1-column matrix of response values
y = np.random.rand(10, 1)
# Compute the weights
with SystemDSContext() as sds:
    weights = lm(sds.from_numpy(features), sds.from_numpy(y)).compute()
    print(weights)
The output should be similar to:
[[-0.11538199]
[-0.20386541]
[-0.39956035]
[ 1.04078623]
[ 0.4327084 ]
[ 0.18954599]
[ 0.49858968]
[-0.26812763]
[ 0.09961844]
[-0.57000751]
[-0.43386048]
[ 0.55358873]
[-0.54638565]
[ 0.2205885 ]
[ 0.37957689]]
- systemds.operator.algorithm.WoE(X: Matrix, Y: Matrix, mask: Matrix)
Weight of evidence / information gain.
- Parameters:
X –
—
Y –
—
mask –
—
- Returns:
Weighted X matrix where the entropy mask is applied
- Returns:
An entropy matrix to apply to data
- systemds.operator.algorithm.WoEApply(X: Matrix, Y: Matrix, entropyMatrix: Matrix)
Weight of evidence / information gain, applied to new data.
- Parameters:
X –
—
Y –
—
entropyMatrix –
—
- Returns:
Weighted X matrix where the entropy mask is applied
- systemds.operator.algorithm.abstain(X: Matrix, Y: Matrix, threshold: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function calls the multiLogReg function, which solves multinomial logistic regression using a trust-region method.
- Parameters:
X – matrix of feature vectors
Y – matrix with category labels
threshold – threshold to clear otherwise return X and Y unmodified
verbose – flag specifying if logging information should be printed
- Returns:
abstained output X
- Returns:
abstained output Y
- systemds.operator.algorithm.als(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script computes an approximate factorization of a low-rank matrix X into two matrices U and V using different implementations of the Alternating-Least-Squares (ALS) algorithm. Matrices U and V are computed by minimizing a loss function (with regularization).
- Parameters:
X – Location to read the input matrix X to be factorized
rank – Rank of the factorization
regType – Regularization: “L2” = L2 regularization; f (U, V) = 0.5 * sum (W * (U %*% V - X) ^ 2) + 0.5 * reg * (sum (U ^ 2) + sum (V ^ 2)) “wL2” = weighted L2 regularization f (U, V) = 0.5 * sum (W * (U %*% V - X) ^ 2) + 0.5 * reg * (sum (U ^ 2 * row_nonzeros) + sum (V ^ 2 * col_nonzeros))
reg – Regularization parameter, no regularization if 0.0
maxi – Maximum number of iterations
check – Check for convergence after every iteration, i.e., updating U and V once
thr – Assuming check is set to TRUE, the algorithm stops and convergence is declared if the decrease in loss in any two consecutive iterations falls below this threshold; if check is FALSE thr is ignored
seed – Seed for the random parts of the algorithm
verbose – If the algorithm should run verbosely
- Returns:
An m x r matrix U, where r is the factorization rank
- Returns:
An r x n matrix V, where r is the factorization rank (so that U %*% V approximates the m x n input X)
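The two loss functions listed under regType can be evaluated by hand on a tiny example. The following is a minimal plain-Python sketch, not the SystemDS implementation; it assumes that W marks the nonzero (observed) cells of X and that row_nonzeros and col_nonzeros count those cells per row and per column. The helper name als_loss is hypothetical.

```python
def als_loss(X, U, V, reg, reg_type="L2"):
    """Evaluate f(U, V) for small dense matrices given as lists of lists."""
    m, n, r = len(X), len(X[0]), len(V)
    # Assumption: W is 1 where X has an observed (nonzero) entry, else 0.
    W = [[1.0 if X[i][j] != 0 else 0.0 for j in range(n)] for i in range(m)]

    # Data term: sum(W * (U %*% V - X) ^ 2)
    data = 0.0
    for i in range(m):
        for j in range(n):
            pred = sum(U[i][k] * V[k][j] for k in range(r))
            data += W[i][j] * (pred - X[i][j]) ** 2

    # Penalty weights: all ones for "L2", nonzero counts for "wL2".
    if reg_type == "L2":
        row_w, col_w = [1.0] * m, [1.0] * n
    elif reg_type == "wL2":
        row_w = [sum(W[i]) for i in range(m)]
        col_w = [sum(W[i][j] for i in range(m)) for j in range(n)]
    else:
        raise ValueError("reg_type must be 'L2' or 'wL2'")

    penalty = sum(U[i][k] ** 2 * row_w[i] for i in range(m) for k in range(r))
    penalty += sum(V[k][j] ** 2 * col_w[j] for k in range(r) for j in range(n))
    return 0.5 * data + 0.5 * reg * penalty


X = [[1.0, 3.0], [0.0, 2.0]]   # 2 x 2 input with one unobserved cell
U = [[1.0], [1.0]]             # 2 x 1 factor (rank 1)
V = [[1.0, 2.0]]               # 1 x 2 factor
loss_l2 = als_loss(X, U, V, reg=0.1, reg_type="L2")
loss_wl2 = als_loss(X, U, V, reg=0.1, reg_type="wL2")
print(loss_l2, loss_wl2)
```

With wL2, factor rows and columns that explain more observed cells are penalized more strongly, which is why the second loss is larger here.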
- systemds.operator.algorithm.alsCG(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script computes an approximate factorization of a low-rank matrix X into two matrices U and V using the Alternating-Least-Squares (ALS) algorithm with conjugate gradient. Matrices U and V are computed by minimizing a loss function (with regularization).
- Parameters:
X – Location to read the input matrix X to be factorized
rank – Rank of the factorization
regType – Regularization: “L2” = L2 regularization; f (U, V) = 0.5 * sum (W * (U %*% V - X) ^ 2) + 0.5 * reg * (sum (U ^ 2) + sum (V ^ 2)) “wL2” = weighted L2 regularization f (U, V) = 0.5 * sum (W * (U %*% V - X) ^ 2) + 0.5 * reg * (sum (U ^ 2 * row_nonzeros) + sum (V ^ 2 * col_nonzeros))
reg – Regularization parameter, no regularization if 0.0
maxi – Maximum number of iterations
check – Check for convergence after every iteration, i.e., updating U and V once
thr – Assuming check is set to TRUE, the algorithm stops and convergence is declared if the decrease in loss in any two consecutive iterations falls below this threshold; if check is FALSE thr is ignored
seed – Seed for the random parts of the algorithm
verbose – If the algorithm should run verbosely
- Returns:
An m x r matrix U, where r is the factorization rank
- Returns:
An r x n matrix V, where r is the factorization rank (so that U %*% V approximates the m x n input X)
- systemds.operator.algorithm.alsDS(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Alternating-Least-Squares (ALS) algorithm using a direct solve method for individual least squares problems (reg=”L2”). This script computes an approximate factorization of a low-rank matrix V into two matrices L and R. Matrices L and R are computed by minimizing a loss function (with regularization).
- Parameters:
X – Location to read the input matrix V to be factorized
rank – Rank of the factorization
reg – Regularization parameter, no regularization if 0.0
maxi – Maximum number of iterations
check – Check for convergence after every iteration, i.e., updating L and R once
thr – Assuming check is set to TRUE, the algorithm stops and convergence is declared if the decrease in loss in any two consecutive iterations falls below this threshold; if check is FALSE thr is ignored
seed – Seed for the random parts of the algorithm
verbose – If the algorithm should run verbosely
- Returns:
An m x r matrix (first factor), where r is the factorization rank
- Returns:
An r x n matrix (second factor), where r is the factorization rank
- systemds.operator.algorithm.alsPredict(userIDs: Matrix, I: Matrix, L: Matrix, R: Matrix)
This script computes the ratings/scores for a given list of userIDs using two factor matrices L and R. We assume that all users have rated at least once and all items have been rated at least once.
- Parameters:
userIDs – Column vector of user-ids (n x 1)
I – Indicator matrix user-id x user-id to exclude from scoring
L – The factor matrix L: user-id x feature-id
R – The factor matrix R: feature-id x item-id
- Returns:
The output user-id/item-id/score
- systemds.operator.algorithm.alsTopkPredict(userIDs: Matrix, I: Matrix, L: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script computes the top-K ratings/scores for a given list of userIDs using two factor matrices L and R. We assume that all users have rated at least once and all items have been rated at least once.
- Parameters:
userIDs – Column vector of user-ids (n x 1)
I – Indicator matrix user-id x user-id to exclude from scoring
L – The factor matrix L: user-id x feature-id
R – The factor matrix R: feature-id x item-id
K – The number of top-K items
- Returns:
A matrix containing the top-K item-ids with highest predicted ratings for the specified users (rows)
- Returns:
A matrix containing the top-K predicted ratings for the specified users (rows)
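The scoring step can be illustrated without SystemDS: each requested user's row of L is multiplied with R, excluded items are dropped, and the K highest scores are kept. This is a plain-Python sketch under stated assumptions; the helper name top_k_predict is hypothetical, ids are taken to be 1-based, and the exclusion mask is assumed to be a user x item indicator, which may differ from how the builtin interprets I.

```python
def top_k_predict(user_ids, L, R, K, exclude=None):
    """Return (top-K item ids, top-K scores), one row per requested user."""
    n_items = len(R[0])
    top_items, top_scores = [], []
    for uid in user_ids:
        factors = L[uid - 1]                      # 1-based user id
        scores = [sum(factors[f] * R[f][i] for f in range(len(factors)))
                  for i in range(n_items)]
        # Assumption: exclude[u][i] != 0 means item i is not scored for user u.
        candidates = [i for i in range(n_items)
                      if exclude is None or not exclude[uid - 1][i]]
        ranked = sorted(candidates, key=lambda i: -scores[i])[:K]
        top_items.append([i + 1 for i in ranked])  # 1-based item ids
        top_scores.append([scores[i] for i in ranked])
    return top_items, top_scores


L = [[1.0, 0.0], [0.0, 1.0]]            # user-id x feature-id
R = [[5.0, 1.0, 3.0], [2.0, 4.0, 0.5]]  # feature-id x item-id
items, scores = top_k_predict([1, 2], L, R, K=2)
print(items)   # item ids per user, best first
print(scores)  # matching predicted ratings
```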
- systemds.operator.algorithm.apply_pipeline(testData: Frame, pip: Frame, applyFunc: Frame, hp: Matrix, exState: List, iState: List, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script reads the dirty and clean data, applies the best pipeline to the dirty data, then classifies the cleaned dataset and checks whether it achieves the same classification accuracy as the original dataset.
- Parameters:
trainData –
—
testData –
—
metaData –
—
lp –
—
pip –
—
hp –
—
evaluationFunc –
—
evalFunHp –
—
isLastLabel –
—
correctTypos –
—
- Returns:
—
- systemds.operator.algorithm.arima(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function that implements ARIMA
- Parameters:
X – The input Matrix to apply Arima on.
max_func_invoc –
—
p – non-seasonal AR order
d – non-seasonal differencing order
q – non-seasonal MA order
P – seasonal AR order
D – seasonal differencing order
Q – seasonal MA order
s – period in terms of number of time-steps
include_mean – center to mean 0, and include in result
solver – solver, is either “cg” or “jacobi”
- Returns:
The calculated coefficients
- systemds.operator.algorithm.autoencoder_2layer(X: Matrix, num_hidden1: int, num_hidden2: int, max_epochs: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Trains a 2-layer autoencoder with minibatch SGD and step-size decay. If invoked with H1 > H2, it becomes a ‘bowtie’-structured autoencoder. Weights are initialized using the Glorot & Bengio (2010, AISTATS) initialization. The script standardizes the input before training (this can be turned off) and randomly reshuffles the rows before training. Currently, tanh is the activation function; it can be changed by re-implementing the DML-bodied function ‘func’.
- Parameters:
X – Filename where the input is stored
num_hidden1 – Number of neurons in the 1st hidden layer
num_hidden2 – Number of neurons in the 2nd hidden layer
max_epochs – Number of epochs to train for
full_obj – If TRUE, computes the objective function value (squared loss) at the end of each epoch. Note that computing the full objective can take a lot of time.
batch_size – Mini-batch size (training parameter)
step – Initial step size (training parameter)
decay – Decays step size after each epoch (training parameter)
mu – Momentum parameter (training parameter)
W1_rand – Weights might be initialized via input matrices
W2_rand –
—
W3_rand –
—
W4_rand –
—
- Returns:
Matrix storing weights between input layer and 1st hidden layer
- Returns:
Matrix storing bias between input layer and 1st hidden layer
- Returns:
Matrix storing weights between 1st hidden layer and 2nd hidden layer
- Returns:
Matrix storing bias between 1st hidden layer and 2nd hidden layer
- Returns:
Matrix storing weights between 2nd hidden layer and 3rd hidden layer
- Returns:
Matrix storing bias between 2nd hidden layer and 3rd hidden layer
- Returns:
Matrix storing weights between 3rd hidden layer and output layer
- Returns:
Matrix storing bias between 3rd hidden layer and output layer
- Returns:
Matrix storing the hidden (2nd) layer representation if needed
- systemds.operator.algorithm.bandit(X_train: Matrix, Y_train: Matrix, X_test: Matrix, Y_test: Matrix, metaList: List, evaluationFunc: str, evalFunHp: Matrix, lp: Frame, lpHp: Matrix, primitives: Frame, param: Frame, baseLineScore: float, cv: bool, **kwargs: Dict[str, DAGNode | str | int | float | bool])
In the bandit function, the objective is to find an arm that optimizes a known functional of the unknown arm-reward distributions.
- Parameters:
X_train –
—
Y_train –
—
X_test –
—
Y_test –
—
metaList –
—
evaluationFunc –
—
evalFunHp –
—
lp –
—
primitives –
—
params –
—
K –
—
R –
—
baseLineScore –
—
cv –
—
cvk –
—
verbose –
—
output –
—
- Returns:
—
- systemds.operator.algorithm.bivar(X: Matrix, S1: Matrix, S2: Matrix, T1: Matrix, T2: Matrix, verbose: bool)
For a given pair of attribute sets, compute bivariate statistics between all attribute pairs. Given, index1 = {A_11, A_12, … A_1m} and index2 = {A_21, A_22, … A_2n} compute bivariate stats for m*n pairs (A_1i, A_2j), (1<= i <=m) and (1<= j <=n).
- Parameters:
X – Input matrix
S1 – First attribute set {A_11, A_12, … A_1m}
S2 – Second attribute set {A_21, A_22, … A_2n}
T1 – Kind for attributes in S1 (kind=1 for scale, kind=2 for nominal, kind=3 for ordinal)
verbose – Print bivar stats
- Returns:
basestats_scale_scale as output with bivar stats
- Returns:
basestats_nominal_scale as output with bivar stats
- Returns:
basestats_nominal_nominal as output with bivar stats
- Returns:
basestats_ordinal_ordinal as output with bivar stats
- systemds.operator.algorithm.components(G: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Computes the connected components of a graph and returns a vector indicating the assignment of vertices to components, where each component is identified by the maximum vertex ID (i.e., row/column position of the input graph)
- Parameters:
X – Location to read the matrix of feature vectors
Y – Location to read the matrix with category labels
icpt – Intercept presence, shifting and rescaling X columns: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
tol – tolerance (“epsilon”)
reg – regularization parameter (lambda = 1/C); intercept is not regularized
maxi – max. number of outer (Newton) iterations
maxii – max. number of inner (conjugate gradient) iterations, 0 = no max
verbose – flag specifying if logging information should be printed
- Returns:
regression betas as output for prediction
- systemds.operator.algorithm.confusionMatrix(P: Matrix, Y: Matrix)
Accepts a vector of predictions and a one-hot-encoded matrix, computes the max value of each vector, and compares them. It then calculates and returns the sum of classifications and the average of each true class.
                   True Labels
                     1    2
                 1   TP | FP
  Predictions       ----+----
                 2   FN | TN
- Parameters:
P – vector of Predictions
Y – one-hot-encoded vector of gold-standard (actual) labels
- Returns:
The Confusion Matrix Sums of classifications
- Returns:
The Confusion Matrix averages of each true class
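The layout above (rows are predictions, columns are true labels) can be reproduced in a few lines of plain Python. This is only a sketch of the counting and per-true-class averaging, not the SystemDS code; it assumes labels are already decoded to integers 1..n_classes, and the helper name confusion_matrix is hypothetical.

```python
def confusion_matrix(pred, truth, n_classes):
    """Return (counts, averages) with rows = predictions, columns = true labels."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for p, t in zip(pred, truth):
        counts[p - 1][t - 1] += 1                 # labels are 1-based
    # Normalize each column by the number of samples of that true class.
    col_totals = [sum(counts[r][c] for r in range(n_classes))
                  for c in range(n_classes)]
    averages = [[counts[r][c] / col_totals[c] if col_totals[c] else 0.0
                 for c in range(n_classes)]
                for r in range(n_classes)]
    return counts, averages


pred = [1, 1, 2, 2, 1]
truth = [1, 2, 2, 2, 1]
counts, averages = confusion_matrix(pred, truth, n_classes=2)
print(counts)
print(averages)
```

Each column of the averages sums to 1, so the diagonal entry of a column is the recall of that true class.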
- systemds.operator.algorithm.cor(X: Matrix)
This function computes the correlation matrix.
- Parameters:
X – A Matrix Input to compute the correlation on
- Returns:
Correlation matrix of the input matrix
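For reference, a correlation matrix of the columns of X can be computed by centering each column and normalizing the pairwise dot products. The plain-Python sketch below assumes Pearson correlation between columns, which is the usual meaning but is not stated explicitly above; the helper name correlation_matrix is hypothetical.

```python
import math


def correlation_matrix(X):
    """Pearson correlation between the columns of X (list of rows)."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in X]

    def dot(a, b):
        return sum(centered[i][a] * centered[i][b] for i in range(n))

    # A constant column has zero norm and would cause a division by zero.
    norms = [math.sqrt(dot(j, j)) for j in range(m)]
    return [[dot(a, b) / (norms[a] * norms[b]) for b in range(m)]
            for a in range(m)]


X = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 1.0],
     [3.0, 6.0, 2.0]]
C = correlation_matrix(X)
print(C)
```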
- systemds.operator.algorithm.correctTypos(strings: Frame, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Corrects corrupted frames of strings. This algorithm operates on the assumption that most strings are correct and simply swaps strings that do not occur often with similar strings that occur more often.
References: Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3 (March 1964), 171–176. DOI:https://doi.org/10.1145/363958.363994
- Parameters:
strings – The nx1 input frame of corrupted strings
frequency_threshold – Strings that occur above this frequency level will not be corrected
distance_threshold – Max distance at which strings are considered similar
is_verbose – Print debug information
- Returns:
Corrected nx1 output frame
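The idea can be sketched in plain Python: rare strings are replaced by the closest frequent string if it is within the distance threshold. The sketch uses the restricted Damerau-Levenshtein distance (insertions, deletions, substitutions, and adjacent transpositions); whether SystemDS uses exactly this variant, and the default thresholds chosen here, are assumptions. The helper names osa_distance and correct_typos are hypothetical.

```python
from collections import Counter


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def correct_typos(strings, frequency_threshold=0.2, distance_threshold=2):
    """Replace rare strings by the nearest frequent string, if close enough."""
    counts = Counter(strings)
    total = len(strings)
    frequent = [s for s, c in counts.items() if c / total > frequency_threshold]
    corrected = []
    for s in strings:
        if counts[s] / total > frequency_threshold or not frequent:
            corrected.append(s)                    # frequent strings are kept
            continue
        best = min(frequent, key=lambda f: (osa_distance(s, f), -counts[f]))
        corrected.append(best if osa_distance(s, best) <= distance_threshold else s)
    return corrected


data = ["Austria"] * 5 + ["Germany"] * 5 + ["Asutria"]
fixed = correct_typos(data)
print(fixed[-1])
```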
- systemds.operator.algorithm.correctTyposApply(strings: Frame, distance_matrix: Matrix, dict: Frame, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Corrects corrupted frames of strings. This algorithm operates on the assumption that most strings are correct and simply swaps strings that do not occur often with similar strings that occur more often.
References: Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3 (March 1964), 171–176. DOI:https://doi.org/10.1145/363958.363994
TODO: future: add parameter for list of words that are sure to be correct
- Parameters:
strings – The nx1 input frame of corrupted strings
nullMask –
—
frequency_threshold – Strings that occur above this frequency level will not be corrected
distance_threshold – Max distance at which strings are considered similar
matrix (distance) –
—
dict –
—
- Returns:
Corrected nx1 output frame
- systemds.operator.algorithm.cox(X: Matrix, TE: Matrix, F: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script fits a Cox proportional hazards regression model. The Breslow method is used for handling ties, and the regression parameters are computed using the trust-region Newton method with conjugate gradient.
- Parameters:
X – Location to read the input matrix X containing the survival data containing the following information 1: timestamps 2: whether an event occurred (1) or data is censored (0) 3: feature vectors
TE – Column indices of X as a column vector which contain timestamp (first row) and event information (second row)
F – Column indices of X as a column vector which are to be used for fitting the Cox model
R – If factors (categorical variables) are available in the input matrix X, location to read matrix R containing the start and end indices of the factors in X R[,1]: start indices R[,2]: end indices Alternatively, user can specify the indices of the baseline level of each factor which needs to be removed from X; in this case the start and end indices corresponding to the baseline level need to be the same; if R is not provided by default all variables are considered to be continuous
alpha – Parameter to compute a 100*(1-alpha)% confidence interval for the betas
tol – Tolerance (“epsilon”)
moi – Max. number of outer (Newton) iterations
mii – Max. number of inner (conjugate gradient) iterations, 0 = no max
- Returns:
A D x 7 matrix M, where D denotes the number of covariates, with the following schema:
M[,1]: betas
M[,2]: exp(betas)
M[,3]: standard error of betas
M[,4]: Z
M[,5]: P-value
M[,6]: lower 100*(1-alpha)% confidence interval of betas
M[,7]: upper 100*(1-alpha)% confidence interval of betas
- Returns:
Two matrices containing a summary of some statistics of the fitted model:
1 - File S with the following format:
row 1: no. of observations
row 2: no. of events
row 3: log-likelihood
row 4: AIC
row 5: Rsquare (Cox & Snell)
row 6: max possible Rsquare
2 - File T with the following format:
row 1: Likelihood ratio test statistic, degree of freedom, P-value
row 2: Wald test statistic, degree of freedom, P-value
row 3: Score (log-rank) test statistic, degree of freedom, P-value
- Returns:
Additionally, the following matrices are stored (needed for prediction):
1 - A column matrix RT that contains the order-preserving recoded timestamps from X
2 - Matrix XO, which is matrix X with sorted timestamps
3 - Variance-covariance matrix of the betas COV
4 - A column matrix MF that contains the column indices of X with the baseline factors removed (if available)
- systemds.operator.algorithm.cspline(X: Matrix, Y: Matrix, inp_x: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Solves Cubic Spline Interpolation
Algorithm: implements https://en.wikipedia.org/wiki/Spline_interpolation#Algorithm_to_find_the_interpolating_cubic_spline. It uses a natural spline with q1’’(x0) == qn’’(xn) == 0.0
- Parameters:
X – 1-column matrix of x-value knots. It is assumed that the x values are monotonically increasing and that there are no duplicate points in X
Y – 1-column matrix of corresponding y values knots
inp_x – the given input x, for which the cspline will find predicted y
mode – Specifies the method for cspline (DS - Direct Solve, CG - Conjugate Gradient)
tol – Tolerance (epsilon); the conjugate gradient procedure terminates early if the L2 norm of the beta-residual is less than tolerance * its initial norm
maxi – Maximum number of conjugate gradient iterations, 0 = no maximum
- Returns:
Predicted value
- Returns:
Matrix of k parameters
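The referenced algorithm can be sketched in plain Python: it solves a tridiagonal system for the slopes k at the knots (the natural boundary condition fixes the second derivative to zero at both ends) and then evaluates the cubic on the interval containing inp_x. This is an illustrative direct-solve sketch following the Wikipedia formulation, not the SystemDS code; the helper name natural_cubic_spline is hypothetical.

```python
import bisect


def natural_cubic_spline(xs, ys, inp_x):
    """Return (predicted y at inp_x, slopes k at the knots)."""
    n = len(xs)
    dx = [xs[i + 1] - xs[i] for i in range(n - 1)]
    dy = [ys[i + 1] - ys[i] for i in range(n - 1)]

    # Tridiagonal system for the knot slopes k.
    lower, diag, upper, rhs = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    diag[0], upper[0] = 2 / dx[0], 1 / dx[0]
    rhs[0] = 3 * dy[0] / dx[0] ** 2
    for i in range(1, n - 1):
        lower[i] = 1 / dx[i - 1]
        diag[i] = 2 * (1 / dx[i - 1] + 1 / dx[i])
        upper[i] = 1 / dx[i]
        rhs[i] = 3 * (dy[i - 1] / dx[i - 1] ** 2 + dy[i] / dx[i] ** 2)
    lower[n - 1], diag[n - 1] = 1 / dx[n - 2], 2 / dx[n - 2]
    rhs[n - 1] = 3 * dy[n - 2] / dx[n - 2] ** 2

    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    k = [0.0] * n
    k[n - 1] = rhs[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):
        k[i] = (rhs[i] - upper[i] * k[i + 1]) / diag[i]

    # Evaluate the cubic piece on the interval [xs[i], xs[i + 1]].
    i = min(max(bisect.bisect_right(xs, inp_x) - 1, 0), n - 2)
    t = (inp_x - xs[i]) / dx[i]
    a = k[i] * dx[i] - dy[i]
    b = -k[i + 1] * dx[i] + dy[i]
    pred = (1 - t) * ys[i] + t * ys[i + 1] + t * (1 - t) * ((1 - t) * a + t * b)
    return pred, k


pred, k = natural_cubic_spline([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 1.5)
print(pred, k)
```

For the linear data in the usage example, the spline reduces to the straight line through the knots, which makes the result easy to check by hand.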
- systemds.operator.algorithm.csplineCG(X: Matrix, Y: Matrix, inp_x: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin that solves cubic spline interpolation using conjugate gradient algorithm
- Parameters:
X – 1-column matrix of x-value knots. It is assumed that the x values are monotonically increasing and that there are no duplicate points in X
Y – 1-column matrix of corresponding y values knots
inp_x – the given input x, for which the cspline will find predicted y.
tol – Tolerance (epsilon); the conjugate gradient procedure terminates early if the L2 norm of the beta-residual is less than tolerance * its initial norm
maxi – Maximum number of conjugate gradient iterations, 0 = no maximum
- Returns:
Predicted value
- Returns:
Matrix of k parameters
- systemds.operator.algorithm.csplineDS(X: Matrix, Y: Matrix, inp_x: float)
Builtin that solves cubic spline interpolation using a direct solver.
- Parameters:
X – 1-column matrix of x-value knots. It is assumed that the x values are monotonically increasing and that there are no duplicate points in X
Y – 1-column matrix of corresponding y values knots
inp_x – the given input x, for which the cspline will find predicted y.
- Returns:
Predicted value
- Returns:
Matrix of k parameters
- systemds.operator.algorithm.cvlm(X: Matrix, y: Matrix, k: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The cvlm-function is used for cross-validation of the provided data model. This function follows a non-exhaustive cross validation method. It uses lm and lmPredict functions to solve the linear regression and to predict the class of a feature vector with no intercept, shifting, and rescaling.
- Parameters:
X – Recorded Data set into matrix
y – 1-column matrix of response values.
k – Number of subsets needed; it should always be greater than 1 and less than nrow(X)
icpt – Intercept presence, shifting and rescaling the columns of X
reg – Regularization constant (lambda) for L2 regularization. Set to nonzero for highly dependent/sparse/numerous features
- Returns:
Response values
- Returns:
Validated data set
- systemds.operator.algorithm.dbscan(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Implements the DBSCAN clustering algorithm using a Euclidean distance matrix
- Parameters:
X – The input Matrix to do DBSCAN on.
eps – Maximum distance between two points for one to be considered reachable for the other.
minPts – Number of points in a neighborhood for a point to be considered as a core point (includes the point itself).
- Returns:
clustering Matrix
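The roles of eps and minPts can be seen in a compact plain-Python version of DBSCAN. This is a sketch of the textbook algorithm, not the SystemDS implementation; the output convention (cluster ids starting at 1, with 0 for noise) and the helper name dbscan are assumptions made for the example.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; 0 marks noise (assumed convention)."""
    n = len(points)
    # eps-neighborhoods from pairwise Euclidean distances (each includes itself).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [0] * n
    cluster = 0
    for i in range(n):
        if labels[i] != 0 or len(neighbors[i]) < min_pts:
            continue                      # already assigned, or not a core point
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == 0:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:   # only core points expand
                    frontier.extend(neighbors[j])
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```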
- systemds.operator.algorithm.dbscanApply(X: Matrix, clusterModel: Matrix, eps: float)
Implements the outlier detection/prediction algorithm using a DBSCAN model
- Parameters:
X – The input Matrix to do outlier detection on.
clusterModel – Model of clusters to predict outliers against.
eps – Maximum distance between two points for one to be considered reachable for the other.
- Returns:
Predicted outliers
- systemds.operator.algorithm.decisionTree(X: Matrix, Y: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin script implementing classification trees with scale and categorical features
- Parameters:
X – Feature matrix X; note that X needs to be both recoded and dummy coded
Y – Label matrix Y; note that Y needs to be both recoded and dummy coded
R – Matrix R, which contains the following information for each feature in X: R[1,] is a row vector indicating whether each feature is scalar or categorical; 1 indicates a scalar feature, and other positive integers indicate the number of categories. If R is not provided, all variables are assumed to be scale by default
bins – Number of equiheight bins per scale feature to choose thresholds
depth – Maximum depth of the learned tree
verbose – boolean specifying if the algorithm should print information while executing
- Returns:
Matrix M where each column corresponds to a node in the learned tree and each row contains the following information:
M[1,j]: id of node j (in a complete binary tree)
M[2,j]: Offset (no. of columns) to left child of j if j is an internal node, otherwise 0
M[3,j]: Feature index of the feature (scale feature id if the feature is scale or categorical feature id if the feature is categorical) that node j looks at if j is an internal node, otherwise 0
M[4,j]: Type of the feature that node j looks at if j is an internal node: holds the same information as R input vector
M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale, otherwise the size of the subset of values stored in rows 6,7,… if j is categorical. If j is a leaf node: number of misclassified samples reaching at node j
M[6:,j]: If j is an internal node: the threshold the example’s feature value is compared to is stored at M[6,j] if the feature chosen for j is scale; otherwise, if the feature chosen for j is categorical, rows 6,7,… depict the value subset chosen for j. If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0
- systemds.operator.algorithm.decisionTreePredict(M: Matrix, X: Matrix, strategy: str)
Builtin script implementing prediction based on classification trees with scale features using prediction methods of the Hummingbird paper (https://www.usenix.org/system/files/osdi20-nakandala.pdf).
- Parameters:
M – Decision tree matrix M, as generated by scripts/builtin/decisionTree.dml, where each column corresponds to a node in the learned tree and each row contains the following information:
M[1,j]: id of node j (in a complete binary tree)
M[2,j]: Offset (no. of columns) to left child of j if j is an internal node, otherwise 0
M[3,j]: Feature index of the feature (scale feature id if the feature is scale or categorical feature id if the feature is categorical) that node j looks at if j is an internal node, otherwise 0
M[4,j]: Type of the feature that node j looks at if j is an internal node: holds the same information as R input vector
M[5,j]: If j is an internal node: 1 if the feature chosen for j is scale, otherwise the size of the subset of values stored in rows 6,7,… if j is categorical. If j is a leaf node: number of misclassified samples reaching at node j
M[6:,j]: If j is an internal node: the threshold the example’s feature value is compared to is stored at M[6,j] if the feature chosen for j is scale; otherwise, if the feature chosen for j is categorical, rows 6,7,… depict the value subset chosen for j. If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0
X – Feature matrix X
strategy – Prediction strategy, can be one of [“GEMM”, “TT”, “PTT”], referring to “Generic matrix multiplication”, “Tree traversal”, and “Perfect tree traversal”, respectively
- Returns:
Matrix containing the predicted labels for X
- systemds.operator.algorithm.deepWalk(Graph: Matrix, w: int, d: int, gamma: int, t: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script performs DeepWalk on a given graph (https://arxiv.org/pdf/1403.6652.pdf)
- Parameters:
Graph – adjacency matrix of a graph (n x n)
w – window size
d – embedding size
gamma – walks per vertex
t – walk length
alpha – learning rate
beta – factor for decreasing learning rate
- Returns:
matrix of vertex/word representation (n x d)
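In DeepWalk, gamma truncated random walks of length t are started from every vertex, and the walks are then used as "sentences" to train the d-dimensional embeddings with a skip-gram model of window size w. The plain-Python sketch below covers only the walk-generation step; it is not the SystemDS code, and the helper name random_walks and the seed argument are hypothetical.

```python
import random


def random_walks(adj, gamma, t, seed=0):
    """Generate gamma uniform random walks of length t from every vertex."""
    rng = random.Random(seed)
    n = len(adj)
    walks = []
    for _ in range(gamma):
        order = list(range(n))
        rng.shuffle(order)                 # visit start vertices in random order
        for start in order:
            walk = [start]
            while len(walk) < t:
                nbrs = [j for j in range(n) if adj[walk[-1]][j]]
                if not nbrs:               # dead end: stop this walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks


# Adjacency matrix of a 4-cycle: 0-1-2-3-0
adj = [[0, 1, 0, 1],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [1, 0, 1, 0]]
walks = random_walks(adj, gamma=2, t=5)
print(walks[0])
```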
- systemds.operator.algorithm.denialConstraints(dataFrame: Frame, constraintsFrame: Frame)
This function considers some constraints indicating statements that can NOT happen in the data (denial constraints).
EXAMPLE:
dataFrame:
     rank       discipline  yrs.since.phd  yrs.service  sex     salary
1    Prof       B           19             18           Male    139750
2    Prof       B           20             16           Male    173200
3    AsstProf   B           3              3            Male    79750.56
4    Prof       B           45             39           Male    115000
5    Prof       B           40             40           Male    141500
6    AssocProf  B           6              6            Male    97000
7    Prof       B           30             23           Male    175000
8    Prof       B           45             45           Male    147765
9    Prof       B           21             20           Male    119250
10   Prof       B           18             18           Female  129000
11   AssocProf  B           12             8            Male    119800
12   AsstProf   B           7              2            Male    79800
13   AsstProf   B           1              1            Male    77700
constraintsFrame:
idx  constraint.type  group.by  group.variable  group.option  variable1      relation  variable2
1    variableCompare  FALSE                                   yrs.since.phd  <         yrs.service
2    instanceCompare  TRUE      rank            Prof          yrs.service    ><        salary
3    valueCompare     FALSE                                   salary         =         78182
4    variableCompare  TRUE      discipline      B             yrs.service    >         yrs.since.phd
Example: explanation of constraint 2: it cannot happen that one professor of rank Prof has more years of service than another but a lower salary.
- Parameters:
dataFrame – frame whose columns represent the variables of the data and whose rows correspond to different tuples or instances. It is recommended to have a column indexing the instances from 1 to N (N = number of instances).
constraintsFrame – frame with fixed columns and each row representing one constraint.
1. idx: (double) index of the constraint, from 1 to M (number of constraints)
2. constraint.type: (string) The constraints can be of 3 different kinds:
- variableCompare: for each instance, it will compare the values of two variables (with a relation <, > or =).
- valueCompare: for each instance, it will compare a fixed value and a variable value (with a relation <, > or =).
- instanceCompare: for every pair of instances, it will compare the relation between two variables, i.e., if the value of variable 1 in instance 1 is lower/higher than the value of variable 1 in instance 2, then the value of variable 2 in instance 1 can’t be lower/higher than the value of variable 2 in instance 2.
3. group.by: (boolean) if TRUE, only one group of data (defined by a variable option) will be considered for the constraint.
4. group.variable: (string, only if group.by TRUE) name of the variable (column in dataFrame) that will divide the data into groups.
5. group.option: (only if group.by TRUE) option of the group.variable that defines the group to consider.
6. variable1: (string) first variable to compare (name of column in dataFrame).
7. relation: (string) can be < , > or = in the case of variableCompare and valueCompare, and < >, < < , > < or > > in the case of instanceCompare
8. variable2: (string) second variable to compare (name of column in dataFrame) or fixed value for the case of valueCompare.
- Returns:
Matrix of 2 columns.
- First column shows the indexes of dataFrame that are wrong.
- Second column shows the index of the denial constraint that is fulfilled.
If there are no wrong instances to show (0 constraints fulfilled), WrongInstances=matrix(0,1,2)
- systemds.operator.algorithm.discoverFD(X: Matrix, Mask: Matrix, threshold: float)
Implements builtin for finding functional dependencies
- Parameters:
X – Input Matrix X, encoded Matrix if data is categorical
Mask – A row vector marking the features of interest, e.g., Mask = [1, 0, 1] excludes the second column from processing
threshold – threshold value in interval [0, 1] for robust FDs
- Returns:
matrix of functional dependencies
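One simple way to score how well a functional dependency A -> B holds is the ratio of distinct values of A to distinct (A, B) pairs: the ratio is 1.0 when every value of A maps to exactly one value of B and drops as violations appear. The plain-Python sketch below uses this score together with the Mask and threshold parameters; the score itself is an assumption for illustration and may differ from the measure SystemDS computes. The helper names fd_strength and discover_fd are hypothetical.

```python
def fd_strength(X, a, b):
    """Score the dependency column a -> column b in [0, 1] (assumed measure)."""
    distinct_a = len({row[a] for row in X})
    distinct_ab = len({(row[a], row[b]) for row in X})
    return distinct_a / distinct_ab


def discover_fd(X, mask, threshold):
    """Return an m x m matrix with the scores of dependencies >= threshold."""
    m = len(mask)
    fd = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            if a != b and mask[a] and mask[b]:   # skip masked-out columns
                score = fd_strength(X, a, b)
                if score >= threshold:
                    fd[a][b] = score
    return fd


X = [[1, 10, 7],
     [1, 10, 8],
     [2, 20, 7],
     [3, 20, 9]]
fd = discover_fd(X, mask=[1, 1, 0], threshold=0.9)
print(fd)
```

In the usage example, column 1 determines column 2 exactly, while the reverse direction only scores 2/3 and is filtered out by the threshold.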
- systemds.operator.algorithm.dist(X: Matrix)
Returns Euclidean distance matrix (distances between N n-dimensional points)
- Parameters:
X – Matrix to calculate the distance inside
- Returns:
Euclidean distance matrix
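For N points, the result is a symmetric N x N matrix with zeros on the diagonal. A minimal plain-Python equivalent, given only to make the output shape concrete (the helper name distance_matrix is hypothetical):

```python
import math


def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X."""
    return [[math.dist(a, b) for b in X] for a in X]


D = distance_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
print(D)
```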
- systemds.operator.algorithm.dmv(X: Frame, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The dmv-function is used to find disguised missing values utilising syntactical pattern recognition.
- Parameters:
X – Input Frame
threshold – Threshold value in interval [0, 1] for dominant pattern per column (e.g., 0.8 means that 80% of the entries per column must adhere to this pattern to be dominant)
replace – The string disguised missing values are replaced with
- Returns:
Frame X including detected disguised missing values
- systemds.operator.algorithm.ema(X: Frame, search_iterations: int, mode: str, freq: int, alpha: float, beta: float, gamma: float)
This function imputes values with exponential moving average (single, double or triple).
- Parameters:
X – Frame that contains time series data that needs to be imputed
search_iterations – Budget of iterations for parameter optimization, used if parameters weren’t set
mode – Type of EMA method. Either “single”, “double” or “triple”
freq – Seasonality when using triple EMA.
alpha – Alpha value for EMA
beta – Beta value for EMA
gamma – Gamma value for EMA
- Returns:
Frame with EMA results
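The "single" mode can be illustrated with simple exponential smoothing, where the level is updated as s_t = alpha * x_t + (1 - alpha) * s_(t-1) and a missing value is filled with the current level. This plain-Python sketch rests on that assumption about how gaps are filled; it does not cover the double and triple modes or the parameter search of the builtin, and the helper name ema_impute is hypothetical.

```python
def ema_impute(series, alpha):
    """Fill None entries with the running exponentially smoothed level."""
    level = None
    out = []
    for x in series:
        if x is None:
            out.append(level)       # stays None if no value has been seen yet
        else:
            # Update the level: s_t = alpha * x_t + (1 - alpha) * s_(t-1)
            level = x if level is None else alpha * x + (1 - alpha) * level
            out.append(x)
    return out


filled = ema_impute([10.0, 20.0, None, 30.0], alpha=0.5)
print(filled)
```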
- systemds.operator.algorithm.executePipeline(pipeline: Frame, Xtrain: Matrix, Ytrain: Matrix, Xtest: Matrix, Ytest: Matrix, metaList: List, hyperParameters: Matrix, flagsCount: int, verbose: bool, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function executes a pipeline.
- Parameters:
logical –
—
pipeline –
—
X –
—
Y –
—
Xtest –
—
Ytest –
—
metaList –
—
hyperParameters –
—
hpForPruning –
—
changesByOp –
—
flagsCount –
—
test –
—
verbose –
—
- Returns:
—
- Returns:
—
- Returns:
—
- Returns:
—
- Returns:
—
- Returns:
—
- Returns:
—
- systemds.operator.algorithm.ffPredict(model: List, X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function makes predictions given data and a trained feed-forward neural network model
- Parameters:
model – Trained feed-forward neural network model
X – Data used for making predictions
batch_size – Batch size
- Returns:
Predicted value
- systemds.operator.algorithm.ffTrain(X: Matrix, Y: Matrix, out_activation: str, loss_fcn: str, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function trains a simple feed-forward neural network. The architecture of the network is: affine1 -> relu -> dropout -> affine2 -> configurable output activation function. The hidden layer has 128 neurons. The dropout rate is 0.35. Input and output sizes are inferred from X and Y.
- Parameters:
X – Training data
Y – Labels/Target values
batch_size – Batch size
epochs – Number of epochs
learning_rate – Learning rate
out_activation – User specified output activation function. Possible values: “sigmoid”, “relu”, “lrelu”, “tanh”, “softmax”, “logits” (no activation).
loss_fcn – User specified loss function. Possible values: “l1”, “l2”, “log_loss”, “logcosh_loss”, “cel” (cross-entropy loss).
shuffle – Flag which indicates if dataset should be shuffled or not
validation_split – Fraction of training set used as validation set
seed – Seed for model initialization
verbose – Flag which indicates if function should print to stdout
- Returns:
Trained model which can be used in ffPredict
- systemds.operator.algorithm.fit_pipeline(trainData: Frame, testData: Frame, pip: Frame, applyFunc: Frame, hp: Matrix, evaluationFunc: str, evalFunHp: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script reads the dirty and clean data, applies the best pipeline to the dirty data, and then classifies both datasets to check whether the cleaned dataset performs the same as the original dataset in terms of classification accuracy
- Parameters:
trainData –
—
testData –
—
metaData –
—
lp –
—
pip –
—
hp –
—
evaluationFunc –
—
evalFunHp –
—
isLastLabel –
—
correctTypos –
—
- Returns:
—
- systemds.operator.algorithm.fixInvalidLengths(F1: Frame, mask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Fix invalid lengths
- Parameters:
F1 –
—
mask –
—
ql –
—
qu –
—
- Returns:
—
- Returns:
—
- systemds.operator.algorithm.fixInvalidLengthsApply(X: Frame, mask: Matrix, qLow: Matrix, qUp: Matrix)
Fix invalid lengths
- Parameters:
X –
—
mask –
—
ql –
—
qu –
—
- Returns:
—
- Returns:
—
- systemds.operator.algorithm.frameSort(F: Frame, mask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Built-in for sorting frames. Related to [SYSTEMDS-2662]: dependency function for cleaning pipelines.
- Parameters:
F – Data frame of string values
mask – matrix for identifying string columns
- Returns:
sorted dataset by column 1 in decreasing order
- systemds.operator.algorithm.frequencyEncode(X: Matrix, mask: Matrix)
Function for frequency conversion
- Parameters:
X – dataset x
mask – mask of the columns for frequency conversion
- Returns:
categorical columns are replaced with their frequencies
- Returns:
the frequency counts for the different categoricals
- systemds.operator.algorithm.frequencyEncodeApply(X: Matrix, freqCount: Matrix)
Applies previously computed frequency counts to new data
- Parameters:
X – dataset x
freqCount – the frequency counts for the different categoricals
- Returns:
categorical columns are replaced with their frequencies given
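The idea behind frequencyEncode and frequencyEncodeApply can be sketched in plain Python as follows. The function names, the list-of-rows representation, and the choice to map unseen categories to 0 are assumptions made for illustration; the SystemDS builtins operate on matrices and may treat unseen values differently.

```python
from collections import Counter


def frequency_encode_apply(rows, counts):
    """Encode rows with previously computed counts (unseen values map to 0)."""
    return [
        [counts[j][v] if counts[j] is not None else v for j, v in enumerate(row)]
        for row in rows
    ]


def frequency_encode(rows, mask):
    """Replace masked (categorical) columns by the frequency of each value."""
    # One Counter per masked column; None for columns that are left untouched.
    counts = [
        Counter(row[j] for row in rows) if mask[j] else None
        for j in range(len(mask))
    ]
    return frequency_encode_apply(rows, counts), counts


X = [[1, 5.0], [2, 7.5], [1, 1.0], [3, 2.0]]
encoded, counts = frequency_encode(X, mask=[1, 0])
new_rows = frequency_encode_apply([[1, 9.0], [4, 0.5]], counts)
print(encoded)
print(new_rows)
```

Keeping the counts separate from the encoded data mirrors the split between the fitting function and its Apply counterpart: the counts are computed once on training data and reused on new data.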
- systemds.operator.algorithm.garch(X: Matrix, kmax: int, momentum: float, start_stepsize: float, end_stepsize: float, start_vicinity: float, end_vicinity: float, sim_seed: int, verbose: bool)
This is a builtin function that implements GARCH(1,1), a statistical model used in analyzing time-series data where the variance error is believed to be serially autocorrelated.
COMMENTS: This implementation has some drawbacks, notably slow convergence of the optimization (a mix of simulated annealing and gradient descent). TODO: use BFGS or BHHH if available (these are the go-to methods). TODO: (only then) extend to garch(p,q); otherwise the search space is far too big for the current method.
- Parameters:
X – The input Matrix to apply GARCH on.
kmax – Number of iterations
momentum – Momentum for momentum-gradient descent (set to 0 to deactivate)
start_stepsize – Initial gradient-descent stepsize
end_stepsize – gradient-descent stepsize at end (linear descent)
start_vicinity – proportion of randomness of restart-location for gradient descent at beginning
end_vicinity – same at end (linear decay)
sim_seed – seed for simulation of process on fitted coefficients
verbose – verbosity, comments during fitting
- Returns:
simulated garch(1,1) process on fitted coefficients
- Returns:
variances of simulated fitted process
- Returns:
constant term of fitted process
- Returns:
1-st arch-coefficient of fitted process
- Returns:
1-st garch-coefficient of fitted process
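To make the returned quantities concrete, the sketch below simulates a GARCH(1,1) process from given coefficients using the standard recursion var_t = omega + alpha * eps_(t-1)^2 + beta * var_(t-1), where omega plays the role of the constant term, alpha the ARCH coefficient and beta the GARCH coefficient. The function name, the Gaussian innovations, and starting at the unconditional variance are assumptions for illustration; the fitting step of the builtin is not reproduced here.

```python
import math
import random


def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate n steps of a GARCH(1,1) process; returns (process, variances)."""
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha >= 0, beta >= 0, alpha + beta < 1")
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    process, variances = [], []
    for _ in range(n):
        eps = math.sqrt(var) * rng.gauss(0.0, 1.0)  # innovation with variance var
        process.append(eps)
        variances.append(var)
        # Conditional variance for the next step.
        var = omega + alpha * eps ** 2 + beta * var
    return process, variances


process, variances = simulate_garch11(n=5, omega=0.1, alpha=0.2, beta=0.7, seed=42)
print(process)
print(variances)
```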
- systemds.operator.algorithm.gaussianClassifier(D: Matrix, C: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Computes the parameters needed for Gaussian Classification. Thus it computes the following per class: the prior probability, the inverse covariance matrix, the mean per feature and the determinant of the covariance matrix. Furthermore (if not explicitly defined), it adds some small smoothing value along the variances, to prevent numerical errors / instabilities.
- Parameters:
D – Input matrix (training set)
C – Target vector
varSmoothing – Smoothing factor for variances
verbose – Print accuracy of the training set
- Returns:
Vector storing the class prior probabilities
- Returns:
Matrix storing the means of the classes
- Returns:
List of inverse covariance matrices
- Returns:
Vector storing the determinants of the classes
- systemds.operator.algorithm.getAccuracy(y: Matrix, yhat: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function computes the weighted and simple accuracy for given predictions
- Parameters:
y – Ground truth (Actual Labels)
yhat – Predictions (Predicted labels)
isWeighted – Flag for weighted or non-weighted accuracy calculation
- Returns:
accuracy of the predicted labels
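A minimal sketch of the two accuracy variants is shown below. Simple accuracy is the fraction of matching labels. For the weighted variant, this example assumes a class-balanced definition (the mean of the per-class accuracies), which may differ from the weighting and scaling used by the SystemDS builtin; the function name is hypothetical.

```python
def accuracy(y, yhat, is_weighted=False):
    """Simple accuracy, or (assumed) class-balanced accuracy if is_weighted."""
    if len(y) != len(yhat) or not y:
        raise ValueError("y and yhat must be non-empty and of equal length")
    if not is_weighted:
        return sum(a == b for a, b in zip(y, yhat)) / len(y)
    # Average the accuracy obtained within each true class.
    per_class = []
    for c in sorted(set(y)):
        idx = [i for i, a in enumerate(y) if a == c]
        per_class.append(sum(yhat[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


y_true = [1, 1, 1, 2]
y_pred = [1, 1, 1, 1]
simple = accuracy(y_true, y_pred)
balanced = accuracy(y_true, y_pred, is_weighted=True)
print(simple, balanced)
```

The example shows why the distinction matters: a predictor that always outputs the majority class scores well on simple accuracy but poorly once every class counts equally.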
- systemds.operator.algorithm.glm(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script solves GLM regression using NEWTON/FISHER scoring with trust regions. The glm-function is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models.
In addition, some GLM statistics are provided as console output by setting verbose=TRUE, one comma-separated name-value pair per line, as follows:
--------------------------------------------------------------------------------------------
TERMINATION_CODE    A positive integer indicating success/failure as follows: 1 = Converged successfully; 2 = Maximum number of iterations reached; 3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
BETA_MIN            Smallest beta value (regression coefficient), excluding the intercept
BETA_MIN_INDEX      Column index for the smallest beta value
BETA_MAX            Largest beta value (regression coefficient), excluding the intercept
BETA_MAX_INDEX      Column index for the largest beta value
INTERCEPT           Intercept value, or NaN if there is no intercept (if icpt=0)
DISPERSION          Dispersion used to scale deviance, provided as "disp" input parameter or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
DISPERSION_EST      Dispersion estimated from the dataset
DEVIANCE_UNSCALED   Deviance from the saturated model, assuming dispersion == 1.0
DEVIANCE_SCALED     Deviance from the saturated model, scaled by the DISPERSION value
--------------------------------------------------------------------------------------------
The Log file, when requested, contains the following per-iteration variables in CSV format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
--------------------------------------------------------------------------------------------
NUM_CG_ITERS        Number of inner (Conj.Gradient) iterations in this outer iteration
IS_TRUST_REACHED    1 = trust region boundary was reached, 0 = otherwise
POINT_STEP_NORM     L2-norm of iteration step from old point (i.e. "beta") to new point
OBJECTIVE           The loss function we minimize (i.e. negative partial log-likelihood)
OBJ_DROP_REAL       Reduction in the objective during this iteration, actual value
OBJ_DROP_PRED       Reduction in the objective predicted by a quadratic approximation
OBJ_DROP_RATIO      Actual-to-predicted reduction ratio, used to update the trust region
GRADIENT_NORM       L2-norm of the loss function gradient (NOTE: sometimes omitted)
LINEAR_TERM_MIN     The minimum value of X %*% beta, used to check for overflows
LINEAR_TERM_MAX     The maximum value of X %*% beta, used to check for overflows
IS_POINT_UPDATED    1 = new point accepted; 0 = new point rejected, old point restored
TRUST_DELTA         Updated trust region size, the "delta"
--------------------------------------------------------------------------------------------
SOME OF THE SUPPORTED GLM DISTRIBUTION FAMILIES AND LINK FUNCTIONS:
dfam  vpow  link  lpow  Distribution.link     Canonical?
---------------------------------------------------
1     0.0   1     -1.0  Gaussian.inverse
1     0.0   1     0.0   Gaussian.log
1     0.0   1     1.0   Gaussian.id           Yes
1     1.0   1     0.0   Poisson.log           Yes
1     1.0   1     0.5   Poisson.sqrt
1     1.0   1     1.0   Poisson.id
1     2.0   1     -1.0  Gamma.inverse         Yes
1     2.0   1     0.0   Gamma.log
1     2.0   1     1.0   Gamma.id
1     3.0   1     -2.0  InvGaussian.1/mu^2    Yes
1     3.0   1     -1.0  InvGaussian.inverse
1     3.0   1     0.0   InvGaussian.log
1     3.0   1     1.0   InvGaussian.id
1     *     1     *     AnyVariance.AnyLink
---------------------------------------------------
2     *     1     0.0   Binomial.log
2     *     1     0.5   Binomial.sqrt
2     *     2     *     Binomial.logit        Yes
2     *     3     *     Binomial.probit
2     *     4     *     Binomial.cloglog
2     *     5     *     Binomial.cauchit
---------------------------------------------------
- Parameters:
X – matrix X of feature vectors
Y – matrix Y with either 1 or 2 columns: if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
dfam – Distribution family code: 1 = Power, 2 = Binomial
vpow – Power for Variance defined as (mean)^power (ignored if dfam != 1): 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
link – Link function code: 0 = canonical (depends on distribution), 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
lpow – Power for Link function defined as (mean)^power (ignored if link != 1): -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
yneg – Response value for Bernoulli “No” label, usually 0.0 or -1.0
icpt – Intercept presence, X columns shifting and rescaling: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
reg – Regularization parameter (lambda) for L2 regularization
tol – Tolerance (epsilon)
disp – (Over-)dispersion value, or 0.0 to estimate it from data
moi – Maximum number of outer (Newton / Fisher Scoring) iterations
mii – Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
verbose – if the Algorithm should be verbose
- Returns:
Matrix beta, whose size depends on icpt: icpt=0: ncol(X) x 1; icpt=1: (ncol(X) + 1) x 1; icpt=2: (ncol(X) + 1) x 2
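The meaning of the vpow and lpow codes listed above can be illustrated with a short sketch. It evaluates the power variance function (mean)^vpow and the power link (mean)^lpow, treating lpow = 0.0 as the log link as stated in the parameter description. The function names are hypothetical and the sketch covers only the power family (dfam = 1, link = 1), not the fitting procedure.

```python
import math


def power_variance(mu, vpow):
    """Variance function of the power family: (mean) ** vpow."""
    return mu ** vpow


def power_link(mu, lpow):
    """Power link: log(mean) if lpow == 0, otherwise (mean) ** lpow."""
    if mu <= 0:
        raise ValueError("this sketch assumes a positive mean")
    return math.log(mu) if lpow == 0.0 else mu ** lpow


# Poisson variance (vpow = 1.0) with sqrt link (lpow = 0.5) at mean 4.0.
print(power_variance(4.0, 1.0), power_link(4.0, 0.5))
# Gamma variance (vpow = 2.0) with reciprocal link (lpow = -1.0) at mean 4.0.
print(power_variance(4.0, 2.0), power_link(4.0, -1.0))
```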
- systemds.operator.algorithm.glmPredict(X: Matrix, B: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Applies the estimated parameters of a GLM type regression to a new dataset
Additional statistics are printed one per line, in the following CSV format: NAME,[COLUMN],[SCALED],VALUE
NAME is the string identifier for the statistic, see the table below.
COLUMN is an optional integer value that specifies the Y-column for per-column statistics; note that a Binomial/Multinomial one-column Y input is converted into multi-column.
SCALED is an optional Boolean value (TRUE or FALSE) that tells us whether or not the input dispersion parameter (disp) scaling has been applied to this statistic.
VALUE is the value of the statistic.
NAME                COLUMN  SCALED  MEANING
---------------------------------------------------------------------------------------------
LOGLHOOD_Z                    +     Log-Likelihood Z-score (in st.dev's from mean)
LOGLHOOD_Z_PVAL               +     Log-Likelihood Z-score p-value
PEARSON_X2                    +     Pearson residual X^2 statistic
PEARSON_X2_BY_DF              +     Pearson X^2 divided by degrees of freedom
PEARSON_X2_PVAL               +     Pearson X^2 p-value
DEVIANCE_G2                   +     Deviance from saturated model G^2 statistic
DEVIANCE_G2_BY_DF             +     Deviance G^2 divided by degrees of freedom
DEVIANCE_G2_PVAL              +     Deviance G^2 p-value
AVG_TOT_Y             +             Average of Y column for a single response value
STDEV_TOT_Y           +             St.Dev. of Y column for a single response value
AVG_RES_Y             +             Average of column residual, i.e. of Y - mean(Y|X)
STDEV_RES_Y           +             St.Dev. of column residual, i.e. of Y - mean(Y|X)
PRED_STDEV_RES        +       +     Model-predicted St.Dev. of column residual
R2                    +             R^2 of Y column residual with bias included
ADJUSTED_R2           +             Adjusted R^2 of Y column residual with bias included
R2_NOBIAS             +             R^2 of Y column residual with bias subtracted
ADJUSTED_R2_NOBIAS    +             Adjusted R^2 of Y column residual with bias subtracted
---------------------------------------------------------------------------------------------
- Parameters:
X – Matrix X of records (feature vectors)
B – GLM regression parameters (the betas), with dimensions ncol(X) x k: do not add intercept; ncol(X)+1 x k: add intercept as given by the last B-row. If k > 1, use only B[, 1] unless it is Multinomial Logit (dfam=3)
ytest – Response matrix Y, with the following dimensions: nrow(X) x 1: for all distributions (dfam=1 or 2 or 3); nrow(X) x 2: for Binomial (dfam=2) given by (#pos, #neg) counts; nrow(X) x k+1: for Multinomial (dfam=3) given by category counts
dfam – GLM distribution family: 1 = Power, 2 = Binomial, 3 = Multinomial Logit
vpow – Power for Variance defined as (mean)^power (ignored if dfam != 1): 0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
link – Link function code: 0 = canonical (depends on distribution), 1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit; ignored if Multinomial
lpow – Power for Link function defined as (mean)^power (ignored if link != 1): -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
disp – Dispersion value, when available
verbose – Print statistics to stdout
- Returns:
Matrix M of predicted means/probabilities: nrow(X) x 1: for Power-type distributions (dfam=1); nrow(X) x 2: for Binomial distribution (dfam=2), column 2 is “No”; nrow(X) x k+1: for Multinomial Logit (dfam=3), col# k+1 is baseline
- systemds.operator.algorithm.gmm(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Gaussian Mixture Model training algorithm. There are four different types of covariance matrices i.e., VVV, EEE, VVI, VII and two initialization methods namely “kmeans” and “random”.
- Parameters:
X – Dataset input to fit the GMM model
n_components – Number of components to use in the Gaussian mixture model
model – “VVV”: unequal variance (full), each component has its own general covariance matrix; “EEE”: equal variance (tied), all components share the same general covariance matrix; “VVI”: spherical, unequal volume (diag), each component has its own diagonal covariance matrix; “VII”: spherical, equal volume (spherical), each component has its own single variance
init_param – Initialization algorithm to use to initialize the gaussian weights, valid inputs are: “kmeans” or “random”
iterations – Number of iterations
reg_covar – Regularization parameter for covariance matrix
tol – Tolerance value for convergence
seed – The seed value to initialize the values for fitting the GMM.
- Returns:
The predictions made by the gaussian model on the X input dataset
- Returns:
Probability of the predictions given the X input dataset
- Returns:
Number of estimated parameters
- Returns:
Bayesian information criterion for best iteration
- Returns:
Fitted clusters mean
- Returns:
Fitted precision matrix for each mixture
- Returns:
The weight matrix: A matrix whose [i,k]th entry is the probability that observation i in the test data belongs to the kth class
- systemds.operator.algorithm.gmmPredict(X: Matrix, weight: Matrix, mu: Matrix, precisions_cholesky: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Prediction function for a Gaussian Mixture Model (gmm). Computes posterior probabilities for new instances given the variance and mean of the fitted data.
- Parameters:
X – Dataset input to predict the labels from
weight – Weight of learned model: A matrix whose [i,k]th entry is the probability that observation i in the test data belongs to the kth class
mu – Fitted clusters mean
precisions_cholesky – Fitted precision matrix for each mixture
model – “VVV”: unequal variance (full), each component has its own general covariance matrix; “EEE”: equal variance (tied), all components share the same general covariance matrix; “VVI”: spherical, unequal volume (diag), each component has its own diagonal covariance matrix; “VII”: spherical, equal volume (spherical), each component has its own single variance
- Returns:
The predictions made by the gaussian model on the X input dataset
- Returns:
Probability of the predictions given the X input dataset
- systemds.operator.algorithm.gnmf(X: Matrix, rnk: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The gnmf-function does Gaussian Non-Negative Matrix Factorization. In this, a matrix X is factorized into two matrices W and H, such that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect.
References: [Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang: Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. WWW 2010: 681-690]
- Parameters:
X – Matrix of feature vectors.
rnk – Number of components into which matrix X is to be factored
eps – Tolerance
maxi – Maximum number of conjugate gradient iterations
- Returns:
List of pattern matrices, one for each repetition
- Returns:
List of amplitude matrices, one for each repetition
- systemds.operator.algorithm.gridSearch(X: Matrix, y: Matrix, train: str, predict: str, params: List, paramValues: List, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The gridSearch-function is used to find the hyper-parameters of a model that result in the most accurate predictions. This function takes train and eval functions by name.
- Parameters:
X – Input feature matrix
y – Input Matrix of vectors.
train – Name ft of the train function to call via ft(trainArgs)
predict – Name fp of the loss function to call via fp((predictArgs,B))
numB – Maximum number of parameters in model B (pass the max because the size may vary with parameters like icpt or multi-class classification)
params – List of varied hyper-parameter names
dataArgs – List of data parameters (to identify data parameters by name i.e. list(“X”, “Y”))
paramValues – List of matrices providing the parameter values as columnvectors for position-aligned hyper-parameters in ‘params’
trainArgs – named List of arguments to pass to the ‘train’ function, where gridSearch replaces enumerated hyper-parameter by name, if not provided or an empty list, the lm parameters are used
predictArgs – List of arguments to pass to the ‘predict’ function, where gridSearch appends the trained models at the end, if not provided or an empty list, list(X, y) is used instead
cv – flag enabling k-fold cross validation, otherwise training loss
cvk – if cv=TRUE, specifies the number of folds, otherwise ignored
verbose – flag for verbose debug output
- Returns:
Matrix[Double]: the trained model with minimal loss (by the ‘predict’ function). Multi-column models are returned as a column-major linearized column vector
- Returns:
one-row frame w/ optimal hyper-parameters (by ‘params’ position)
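The enumeration performed by a grid search can be sketched in a few lines of Python. In this illustration the train and loss functions are passed as Python callables rather than by name, and the toy functions toy_train and toy_loss are hypothetical stand-ins; cross validation (cv, cvk) is not covered.

```python
from itertools import product


def grid_search(train, loss, params, param_values):
    """Try every combination of hyper-parameter values; keep the lowest loss."""
    best = None
    for combo in product(*param_values):
        hp = dict(zip(params, combo))  # position-aligned names and values
        model = train(**hp)
        current = loss(model)
        if best is None or current < best[0]:
            best = (current, model, hp)
    return best[1], best[2]


# Toy stand-ins: the "model" is just the hyper-parameter tuple, and the loss
# is smallest at reg=0.1, icpt=1.
def toy_train(reg, icpt):
    return (reg, icpt)


def toy_loss(model):
    reg, icpt = model
    return (reg - 0.1) ** 2 + (icpt - 1) ** 2


best_model, best_hp = grid_search(
    toy_train, toy_loss, ["reg", "icpt"], [[0.01, 0.1, 1.0], [0, 1, 2]]
)
print(best_model, best_hp)
```

The number of trained models is the product of the lengths of the value lists, which is why grids are usually kept coarse.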
- systemds.operator.algorithm.hospitalResidencyMatch(R: Matrix, H: Matrix, capacity: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script computes a solution for the hospital residency match problem.
Residents.mtx:
2.0,1.0,3.0
1.0,2.0,3.0
1.0,2.0,0.0
Since it is an ORDERED matrix, this means that Resident 1 (row 1) likes hospital 2 the most, followed by hospital 1 and hospital 3. If it was UNORDERED, this would mean that resident 1 (row 1) likes hospital 3 the most (since the value at [1,3] is the row max), followed by hospital 1 (2.0 preference value) and hospital 2 (1.0 preference value).
Hospitals.mtx:
2.0,1.0,0.0
0.0,1.0,2.0
1.0,2.0,0.0
Since it is an UNORDERED matrix this means that Hospital 1 (row 1) likes Resident 1 the most (since the value at [1,1] is the row max).
capacity.mtx:
1.0
1.0
1.0
residencyMatch.mtx:
2.0,0.0,0.0
1.0,0.0,0.0
0.0,2.0,0.0
hospitalMatch.mtx:
0.0,1.0,0.0
0.0,0.0,2.0
1.0,0.0,0.0
Resident 1 has matched with Hospital 3 (since [1,3] is non-zero) at a preference level of 2.0. Resident 2 has matched with Hospital 1 (since [2,1] is non-zero) at a preference level of 1.0. Resident 3 has matched with Hospital 2 (since [3,2] is non-zero) at a preference level of 2.0.
- Parameters:
R – Residents matrix R. It must be an ORDERED matrix, i.e. the leftmost value in a row is the most preferred partner’s index.
H – Hospitals matrix H. It must be an UNORDERED matrix, i.e. the leftmost value in a row is the preference value for the partner with index 1, and so on (higher is better).
capacity – Capacity of the hospitals, matrix C. It must be a [n*1] matrix with non-zero values.
verbose – If the operation is verbose
- Returns:
Result Matrix If cell [i,j] is non-zero, it means that Resident i has matched with Hospital j. Further, if cell [i,j] is non-zero, it holds the preference value that led to the match.
- Returns:
Result Matrix If cell [i,j] is non-zero, it means that Resident i has matched with Hospital j. Further, if cell [i,j] is non-zero, it holds the preference value that led to the match.
- systemds.operator.algorithm.hyperband(X_train: Matrix, y_train: Matrix, X_val: Matrix, y_val: Matrix, params: List, paramRanges: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The hyperband-function is used for hyper parameter optimization and is based on multi-armed bandits and early elimination. Through multiple parallel brackets and consecutive trials it will return the hyper parameter combination which performed best on a validation dataset. A set of hyper parameter combinations is drawn from uniform distributions with given ranges; these make up the candidates for hyperband. Notes: hyperband is hard-coded for lmCG and uses lmPredict for validation; hyperband is hard-coded to use the number of iterations as a resource; hyperband can only optimize continuous hyperparameters.
- Parameters:
X_train – Input Matrix of training vectors
y_train – Labels for training vectors
X_val – Input Matrix of validation vectors
y_val – Labels for validation vectors
params – List of parameters to optimize
paramRanges – The min and max values for the uniform distributions to draw from. One row per hyper parameter, first column specifies min, second column max value.
R – Controls number of candidates evaluated
eta – Determines fraction of candidates to keep after each trial
verbose – If TRUE print messages are activated
- Returns:
1-column matrix of weights of best performing candidate
- Returns:
hyper parameters of best performing candidate
- systemds.operator.algorithm.img_brightness(img_in: Matrix, value: float, channel_max: int)
The img_brightness-function is an image data augmentation function. It changes the brightness of the image.
- Parameters:
img_in – Input matrix/image
value – The amount of brightness to be changed for the image
channel_max – Maximum value of the brightness of the image
- Returns:
Output matrix/image
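A brightness change amounts to adding a constant to every pixel. The sketch below additionally clamps the result to the range [0, channel_max], which is an assumption about the intended behaviour; the image is represented as a list of rows purely for illustration.

```python
def img_brightness(img, value, channel_max):
    """Add value to every pixel and clamp the result to [0, channel_max]."""
    return [[max(0, min(p + value, channel_max)) for p in row] for row in img]


image = [[0, 100], [200, 255]]
brighter = img_brightness(image, 60, 255)
darker = img_brightness(image, -120, 255)
print(brighter)
print(darker)
```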
- systemds.operator.algorithm.img_crop(img_in: Matrix, w: int, h: int, x_offset: int, y_offset: int)
The img_crop-function is an image data augmentation function. It cuts out a subregion of an image.
- Parameters:
img_in – Input matrix/image
w – The width of the subregion required
h – The height of the subregion required
x_offset – The horizontal coordinate in the image to begin the crop operation
y_offset – The vertical coordinate in the image to begin the crop operation
- Returns:
Cropped matrix/image
- systemds.operator.algorithm.img_cutout(img_in: Matrix, x: int, y: int, width: int, height: int, fill_value: float)
Image Cutout function replaces a rectangular section of an image with a constant value.
- Parameters:
img_in – Input image as 2D matrix with top left corner at [1, 1]
x – Column index of the top left corner of the rectangle (starting at 1)
y – Row index of the top left corner of the rectangle (starting at 1)
width – Width of the rectangle (must be positive)
height – Height of the rectangle (must be positive)
fill_value – The value to set for the rectangle
- Returns:
Output image as 2D matrix with top left corner at [1, 1]
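The cutout operation described above can be sketched as follows, using the same 1-based convention for x (column) and y (row). Clipping the rectangle to the image bounds is an assumption made here; the list-of-rows representation is for illustration only.

```python
def img_cutout(img, x, y, width, height, fill_value):
    """Fill a width x height rectangle whose top-left corner is (row y, column x)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    out = [row[:] for row in img]  # copy so the input stays unchanged
    rows, cols = len(out), len(out[0])
    # Convert 1-based indices to 0-based ranges and clip to the image.
    r0, r1 = max(y - 1, 0), min(y - 1 + height, rows)
    c0, c1 = max(x - 1, 0), min(x - 1 + width, cols)
    for r in range(r0, r1):
        for c in range(c0, c1):
            out[r][c] = fill_value
    return out


image = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
cut = img_cutout(image, x=2, y=1, width=2, height=2, fill_value=0)
print(cut)
```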
- systemds.operator.algorithm.img_invert(img_in: Matrix, max_value: float)
This is an image data augmentation function. It inverts an image.
- Parameters:
img_in – Input image
max_value – The maximum value pixels can have
- Returns:
Output image
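Inverting an image maps every pixel p to max_value - p. A minimal sketch, with the image given as a list of rows for illustration:

```python
def img_invert(img, max_value):
    """Invert an image: each pixel p becomes max_value - p."""
    return [[max_value - p for p in row] for row in img]


inverted = img_invert([[0, 64], [128, 255]], 255)
print(inverted)
```

Applying the function twice with the same max_value returns the original image.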
- systemds.operator.algorithm.img_mirror(img_in: Matrix, horizontal_axis: bool)
This function is an image data augmentation function. It flips an image on the X (horizontal) or Y (vertical) axis.
- Parameters:
img_in – Input matrix/image
horizontal_axis – Flag indicating whether the image is flipped on the X (horizontal) axis; otherwise it is flipped on the Y (vertical) axis
- Returns:
Flipped matrix/image
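The two flip directions can be sketched as below. The example assumes that flipping on the horizontal axis reverses the order of the rows (top becomes bottom) and that flipping on the vertical axis reverses each row (left becomes right).

```python
def img_mirror(img, horizontal_axis):
    """Flip on the horizontal axis (reverse rows) or vertical axis (reverse columns)."""
    if horizontal_axis:
        return [row[:] for row in reversed(img)]
    return [row[::-1] for row in img]


image = [[1, 2], [3, 4]]
flipped_h = img_mirror(image, True)
flipped_v = img_mirror(image, False)
print(flipped_h)
print(flipped_v)
```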
- systemds.operator.algorithm.img_posterize(img_in: Matrix, bits: int)
The Image Posterize function limits pixel values to 2^bits different values in the range [0, 255]. Assumes the input image can attain values in the range [0, 255].
- Parameters:
img_in – Input image
bits – The number of bits to keep for the values. 1 means black and white, 8 means every integer between 0 and 255.
- Returns:
Output image
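One common way to posterize 8-bit values is to keep only the most significant bits of each pixel, which yields 2^bits distinct levels. The sketch below assumes integer pixels in [0, 255] and this bit-truncation scheme; the builtin may round differently.

```python
def img_posterize(img, bits):
    """Keep the `bits` most significant bits of each 8-bit pixel value."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    shift = 8 - bits
    # Dropping the low bits and shifting back quantizes to 2**bits levels.
    return [[(p >> shift) << shift for p in row] for row in img]


image = [[0, 100, 200, 255]]
two_levels = img_posterize(image, 1)
four_levels = img_posterize(image, 2)
print(two_levels)
print(four_levels)
```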
- systemds.operator.algorithm.img_rotate(img_in: Matrix, radians: float, fill_value: float)
The Image Rotate function rotates the input image counter-clockwise around the center. Uses nearest neighbor sampling.
- Parameters:
img_in – Input image as 2D matrix with top left corner at [1, 1]
radians – The angle by which to rotate, in radians.
fill_value – The background color revealed by the rotation
- Returns:
Output image as 2D matrix with top left corner at [1, 1]
- systemds.operator.algorithm.img_sample_pairing(img_in1: Matrix, img_in2: Matrix, weight: float)
The image sample pairing function blends two images together.
- Parameters:
img_in1 – First input image
img_in2 – Second input image
weight – The weight given to the second image. 0 means only img_in1, 1 means only img_in2 will be visible
- Returns:
Output image
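Blending with a weight can be written as (1 - weight) * img_in1 + weight * img_in2, which matches the description that 0 shows only the first image and 1 only the second. A small sketch, assuming both images have the same shape:

```python
def img_sample_pairing(img1, img2, weight):
    """Blend two equally sized images: (1 - weight) * img1 + weight * img2."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return [
        [(1.0 - weight) * a + weight * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(img1, img2)
    ]


first = [[0, 100]]
second = [[200, 100]]
blended = img_sample_pairing(first, second, 0.25)
print(blended)
```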
- systemds.operator.algorithm.img_shear(img_in: Matrix, shear_x: float, shear_y: float, fill_value: float)
This function applies a shearing transformation to an image. Uses nearest neighbor sampling.
- Parameters:
img_in – Input image as 2D matrix with top left corner at [1, 1]
shear_x – Shearing factor for horizontal shearing
shear_y – Shearing factor for vertical shearing
fill_value – The background color revealed by the shearing
- Returns:
Output image as 2D matrix with top left corner at [1, 1]
- systemds.operator.algorithm.img_transform(img_in: Matrix, out_w: int, out_h: int, a: float, b: float, c: float, d: float, e: float, f: float, fill_value: float)
The Image Transform function applies an affine transformation to an image. Optionally resizes the image (without scaling). Uses nearest neighbor sampling.
- Parameters:
img_in – Input image as 2D matrix with top left corner at [1, 1]
out_w – Width of the output image
out_h – Height of the output image
a,b,c,d,e,f – The first two rows of the affine matrix in row-major order
fill_value – The background of the image
- Returns:
Output image as 2D matrix with top left corner at [1, 1]
- systemds.operator.algorithm.img_translate(img_in: Matrix, offset_x: float, offset_y: float, out_w: int, out_h: int, fill_value: float)
The Image Translate function translates the image. Optionally resizes the image (without scaling). Uses nearest neighbor sampling.
- Parameters:
img_in – Input image as 2D matrix with top left corner at [1, 1]
offset_x – The distance to move the image in x direction
offset_y – The distance to move the image in y direction
out_w – Width of the output image
out_h – Height of the output image
fill_value – The background of the image
- Returns:
Output image as 2D matrix with top left corner at [1, 1]
- systemds.operator.algorithm.impurityMeasures(X: Matrix, Y: Matrix, R: Matrix, method: str, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function computes the measure of impurity for the given dataset based on the passed method (gini or entropy). The current version expects the target vector to contain only 0 or 1 values.
- Parameters:
X – Feature matrix.
Y – Target vector containing 0 and 1 values.
R – Vector indicating whether a feature is categorical or continuous. 1 denotes a continuous feature, 2 denotes a categorical feature.
n_bins – Number of bins for binning in case of scale features.
method – String indicating the method to use; either “entropy” or “gini”.
- Returns:
(1 x ncol(X)) row vector containing information/gini gain for each feature of the dataset. In case of gini, the values denote the gini gains, i.e. how much impurity was removed with the respective split. The higher the value, the better the split. In case of entropy, the values denote the information gain, i.e. how much entropy was removed. The higher the information gain, the better the split.
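The gain computation described above can be sketched for a single categorical feature and a 0/1 target. The gain is the impurity of the whole target minus the size-weighted impurity of the groups induced by the feature. Function names are hypothetical, and binning of continuous features (n_bins) is not covered.

```python
import math
from collections import defaultdict


def impurity(labels, method):
    """Gini impurity or entropy (in bits) of a list of 0/1 labels."""
    p1 = sum(labels) / len(labels)
    probs = [p1, 1.0 - p1]
    if method == "gini":
        return 1.0 - sum(p * p for p in probs)
    if method == "entropy":
        return -sum(p * math.log2(p) for p in probs if p > 0)
    raise ValueError("method must be 'gini' or 'entropy'")


def gain(feature, labels, method):
    """Impurity removed by splitting the labels on a categorical feature."""
    groups = defaultdict(list)
    for f, y in zip(feature, labels):
        groups[f].append(y)
    n = len(labels)
    weighted = sum(len(g) / n * impurity(g, method) for g in groups.values())
    return impurity(labels, method) - weighted


labels = [0, 0, 1, 1]
informative = [1, 1, 2, 2]    # separates the classes perfectly
uninformative = [1, 2, 1, 2]  # tells nothing about the class
print(gain(informative, labels, "gini"), gain(informative, labels, "entropy"))
print(gain(uninformative, labels, "gini"), gain(uninformative, labels, "entropy"))
```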
- systemds.operator.algorithm.imputeByFD(X: Matrix, Y: Matrix, threshold: float, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Implements builtin for imputing missing values from observed values (if they exist) using robust functional dependencies
- Parameters:
X – Vector X, source attribute of functional dependency
Y – Vector Y, target attribute of functional dependency and imputation
threshold – threshold value in interval [0, 1] for robust FDs
verbose – flag for printing verbose debug output
- Returns:
Vector Y, with missing values mapped to a new max value
- Returns:
Vector Y, with imputed missing values
- systemds.operator.algorithm.imputeByFDApply(X: Matrix, Y_imp: Matrix)
Implements builtin for imputing missing values from observed values (if they exist) using robust functional dependencies
- Parameters:
X – Matrix X
source – source attribute to use for imputation and error correction
target – attribute to be fixed
threshold – threshold value in interval [0, 1] for robust FDs
- Returns:
Matrix with possible imputations
- systemds.operator.algorithm.imputeByMean(X: Matrix, mask: Matrix)
Imputes the data by mean value and, if the feature is categorical, by mode value. Related to [SYSTEMDS-2662]: dependency function for cleaning pipelines
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
mask – A 0/1 row vector for identifying numeric (0) and categorical features (1)
- Returns:
imputed dataset
- systemds.operator.algorithm.imputeByMeanApply(X: Matrix, imputedVec: Matrix)
Imputes the data by mean value and, if the feature is categorical, by mode value. Related to [SYSTEMDS-2662]: dependency function for cleaning pipelines
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
imputationVector – column mean vector
- Returns:
imputed dataset
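The following sketch shows mean/mode imputation with a 0/1 mask and a separate apply step that reuses the computed fill values on new data. Missing values are represented as NaN and the data as a list of rows; both choices, and the function names, are assumptions for illustration rather than the SystemDS implementation.

```python
import math
from collections import Counter


def impute_apply(rows, fill):
    """Replace NaN entries by the given per-column fill values."""
    return [[fill[j] if math.isnan(v) else v for j, v in enumerate(r)] for r in rows]


def impute_by_mean(rows, mask):
    """Fill NaNs with the column mean (mask 0) or the column mode (mask 1)."""
    fill = []
    for j, is_categorical in enumerate(mask):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        if is_categorical:
            fill.append(Counter(observed).most_common(1)[0][0])
        else:
            fill.append(sum(observed) / len(observed))
    return impute_apply(rows, fill), fill


nan = float("nan")
X = [[1.0, 2.0], [nan, 2.0], [3.0, nan], [5.0, 1.0]]
imputed, fill = impute_by_mean(X, mask=[0, 1])
print(imputed)
print(fill)
```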
- systemds.operator.algorithm.imputeByMedian(X: Matrix, mask: Matrix)
Imputes the data by median value and, if the feature is categorical, by mode value. Related to [SYSTEMDS-2662]: dependency function for cleaning pipelines
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
mask – A 0/1 row vector for identifying numeric (0) and categorical features (1)
- Returns:
imputed dataset
- systemds.operator.algorithm.imputeByMedianApply(X: Matrix, imputedVec: Matrix)
Imputes the data by median value and, if the feature is categorical, by mode value. Related to [SYSTEMDS-2662]: dependency function for cleaning pipelines
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
imputationVector – column median vector
- Returns:
imputed dataset
- systemds.operator.algorithm.imputeByMode(X: Matrix)
This function imputes the data by mode value. Related to [SYSTEMDS-2902]: dependency function for cleaning pipelines
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
- Returns:
imputed dataset
- systemds.operator.algorithm.imputeByModeApply(X: Matrix, imputedVec: Matrix)
Imputes missing values with the most frequent value (recoded data only), using a previously computed imputation vector. Related to [SYSTEMDS-2662]: dependency function for cleaning pipelines.
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
imputedVec – column mode vector (most frequent value per column)
- Returns:
imputed dataset
- systemds.operator.algorithm.intersect(X: Matrix, Y: Matrix)
Implements set intersection for numeric data
- Parameters:
X – matrix X, set A
Y – matrix Y, set B
- Returns:
intersection matrix, set of intersecting items
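As a rough illustration of set intersection on numeric data, here is a plain-Python sketch. The helper name `intersect` mirrors the builtin only for readability; returning distinct values in ascending order is an assumption, not documented behavior.

```python
def intersect(X, Y):
    """Return the distinct values present in both X and Y, in ascending order."""
    return sorted(set(X) & set(Y))


common = intersect([1, 2, 2, 3, 5], [2, 3, 4])
print(common)
```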
- systemds.operator.algorithm.km(X: Matrix, TE: Matrix, GI: Matrix, SI: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function that implements the analysis of survival data with Kaplan-Meier estimates
- Parameters:
X – Input matrix X containing the survival data: timestamps, whether event occurred (1) or data is censored (0), and a number of factors (categorical features) for grouping and/or stratifying
TE – Column indices of X which contain timestamps (first entry) and event information (second entry)
GI – Column indices of X corresponding to the factors to be used for grouping
SI – Column indices of X corresponding to the factors to be used for stratifying
alpha – Parameter to compute 100*(1-alpha)% confidence intervals for the survivor function and its median
err_type – Parameter to specify the error type: “greenwood” (the default) or “peto”
conf_type – Parameter to modify the confidence interval: “plain” keeps the lower and upper bound of the confidence interval unmodified, “log” (the default) corresponds to logistic transformation, and “log-log” corresponds to the complementary log-log transformation
test_type – If survival data for multiple groups is available, specifies which test to perform for comparing survival data across multiple groups: “none” (the default), “log-rank”, or “wilcoxon”
- Returns:
Matrix KM whose dimension depends on the number of groups (denoted by g) and strata (denoted by s) in the data: each collection of 7 consecutive columns in KM corresponds to a unique combination of groups and strata in the data with the following schema:
1. col: timestamp
2. col: no. at risk
3. col: no. of events
4. col: Kaplan-Meier estimate of survivor function surv
5. col: standard error of surv
6. col: lower 100*(1-alpha)% confidence interval for surv
7. col: upper 100*(1-alpha)% confidence interval for surv
- Returns:
Matrix M whose dimension depends on the number of groups (g) and strata (s) in the data (k denotes the number of factors used for grouping, i.e., ncol(GI), and l denotes the number of factors used for stratifying, i.e., ncol(SI)):
M[,1:k]: unique combination of values in the k factors used for grouping
M[,(k+1):(k+l)]: unique combination of values in the l factors used for stratifying
M[,k+l+1]: total number of records
M[,k+l+2]: total number of events
M[,k+l+3]: median of surv
M[,k+l+4]: lower 100*(1-alpha)% confidence interval of the median of surv
M[,k+l+5]: upper 100*(1-alpha)% confidence interval of the median of surv
If the number of groups and strata is equal to 1, M will have 4 columns with
M[,1]: total number of events
M[,2]: median of surv
M[,3]: lower 100*(1-alpha)% confidence interval of the median of surv
M[,4]: upper 100*(1-alpha)% confidence interval of the median of surv
- Returns:
If survival data from multiple groups is available and test_type is “log-rank” or “wilcoxon”, a 1 x 4 matrix T and a g x 5 matrix T_GROUPS_OE with
T_GROUPS_OE[,1] = no. of events
T_GROUPS_OE[,2] = observed value (O)
T_GROUPS_OE[,3] = expected value (E)
T_GROUPS_OE[,4] = (O-E)^2/E
T_GROUPS_OE[,5] = (O-E)^2/V
T[1,1] = no. of groups
T[1,2] = degree of freedom for Chi-squared distributed test statistic
T[1,3] = test statistic
T[1,4] = P-value
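To make the first four columns of the KM schema concrete, the following plain-Python sketch computes the Kaplan-Meier product-limit estimate for a single group and stratum. It is an illustration only: the helper name `kaplan_meier` and the list-based inputs are assumptions, and standard errors, confidence intervals, and group tests are omitted.

```python
def kaplan_meier(times, events):
    """Return (timestamp, no. at risk, no. of events, surv) per event time.

    times  : observed timestamps
    events : 1 if the event occurred at that time, 0 if the record is censored
    """
    table = []
    surv = 1.0
    for t in sorted(set(times)):
        # Records still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Events (not censorings) that happen exactly at time t.
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if d > 0:
            surv *= 1.0 - d / at_risk
            table.append((t, at_risk, d, surv))
    return table


table = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for row in table:
    print(row)
```

Censored records reduce the number at risk at later timestamps but never lower the survival estimate themselves, which is why the timestamp 4 (censored only) produces no row.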
- systemds.operator.algorithm.kmeans(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function that implements the k-Means clustering algorithm
- Parameters:
X – The input Matrix to do KMeans on.
k – Number of centroids
runs – Number of runs (with different initial centroids)
max_iter – Maximum number of iterations per run
eps – Tolerance (epsilon) for WCSS change ratio
is_verbose – Whether to print per-iteration stats
avg_sample_size_per_centroid – Average number of records per centroid in data samples
seed – The seed used for initial sampling. If set to -1 random seeds are selected.
- Returns:
The mapping of records to centroids
- Returns:
The output matrix with the centroids
- systemds.operator.algorithm.kmeansPredict(X: Matrix, C: Matrix)
Builtin function that does predictions based on a set of centroids provided.
- Parameters:
X – The input Matrix to do KMeans on.
C – The input Centroids to map X onto.
- Returns:
The mapping of records to centroids
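The mapping of records to centroids can be pictured as a nearest-centroid lookup. The sketch below is plain Python, not the SystemDS implementation; the helper name `kmeans_predict`, squared Euclidean distance, and 1-based centroid indices are assumptions made for illustration.

```python
def kmeans_predict(X, C):
    """Assign each row of X to the 1-based index of its nearest centroid in C."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # min() returns the first index on ties, so ties go to the lower centroid id.
    return [min(range(len(C)), key=lambda c: sq_dist(x, C[c])) + 1 for x in X]


X = [[0.0, 0.0], [9.0, 9.0], [1.0, 0.5]]
C = [[0.0, 0.0], [10.0, 10.0]]
labels = kmeans_predict(X, C)
print(labels)
```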
- systemds.operator.algorithm.knn(Train: Matrix, Test: Matrix, CL: Matrix, START_SELECTED: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script implements the KNN (k-nearest neighbors) algorithm.
- Parameters:
Train – The input matrix as features
Test – The input matrix for nearest neighbor search
CL – The input matrix as target
CL_T – The target type of matrix CL: whether columns in CL are continuous (=1), categorical (=2), or not specified (=0)
trans_continuous – Option flag for transforming continuous features to [-1,1]: FALSE = do not transform continuous variables; TRUE = transform continuous variables
k_value – k value for KNN; ignored if select_k is enabled
select_k – Use k selection algorithm to estimate k (TRUE means yes)
k_min – Min k value (available if select_k = 1)
k_max – Max k value (available if select_k = 1)
select_feature – Use feature selection algorithm to select features (TRUE means yes)
feature_max – Max feature selection
interval – Interval value for k selection (available if select_k = 1)
feature_importance – Use feature importance algorithm to estimate each feature (TRUE means yes)
predict_con_tg – Continuous target predict function: mean (=0) or median (=1)
START_SELECTED – feature selection initial value
- Returns:
Applied clusters to X
- Returns:
Cluster matrix
- Returns:
Feature importance value
- systemds.operator.algorithm.knnGraph(X: Matrix, k: int)
Builtin for k nearest neighbor graph construction
- Parameters:
X –
—
k –
—
- Returns:
—
- systemds.operator.algorithm.knnbf(X: Matrix, T: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script implements KNN (K Nearest Neighbor) algorithm.
- Parameters:
X –
—
T –
—
k_value –
—
- Returns:
—
- systemds.operator.algorithm.l2svm(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function implements binary-class Support Vector Machine (SVM) with squared slack variables (l2 regularization).
- Parameters:
X – Feature matrix X (shape: m x n)
Y – Label vector y of class labels (shape: m x 1), assumed binary in -1/+1 or 1/2 encoding.
intercept – Indicator if a bias column should be added to X and the model
epsilon – Tolerance for early termination if the reduction of objective function is less than epsilon times the initial objective
reg – Regularization parameter (lambda) for L2 regularization
maxIterations – Maximum number of conjugate gradient (outer) iterations
maxii – Maximum number of line search (inner) iterations
verbose – Indicator if training details should be printed
columnId – An optional class ID used in verbose print output, e.g., when L2SVM is used in MSVM.
- Returns:
Trained model/weights (shape: n x 1, w/ intercept: n+1)
- systemds.operator.algorithm.l2svmPredict(X: Matrix, W: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function that implements binary-class SVM with squared slack variables.
- Parameters:
X – matrix X of feature vectors to classify
W – matrix of the trained variables
verbose – Set to true if one wants print statements.
- Returns:
Classification Labels Raw, meaning not modified to clean labels of 1’s and -1’s
- Returns:
Classification Labels Maxed to ones and zeros.
- systemds.operator.algorithm.lasso(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function for the SpaRSA algorithm to perform lasso regression (SpaRSA .. Sparse Reconstruction by Separable Approximation)
- Parameters:
X – input feature matrix
y – matrix Y columns of the design matrix
tol – target convergence tolerance
M – history length
tau – regularization component
maxi – maximum number of iterations until convergence
verbose – if the builtin should be verbose
- Returns:
model matrix
- systemds.operator.algorithm.lenetPredict(model: List, X: Matrix, C: int, Hin: int, Win: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function makes prediction given data and trained LeNet model
- Parameters:
model – Trained LeNet model
X – Input data matrix, of shape (N, C*Hin*Win)
C – Number of input channels
Hin – Input height
Win – Input width
batch_size – Batch size
- Returns:
Predicted values
- systemds.operator.algorithm.lenetTrain(X: Matrix, Y: Matrix, X_val: Matrix, Y_val: Matrix, C: int, Hin: int, Win: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function trains a LeNet CNN. The architecture of the network is: conv1 -> relu1 -> pool1 -> conv2 -> relu2 -> pool2 -> affine3 -> relu3 -> affine4 -> softmax
- Parameters:
X – Input data matrix, of shape (N, C*Hin*Win)
Y – Target matrix, of shape (N, K)
X_val – Validation data matrix, of shape (N, C*Hin*Win)
Y_val – Validation target matrix, of shape (N, K)
C – Number of input channels (dimensionality of input depth)
Hin – Input height
Win – Input width
batch_size – Batch size
epochs – Number of epochs
lr – Learning rate
mu – Momentum value
decay – Learning rate decay
reg – Regularization strength
seed – Seed for model initialization
verbose – Flag indicates if function should print to stdout
- Returns:
Trained model which can be used in lenetPredict
- systemds.operator.algorithm.lm(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The lm-function solves linear regression using either the direct solve method or the conjugate gradient algorithm depending on the input size of the matrices (See lmDS-function and lmCG-function respectively).
- Parameters:
X – Matrix of feature vectors.
y – 1-column matrix of response values.
icpt – Intercept presence, shifting and rescaling the columns of X
reg – Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features
tol – Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm
maxi – Maximum number of conjugate gradient iterations. 0 = no maximum
verbose – If TRUE print messages are activated
- Returns:
The model fit
- systemds.operator.algorithm.lmCG(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The lmCG function solves linear regression using the conjugate gradient algorithm
- Parameters:
X – Matrix of feature vectors.
y – 1-column matrix of response values.
icpt – Intercept presence, shifting and rescaling the columns of X
reg – Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features
tol – Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm
maxi – Maximum number of conjugate gradient iterations. 0 = no maximum
verbose – If TRUE print messages are activated
- Returns:
The model fit
- systemds.operator.algorithm.lmDS(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The lmDS function solves linear regression using the direct solve method
- Parameters:
X – Matrix of feature vectors.
y – 1-column matrix of response values.
icpt – Intercept presence, shifting and rescaling the columns of X
reg – Regularization constant (lambda) for L2-regularization. Set to nonzero for highly dependent/sparse/numerous features
tol – Tolerance (epsilon); conjugate gradient procedure terminates early if L2 norm of the beta-residual is less than tolerance * its initial norm
maxi – Maximum number of conjugate gradient iterations. 0 = no maximum
verbose – If TRUE print messages are activated
- Returns:
The model fit
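As an illustration of a direct solve, the sketch below forms the regularized normal equations (X^T X + reg*I) beta = X^T y and solves them with a small Gaussian elimination. It assumes that this is what the direct solve amounts to, ignores the icpt handling entirely, and uses the hypothetical helper names `solve` and `lm_ds`; it is not the SystemDS implementation.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def lm_ds(X, y, reg=0.0):
    """Least-squares weights from the (optionally L2-regularized) normal equations."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (reg if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(A, b)


beta = lm_ds([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0])
print(beta)
```

A nonzero reg adds lambda to the diagonal, which keeps the system solvable when features are highly dependent.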
- systemds.operator.algorithm.lmPredict(X: Matrix, B: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The lmPredict-function predicts the class of a feature vector
- Parameters:
X – Matrix of feature vectors
B – 1-column matrix of weights.
ytest – test labels, used only for verbose output. can be set to matrix(0,1,1) if verbose output is not wanted
icpt – Intercept presence, shifting and rescaling the columns of X
verbose – If TRUE print messages are activated
- Returns:
1-column matrix of classes
- systemds.operator.algorithm.logSumExp(M: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Built-in LOGSUMEXP
- Parameters:
M – matrix to perform Log sum exp on.
margin – set margin = “row” if the logsumexp of rows is required, margin = “col” if the logsumexp of columns is required; if set to “none”, a single scalar is returned, computing the logsumexp of the whole matrix
- Returns:
a 1*1 matrix, row vector, or column vector, depending on the margin value
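For reference, log-sum-exp is usually computed by subtracting the maximum before exponentiating, which avoids overflow. The sketch below shows that standard trick in plain Python for the three margin settings; the helper name `log_sum_exp` and the list-of-lists input are assumptions, and the SystemDS builtin may differ in details.

```python
import math


def log_sum_exp(M, margin="none"):
    """Numerically stable log(sum(exp(.))) over rows, columns, or all cells."""
    def lse(values):
        m = max(values)
        # Shifting by the maximum keeps every exponent <= 0, so exp cannot overflow.
        return m + math.log(sum(math.exp(v - m) for v in values))

    if margin == "row":
        return [lse(row) for row in M]
    if margin == "col":
        return [lse(col) for col in zip(*M)]
    if margin == "none":
        return lse([v for row in M for v in row])
    raise ValueError("margin must be 'row', 'col' or 'none'")


M = [[1000.0, 1000.0], [0.0, 0.0]]
print(log_sum_exp(M, "row"))
print(log_sum_exp(M, "col"))
print(log_sum_exp(M, "none"))
```

A naive `math.log(sum(math.exp(v) ...))` would raise an overflow error on the first row of this input.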
- systemds.operator.algorithm.matrixProfile(ts: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function that computes the MatrixProfile of a time series efficiently using the SCRIMP++ algorithm.
References: Yan Zhu et al. 2018. Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds. 2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 837-846. DOI: 10.1109/ICDM.2018.00099. https://www.cs.ucr.edu/~eamonn/SCRIMP_ICDM_camera_ready_updated.pdf
- Parameters:
ts – Time series to profile
window_size – Sliding window size
sample_percent – Degree of approximation between zero and one (1 computes the exact solution)
is_verbose – Print debug information
- Returns:
The computed matrix profile
- Returns:
Indices of least distances
- systemds.operator.algorithm.mcc(predictions: Matrix, labels: Matrix)
Built-in function mcc: Matthews’ Correlation Coefficient for binary classification evaluation
- Parameters:
predictions – Vector of predicted 0/1 values. (requires setting ‘labels’ parameter)
labels – Vector of 0/1 labels.
- Returns:
Matthews’ Correlation Coefficient
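The coefficient follows the usual confusion-matrix formula, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A plain-Python sketch is shown below; returning 0.0 when the denominator is zero is a common convention assumed here, not documented behavior of the builtin.

```python
import math


def mcc(predictions, labels):
    """Matthews' Correlation Coefficient for 0/1 predictions and 0/1 labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined coefficient is reported as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement
print(mcc([1, 1, 0, 0], [0, 0, 1, 1]))  # perfect disagreement
print(mcc([1, 1, 0, 0], [1, 0, 0, 1]))  # no better than chance
```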
- systemds.operator.algorithm.mdedup(X: Frame, LHSfeatures: Matrix, LHSthreshold: Matrix, RHSfeatures: Matrix, RHSthreshold: Matrix, verbose: bool)
Implements builtin for deduplication using matching dependencies (e.g. Street 0.95, City 0.90 -> ZIP 1.0) and Jaccard distance.
- Parameters:
X – Input Frame X
LHSfeatures – A matrix 1xd with numbers of columns for MDs (e.g. Street 0.95, City 0.90 -> ZIP 1.0)
LHSthreshold – A matrix 1xd with threshold values in interval [0, 1] for MDs
RHSfeatures – A matrix 1xd with numbers of columns for MDs
RHSthreshold – A matrix 1xd with threshold values in interval [0, 1] for MDs
verbose – To print the output
- Returns:
Matrix nx1 of duplicates
- systemds.operator.algorithm.mice(X: Matrix, cMask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This Builtin function implements multiple imputation using Chained Equations (MICE)
Assumption: missing values are represented with an empty string, i.e., “,,” in the CSV file; variables with suffix n store continuous/numeric data and variables with suffix c store categorical data
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
cMask – A 0/1 row vector for identifying numeric (0) and categorical features (1)
iter – Number of iterations for multiple imputations
threshold – confidence value [0, 1] for robust imputation, values will only be imputed if the predicted value has probability greater than threshold, only applicable for categorical data
verbose – Boolean value.
- Returns:
imputed dataset
- systemds.operator.algorithm.miceApply(X: Matrix, meta: Matrix, threshold: float, dM: Frame, betaList: List)
This Builtin function implements multiple imputation using Chained Equations (MICE)
Assumption: missing values are represented with an empty string, i.e., “,,” in the CSV file; variables with suffix n store continuous/numeric data and variables with suffix c store categorical data
- Parameters:
X – Data Matrix (Recoded Matrix for categorical features)
meta – A meta matrix with each row storing values: 1) mask of original matrix, 2) information on columns with missing values in the original data (0 for no missing value in the column and 1 otherwise), 3) dist values in each column of the original data (1 for continuous columns and colMax for categorical)
threshold – confidence value [0, 1] for robust imputation, values will only be imputed if the predicted value has probability greater than threshold, only applicable for categorical data
dM – meta frame from OHE on original data
betaList – List of machine learning models trained for each column imputation
verbose – Boolean value.
- Returns:
imputed dataset
- systemds.operator.algorithm.msvm(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function implements a multi-class Support Vector Machine (SVM) with squared slack variables. The trained model comprises #classes one-against-the-rest binary-class l2svm classification models.
- Parameters:
X – Feature matrix X (shape: m x n)
Y – Label vector y of class labels (shape: m x 1), where max(Y) is assumed to be the number of classes
intercept – Indicator if a bias column should be added to X and the model
epsilon – Tolerance for early termination if the reduction of objective function is less than epsilon times the initial objective
reg – Regularization parameter (lambda) for L2 regularization
maxIterations – Maximum number of conjugate gradient (outer l2svm) iterations
verbose – Indicator if training details should be printed
- Returns:
Trained model/weights (shape: n x max(Y), w/ intercept: n+1)
- systemds.operator.algorithm.msvmPredict(X: Matrix, W: Matrix)
This script applies a trained MSVM
- Parameters:
X – matrix X of feature vectors to classify
W – matrix of the trained variables
- Returns:
Classification Labels Raw, meaning not modified to clean labels of 1’s and -1’s
- Returns:
Classification Labels Maxed to ones and zeros.
- systemds.operator.algorithm.multiLogReg(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Solves Multinomial Logistic Regression using Trust Region method. (See: Trust Region Newton Method for Logistic Regression, Lin, Weng and Keerthi, JMLR 9 (2008) 627-650) The largest label represents the baseline category; if label -1 or 0 is present, then it is the baseline label (and it is converted to the largest label).
- Parameters:
X – Location to read the matrix of feature vectors
Y – Location to read the matrix with category labels
icpt – Intercept presence, shifting and rescaling X columns: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
tol – tolerance (“epsilon”)
reg – regularization parameter (lambda = 1/C); intercept is not regularized
maxi – max. number of outer (Newton) iterations
maxii – max. number of inner (conjugate gradient) iterations, 0 = no max
verbose – flag specifying if logging information should be printed
- Returns:
regression betas as output for prediction
- systemds.operator.algorithm.multiLogRegPredict(X: Matrix, B: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script applies the estimated parameters of multinomial logistic regression to a new (test) dataset. It produces a matrix M of predicted means/probabilities and some statistics in CSV format (see below)
- Parameters:
X – Data Matrix X
B – Regression parameters betas
Y – Response vector Y
verbose – flag specifying if logging information should be printed
- Returns:
Matrix M of predicted means/probabilities
- Returns:
Predicted response vector
- Returns:
scalar value of accuracy
- systemds.operator.algorithm.na_locf(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function for imputing missing values using forward fill and backward fill techniques
- Parameters:
X – Matrix X
option – String “locf” (last observation carried forward) to do forward fill, or “nocb” (next observation carried backward) to do backward fill
verbose – to print output on screen
- Returns:
Matrix with no missing values
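The two options can be illustrated on a single column. The sketch below is plain Python with NaN marking missing values; the helper name `na_locf` mirrors the builtin only for readability. Note one limitation of this sketch: a gap at the start (for “locf”) or at the end (for “nocb”) has no observation to carry and stays NaN, whereas how the builtin treats such gaps is not stated here.

```python
import math


def na_locf(column, option="locf"):
    """Fill NaN gaps by carrying the last (locf) or next (nocb) observation."""
    if option not in ("locf", "nocb"):
        raise ValueError("option must be 'locf' or 'nocb'")
    # Backward fill is forward fill applied to the reversed column.
    values = list(column) if option == "locf" else list(reversed(column))
    last = float("nan")
    filled = []
    for v in values:
        if not math.isnan(v):
            last = v
        filled.append(last)
    return filled if option == "locf" else list(reversed(filled))


nan = float("nan")
col = [nan, 1.0, nan, nan, 4.0, nan]
forward = na_locf(col, "locf")
backward = na_locf(col, "nocb")
print(forward)
print(backward)
```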
- systemds.operator.algorithm.naiveBayes(D: Matrix, C: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The naiveBayes-function computes the class conditional probabilities and class priors.
- Parameters:
D – One dimensional column matrix with N rows.
C – One dimensional column matrix with N rows.
laplace – Any Double value.
verbose – Boolean value.
- Returns:
Class priors, One dimensional column matrix with N rows.
- Returns:
Class conditional probabilities, One dimensional column matrix with N rows.
- systemds.operator.algorithm.naiveBayesPredict(X: Matrix, P: Matrix, C: Matrix)
The naiveBayesPredict-function predicts the scoring with a naive Bayes model.
- Parameters:
X – Matrix of test data with N rows.
P – Class priors, One dimensional column matrix with N rows.
C – Class conditional probabilities, matrix with N rows
- Returns:
A matrix containing the top-K item-ids with highest predicted ratings.
- Returns:
A matrix containing predicted ratings.
- systemds.operator.algorithm.normalize(X: Matrix)
Min-max normalization (a.k.a. min-max scaling) to range [0,1]. For matrices of positive values, this normalization preserves the input sparsity.
- Parameters:
X – Input feature matrix of shape n-by-m
- Returns:
Modified output feature matrix of shape n-by-m
- Returns:
Column minima of shape 1-by-m
- Returns:
Column maxima of shape 1-by-m
- systemds.operator.algorithm.normalizeApply(X: Matrix, cmin: Matrix, cmax: Matrix)
Min-max normalization (a.k.a. min-max scaling) to range [0,1], given existing min-max ranges. For matrices of positive values, this normalization preserves the input sparsity. The validity of the provided min-max range and post-processing is under control of the caller.
- Parameters:
X – Input feature matrix of shape n-by-m
cmin – Column minima of shape 1-by-m
cmax – Column maxima of shape 1-by-m
- Returns:
Modified output feature matrix of shape n-by-m
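The relationship between normalize and normalizeApply can be sketched in plain Python: the first call derives column minima and maxima and rescales, the second reuses those ranges on new data. The helper names and the handling of constant columns (mapped to 0.0) are assumptions for illustration, not the SystemDS implementation.

```python
def normalize_apply(X, cmin, cmax):
    """Rescale each column with (x - min) / (max - min) using given ranges."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, cmin, cmax)]
            for row in X]


def normalize(X):
    """Min-max normalize X and return the data plus the column ranges."""
    cols = list(zip(*X))
    cmin = [min(c) for c in cols]
    cmax = [max(c) for c in cols]
    return normalize_apply(X, cmin, cmax), cmin, cmax


train = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
scaled, cmin, cmax = normalize(train)
new_rows = normalize_apply([[7.0, 25.0]], cmin, cmax)
print(scaled)
print(new_rows)
```

New data outside the training range maps outside [0,1] (here 7.0 becomes 1.5), which is the kind of post-processing the description leaves to the caller.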
- systemds.operator.algorithm.outlier(X: Matrix, opposite: bool)
This outlier-function takes a matrix data set as input from where it determines which point(s) have the largest difference from mean.
- Parameters:
X – Matrix of Recoded dataset for outlier evaluation
opposite – (1)TRUE for evaluating outlier from upper quartile range, (0)FALSE for evaluating outlier from lower quartile range
- Returns:
matrix indicating outlier values
- systemds.operator.algorithm.outlierByArima(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Built-in function for detecting and repairing outliers in time series, by training an ARIMA model and classifying values that are more than k standard-deviations away from the predicted values as outliers.
- Parameters:
X – Matrix X
k – threshold values 1, 2, 3 for 68%, 95%, 99.7% respectively (3-sigma rule)
repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values
p – non-seasonal AR order
d – non-seasonal differencing order
q – non-seasonal MA order
P – seasonal AR order
D – seasonal differencing order
Q – seasonal MA order
s – period in terms of number of time-steps
include_mean – If the mean should be included
solver – solver, is either “cg” or “jacobi”
- Returns:
Matrix X with no outliers
- systemds.operator.algorithm.outlierByIQR(X: Matrix, k: float, max_iterations: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function for detecting and repairing outliers using the interquartile range (IQR)
- Parameters:
X – Matrix X
k – a constant used to discern outliers k*IQR
isIterative – iterative repair or single repair
repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values
max_iterations – values: 0 = arbitrary number of iterations until all outliers are removed, n = any constant defined by user
verbose – flag specifying if logging information should be printed
- Returns:
Matrix X with no outliers
- systemds.operator.algorithm.outlierByIQRApply(X: Matrix, Q1: Matrix, Q3: Matrix, IQR: Matrix, k: float, repairMethod: int)
Builtin function for repairing outliers by IQR
- Parameters:
X – Matrix X
Q1 – first quartile
Q3 – third quartile
IQR – Inter-quartile range
k – a constant used to discern outliers k*IQR
repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values
- Returns:
Matrix X with no outliers
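A single-column sketch of the repair step, in plain Python. It assumes the usual fences [Q1 - k*IQR, Q3 + k*IQR] for "k*IQR" and uses NaN to stand for a missing value; the helper name `outlier_by_iqr_apply` and the list-based input are hypothetical, not the SystemDS implementation.

```python
def outlier_by_iqr_apply(column, q1, q3, iqr, k, repair_method):
    """Repair values outside [q1 - k*iqr, q3 + k*iqr] in one column."""
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = [v < lower or v > upper for v in column]
    if repair_method == 0:  # delete rows having outliers
        return [v for v, out in zip(column, is_outlier) if not out]
    if repair_method == 1:  # replace outliers with zeros
        return [0.0 if out else v for v, out in zip(column, is_outlier)]
    if repair_method == 2:  # replace outliers with missing values (NaN here)
        return [float("nan") if out else v for v, out in zip(column, is_outlier)]
    raise ValueError("repair_method must be 0, 1 or 2")


col = [1.0, 2.0, 3.0, 4.0, 100.0]
deleted = outlier_by_iqr_apply(col, q1=2.0, q3=4.0, iqr=2.0, k=1.5, repair_method=0)
zeroed = outlier_by_iqr_apply(col, q1=2.0, q3=4.0, iqr=2.0, k=1.5, repair_method=1)
print(deleted)
print(zeroed)
```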
- systemds.operator.algorithm.outlierBySd(X: Matrix, max_iterations: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function for detecting and repairing outliers using standard deviation
- Parameters:
X – Matrix X
k – threshold values 1, 2, 3 for 68%, 95%, 99.7% respectively (3-sigma rule)
repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values
max_iterations – values: 0 = arbitrary number of iterations until all outliers are removed, n = any constant defined by user
- Returns:
Matrix X with no outliers
- systemds.operator.algorithm.outlierBySdApply(X: Matrix, colMean: Matrix, colSD: Matrix, k: float, repairMethod: int)
Builtin function for detecting and repairing outliers using standard deviation
- Parameters:
X – Matrix X
colMean – Matrix X
k – a constant used to discern outliers k*IQR
isIterative – iterative repair or single repair
repairMethod – values: 0 = delete rows having outliers, 1 = replace outliers with zeros, 2 = replace outliers with missing values
max_iterations – values: 0 = arbitrary number of iterations until all outliers are removed, n = any constant defined by user
verbose – flag specifying if logging information should be printed
- Returns:
Matrix X with no outliers
- systemds.operator.algorithm.pca(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The function Principal Component Analysis (PCA) is used for dimensionality reduction
- Parameters:
X – Input feature matrix
K – Number of reduced dimensions (i.e., columns)
Center – Indicates whether or not to center the feature matrix
Scale – Indicates whether or not to scale the feature matrix
- Returns:
Output feature matrix with K columns
- Returns:
Output dominant eigen vectors (can be used for projections)
- Returns:
The column means of the input, subtracted to construct the PCA
- Returns:
The Scaling of the values, to make each dimension same size.
- systemds.operator.algorithm.pcaInverse(Y: Matrix, Clusters: Matrix, Centering: Matrix, ScaleFactor: Matrix)
Principal Component Analysis (PCA) for reconstruction of an approximation of the original data. This method reconstructs an approximation of the original matrix, and is useful for calculating how much information is lost in the PCA.
- Parameters:
Y – Input features that have PCA applied to them
Clusters – The previous PCA components computed
Centering – The column means of the PCA model, subtracted to construct the PCA
ScaleFactor – The scaling of each dimension in the PCA model
- Returns:
Output feature matrix reconstructing an approximation of the original matrix
- systemds.operator.algorithm.pcaTransform(X: Matrix, Clusters: Matrix, Centering: Matrix, ScaleFactor: Matrix)
Principal Component Analysis (PCA) for dimensionality reduction (prediction). This method is used to transform data that the PCA model was not trained on, for example to validate how good the PCA is or to apply it in production.
- Parameters:
X – Input feature matrix
Clusters – The previously computed principal components
Centering – The column means of the PCA model, subtracted to construct the PCA
ScaleFactor – The scaling of each dimension in the PCA model
- Returns:
Output feature matrix dimensionally reduced by PCA
- systemds.operator.algorithm.pnmf(X: Matrix, rnk: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The pnmf-function implements Poisson Non-negative Matrix Factorization (PNMF). Matrix X is factorized into two non-negative matrices, W and H based on Poisson probabilistic assumption. This non-negativity makes the resulting matrices easier to inspect.
[Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang: Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. WWW 2010: 681-690]
- Parameters:
X – Matrix of feature vectors.
rnk – Number of components into which matrix X is to be factored.
eps – Tolerance
maxi – Maximum number of conjugate gradient iterations.
verbose – If TRUE, ‘iter’ and ‘obj’ are printed.
- Returns:
List of pattern matrices, one for each repetition.
- Returns:
List of amplitude matrices, one for each repetition.
- systemds.operator.algorithm.ppca(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script performs Probabilistic Principal Component Analysis (PPCA) on the given input data. It is based on the paper: sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms. Tarek Elgamal et al.
- Parameters:
X – n x m input feature matrix
k – indicates dimension of the new vector space constructed from eigen vectors
maxi – maximum number of iterations until convergence
tolobj – objective function tolerance value to stop ppca algorithm
tolrecerr – reconstruction error tolerance value to stop the algorithm
verbose – verbose debug output
- Returns:
Output feature matrix with K columns
- Returns:
Output dominant eigen vectors (can be used for projections)
- systemds.operator.algorithm.randomForest(X: Matrix, Y: Matrix, R: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script implements a classification random forest with both scale and categorical features.
- Parameters:
X – Feature matrix X; note that X needs to be both recoded and dummy coded
Y – Label matrix Y; note that Y needs to be both recoded and dummy coded
R – Matrix which for each feature in X contains the following information: R[,1]: column ids (TODO pass recorded and binned); R[,2]: start indices; R[,3]: end indices. If R is not provided, by default all variables are assumed to be scale
bins – Number of equiheight bins per scale feature to choose thresholds
depth – Maximum depth of the learned tree
num_leaf – Number of samples when splitting stops and a leaf node is added
num_samples – Number of samples at which point we switch to in-memory subtree building
num_trees – Number of trees to be learned in the random forest model
subsamp_rate – Parameter controlling the size of each tree in the forest; samples are selected from a Poisson distribution with parameter subsamp_rate (the default value is 1.0)
feature_subset – Parameter that controls the number of features used as candidates for splitting at each tree node, as a power of the number of features in the dataset; by default the square root of the number of features (i.e., feature_subset = 0.5) is used at each tree node
impurity – Impurity measure: entropy or Gini (the default)
- Returns:
Matrix M containing the learned tree, where each column corresponds to a node in the learned tree and each row contains the following information:
M[1,j]: id of node j (in a complete binary tree)
M[2,j]: tree id to which node j belongs
M[3,j]: Offset (no. of columns) to left child of j
M[4,j]: Feature index of the feature that node j looks at if j is an internal node, otherwise 0
M[5,j]: Type of the feature that node j looks at if j is an internal node: 1 for scale and 2 for categorical features, otherwise the label that leaf node j is supposed to predict
M[6,j]: 1 if j is an internal node and the feature chosen for j is scale, otherwise the size of the subset of values stored in rows 7,8,… if j is categorical
M[7:,j]: Only applicable for internal nodes. Threshold the example’s feature value is compared to is stored at M[7,j] if the feature chosen for j is scale; if the feature chosen for j is categorical, rows 7,8,… depict the value subset chosen for j
- Returns:
Matrix C containing the number of times samples are chosen in each tree of the random forest
- Returns:
Mappings from scale feature ids to global feature ids
- Returns:
Mappings from categorical feature ids to global feature ids
- systemds.operator.algorithm.scale(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function scales and centers individual features in the input matrix (column-wise), using z-scores to scale the values.
- Parameters:
X – Input feature matrix
center – Indicates whether or not to center the feature matrix
scale – Indicates whether or not to scale the feature matrix
- Returns:
Output feature matrix with K columns
- Returns:
The column means of the input, subtracted if center was TRUE
- Returns:
The Scaling of the values, to make each dimension have similar value ranges
- systemds.operator.algorithm.scaleApply(X: Matrix, Centering: Matrix, ScaleFactor: Matrix)
This function scales and centers individual features in the input matrix (column-wise), using the input matrices.
- Parameters:
X – Input feature matrix
Centering – The column means to subtract from X (not done if empty)
ScaleFactor – The column scaling to multiply with X (not done if empty)
- Returns:
Output feature matrix with K columns
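The relationship between scale and scaleApply can be illustrated with a small plain-Python sketch. It is not the SystemDS implementation: the helper names fit_scale and apply_scale are hypothetical, matrices are lists of rows, and the sample standard deviation (divisor n - 1) is an assumption. The scale factor is stored as a reciprocal so that it is multiplied with X, as described for ScaleFactor above.

```python
import math

def fit_scale(X, center=True, scale=True):
    """Return (centering, scale_factor), one entry per column of X."""
    n, d = len(X), len(X[0])
    # Column means; zeros if centering is disabled.
    means = [sum(row[j] for row in X) / n if center else 0.0 for j in range(d)]
    factors = []
    for j in range(d):
        if not scale:
            factors.append(1.0)
            continue
        # Assumption: sample variance (n - 1) of the (optionally centered) column.
        var = sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)
        std = math.sqrt(var)
        # Constant columns are left unscaled to avoid dividing by zero.
        factors.append(1.0 / std if std > 0 else 1.0)
    return means, factors

def apply_scale(X, centering, scale_factor):
    """Subtract the column means, then multiply by the column scale factors."""
    return [[(v - c) * f for v, c, f in zip(row, centering, scale_factor)]
            for row in X]

X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centering, scale_factor = fit_scale(X_train)
X_scaled = apply_scale(X_train, centering, scale_factor)
# New data is transformed with the statistics learned on the training data.
X_new_scaled = apply_scale([[2.0, 40.0]], centering, scale_factor)
```

Fitting once and reusing the statistics is what keeps training and test data on the same scale.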
- systemds.operator.algorithm.scaleMinMax(X: Matrix)
This function performs min-max normalization (rescaling to [0,1]).
- Parameters:
X – Input feature matrix
- Returns:
Scaled output matrix
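As a rough illustration of min-max normalization, the following plain-Python sketch rescales each column to [0,1]. The function name scale_min_max is hypothetical, and mapping constant columns to 0.0 is an assumption made here only to avoid a division by zero.

```python
def scale_min_max(X):
    """Rescale every column of a list-of-rows matrix to the range [0, 1]."""
    d = len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    return [
        # Assumption: a constant column (hi == lo) is mapped to 0.0.
        [(row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
         for j in range(d)]
        for row in X
    ]

X = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
X_mm = scale_min_max(X)
```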
- systemds.operator.algorithm.selectByVarThresh(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function drops features with variance <= thresh (by default, constant features are dropped).
- Parameters:
X – Matrix of feature vectors.
thresh – The variance threshold; features with variance <= thresh are dropped
- Returns:
Matrix of feature vectors with > thresh variance.
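A minimal plain-Python sketch of variance-based feature selection is shown below. The name select_by_var_thresh and the use of the sample variance (divisor n - 1) are assumptions for illustration; the sketch also returns the kept column indices, which is not part of the documented output.

```python
def select_by_var_thresh(X, thresh=0.0):
    """Keep only the columns whose variance is strictly greater than thresh."""
    n, d = len(X), len(X[0])
    keep = []
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / n
        # Assumption: sample variance with divisor n - 1.
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        if var > thresh:
            keep.append(j)
    return [[row[j] for j in keep] for row in X], keep

X = [[1.0, 5.0, 0.0], [2.0, 5.0, 0.0], [3.0, 5.0, 1.0]]
X_sel, kept = select_by_var_thresh(X)  # default thresh drops constant columns
```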
- systemds.operator.algorithm.setdiff(X: Matrix, Y: Matrix)
Builtin function that implements difference operation on vectors
- Parameters:
X – input vector
Y – input vector
- Returns:
vector with all elements that are present in X but not in Y
- systemds.operator.algorithm.sherlock(X_train: Matrix, y_train: Matrix)
This function implements the training phase of Sherlock: A Deep Learning Approach to Semantic Data Type Detection
[Hulsebos, Madelon, et al. “Sherlock: A deep learning approach to semantic data type detection.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.]
It splits the feature matrix into four different feature categories and trains neural networks on the respective single features. It then combines all trained features to train the final neural network.
- Parameters:
X_train – matrix of feature vectors
y_train – matrix Y of class labels of semantic data type
- Returns:
weights (parameters) matrices for character distributions
- Returns:
biases vectors for character distributions
- Returns:
weights (parameters) matrices for word embeddings
- Returns:
biases vectors for word embeddings
- Returns:
weights (parameters) matrices for paragraph vectors
- Returns:
biases vectors for paragraph vectors
- Returns:
weights (parameters) matrices for global statistics
- Returns:
biases vectors for global statistics
- Returns:
weights (parameters) matrices for combining all trained features (final)
- Returns:
biases vectors for combining all trained features (final)
- systemds.operator.algorithm.sherlockPredict(X: Matrix, cW1: Matrix, cb1: Matrix, cW2: Matrix, cb2: Matrix, cW3: Matrix, cb3: Matrix, wW1: Matrix, wb1: Matrix, wW2: Matrix, wb2: Matrix, wW3: Matrix, wb3: Matrix, pW1: Matrix, pb1: Matrix, pW2: Matrix, pb2: Matrix, pW3: Matrix, pb3: Matrix, sW1: Matrix, sb1: Matrix, sW2: Matrix, sb2: Matrix, sW3: Matrix, sb3: Matrix, fW1: Matrix, fb1: Matrix, fW2: Matrix, fb2: Matrix, fW3: Matrix, fb3: Matrix)
This function implements the prediction and evaluation phase of Sherlock: A Deep Learning Approach to Semantic Data Type Detection. [Hulsebos, Madelon, et al. “Sherlock: A deep learning approach to semantic data type detection.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.] It splits the feature matrix into four different feature categories and predicts the class probability on the respective features. It then combines all predictions into the final predicted probabilities.
- Parameters:
X – matrix of values which are to be classified
cW – weights (parameters) matrices for character distributions
cb – biases vectors for character distributions
wW – weights (parameters) matrices for word embeddings
wb – biases vectors for word embeddings
pW – weights (parameters) matrices for paragraph vectors
pb – biases vectors for paragraph vectors
sW – weights (parameters) matrices for global statistics
sb – biases vectors for global statistics
fW – weights (parameters) matrices for combining all trained features (final)
fb – biases vectors for combining all trained features (final)
- Returns:
class probabilities of shape (N, K)
- systemds.operator.algorithm.shortestPath(G: Matrix, sourceNode: int, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Computes the minimum distances (shortest-path) between a single source vertex and every other vertex in the graph.
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser and Grzegorz Czajkowski: Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010
- Parameters:
G – adjacency matrix of the labeled graph: The graph can be undirected (G is symmetric) or directed (G is not symmetric). The values of G can be 0/1 (just specifying whether the nodes are connected or not) or integer values (representing the weight of the edges or the distances between nodes, 0 if not connected).
maxi – Integer max number of iterations accepted (0 for FALSE, i.e. max number of iterations not defined)
sourceNode – node index to calculate the shortest paths to all other nodes.
verbose – flag for verbose debug output
- Returns:
Output matrix (double) of minimum distances (shortest-path) between vertices: The value of the ith row and the jth column of the output matrix is the minimum (shortest-path) distance from vertex i to vertex j. When the value of the minimum distance is infinity, the two nodes are not connected.
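The single-source computation can be sketched in plain Python as repeated edge relaxation over the adjacency matrix. This is only an illustration of the idea, not the SystemDS script: the name shortest_path is hypothetical, node indices are 0-based here, edge weights are assumed to be positive, and max_iter = 0 is taken to mean "no iteration limit", mirroring the maxi description above.

```python
import math

def shortest_path(G, source, max_iter=0):
    """Distances from `source` to every node; G[i][j] == 0 means no edge."""
    n = len(G)
    dist = [math.inf] * n
    dist[source] = 0.0
    iteration = 0
    changed = True
    # Relax all edges until nothing improves (or the iteration limit is hit).
    while changed and (max_iter == 0 or iteration < max_iter):
        changed = False
        new_dist = list(dist)
        for i in range(n):
            if dist[i] == math.inf:
                continue  # node i has not been reached yet
            for j in range(n):
                w = G[i][j]
                if w != 0 and dist[i] + w < new_dist[j]:
                    new_dist[j] = dist[i] + w
                    changed = True
        dist = new_dist
        iteration += 1
    return dist

# Directed example: 0 -> 1 (weight 4), 0 -> 2 (weight 1), 2 -> 1 (weight 2).
G = [
    [0, 4, 1, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 0, 0],
]
distances = shortest_path(G, source=0)
```

Node 3 has no incoming edge, so its distance stays at infinity, which matches the "not connected" convention described above.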
- systemds.operator.algorithm.sigmoid(X: Matrix)
The sigmoid function is a type of activation function, also described as a squashing function, which limits the output to a range between 0 and 1. This makes it useful for predicting probabilities.
- Parameters:
X – Matrix of feature vectors.
- Returns:
1-column matrix of weights.
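For reference, the element-wise sigmoid 1 / (1 + exp(-x)) can be sketched in plain Python as follows. The function name sigmoid_matrix is hypothetical; the two-branch form is a common way to avoid overflow for large negative inputs.

```python
import math

def _sigmoid(v):
    # Numerically stable evaluation of 1 / (1 + exp(-v)).
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def sigmoid_matrix(X):
    """Apply the sigmoid to every cell of a list-of-rows matrix."""
    return [[_sigmoid(v) for v in row] for row in X]

S = sigmoid_matrix([[0.0, 2.0, -2.0], [-1000.0, 1000.0, 0.5]])
```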
- systemds.operator.algorithm.slicefinder(X: Matrix, e: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This builtin function implements SliceLine, a linear-algebra-based ML model debugging technique for finding the top-k data slices where a trained model performs significantly worse than on the overall dataset. For a detailed description and experimental results, see: Svetlana Sagadeeva, Matthias Boehm: SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging (SIGMOD 2021).
- Parameters:
X – Recoded dataset into Matrix
e – Trained model
k – Number of subsets required
maxL – maximum level L (conjunctions of L predicates), 0 unlimited
minSup – minimum support (min number of rows per slice)
alpha – weight [0,1]: 0 only size, 1 only error
tpEval – flag for task-parallel slice evaluation, otherwise data-parallel
tpBlksz – block size for task-parallel execution (num slices)
selFeat – flag for removing one-hot-encoded features that don’t satisfy the initial minimum-support constraint and/or have zero error
verbose – flag for verbose debug output
- Returns:
top-k slices (k x ncol(X) if successful)
- Returns:
score, size, error of slices (k x 3)
- Returns:
debug matrix, populated with enumeration stats if verbose
- systemds.operator.algorithm.smote(X: Matrix, mask: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
Builtin function for handling class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE) by Nitesh V. Chawla et al., Journal of Artificial Intelligence Research 16 (2002), 321–357.
- Parameters:
X – Matrix of minority class samples
mask – 0/1 mask vector where 0 represents a numeric value and 1 represents a categorical value
s – Amount of SMOTE (percentage of oversampling), integral multiple of 100
k – Number of nearest neighbors
verbose – if the algorithm should be verbose
- Returns:
Matrix of (N/100)-1 * nrow(X) synthetic minority class samples
- systemds.operator.algorithm.softmax(S: Matrix)
Performs softmax on the given input matrix.
- Parameters:
S – Inputs of shape (N, D).
- Returns:
Outputs of shape (N, D).
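A row-wise softmax over an (N, D) input can be sketched in plain Python as below. The function name softmax_rows is hypothetical; subtracting the row maximum before exponentiating is a standard stabilization and does not change the result.

```python
import math

def softmax_rows(S):
    """Row-wise softmax for a list-of-rows matrix of shape (N, D)."""
    out = []
    for row in S:
        m = max(row)                            # shift for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])   # each row sums to 1
    return out

probs = softmax_rows([[1.0, 1.0, 1.0], [0.0, 1000.0, 0.0]])
```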
- systemds.operator.algorithm.split(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function splits the input data X and Y into contiguous or sampled train/test sets.
- Parameters:
X – Input feature matrix
Y – Input Labels
f – Train set fraction [0,1]
cont – contiguous splits, otherwise sampled
seed – The seed to randomly select rows in sampled mode
- Returns:
Train split of feature matrix
- Returns:
Test split of feature matrix
- Returns:
Train split of label matrix
- Returns:
Test split of label matrix
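The two split modes can be illustrated with the following plain-Python sketch. It is not the SystemDS implementation: the name split_rows, the default values, treating seed = -1 as "unseeded", and sampling by shuffling row indices are all assumptions made for illustration. The return order follows the list above.

```python
import random

def split_rows(X, Y, f=0.7, cont=True, seed=-1):
    """Return (X_train, X_test, Y_train, Y_test) for list-of-rows inputs."""
    n = len(X)
    n_train = round(f * n)
    idx = list(range(n))
    if not cont:
        # Sampled mode (assumption): shuffle the row indices, then cut.
        rng = random.Random(None if seed == -1 else seed)
        rng.shuffle(idx)
    train, test = idx[:n_train], idx[n_train:]
    return ([X[i] for i in train], [X[i] for i in test],
            [Y[i] for i in train], [Y[i] for i in test])

X = [[float(i), float(i * i)] for i in range(10)]
Y = [[float(i % 2)] for i in range(10)]
X_train, X_test, Y_train, Y_test = split_rows(X, Y, f=0.7, cont=True)
Xs_train, Xs_test, Ys_train, Ys_test = split_rows(X, Y, f=0.7, cont=False, seed=42)
```

In both modes the same row indices are used for X and Y, so features and labels stay aligned.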
- systemds.operator.algorithm.splitBalanced(X: Matrix, Y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function splits the input data X and Y into contiguous train/test sets with a balanced class ratio. Related to [SYSTEMDS-2902]; dependency function for cleaning pipelines.
- Parameters:
X – Input feature matrix
Y – Input Labels
f – Train set fraction [0,1]
verbose – print available
- Returns:
Train split of feature matrix
- Returns:
Test split of feature matrix
- Returns:
Train split of label matrix
- Returns:
Test split of label matrix
- systemds.operator.algorithm.stableMarriage(P: Matrix, A: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This script computes a solution for the stable marriage problem.
result description:
If cell [i,j] is non-zero, it means that acceptor i has matched with proposer j. Further, if cell [i,j] is non-zero, it holds the preference value that led to the match.
Proposers.mtx:
2.0,1.0,3.0
1.0,2.0,3.0
1.0,3.0,2.0
Since ordered=TRUE, this means that proposer 1 (row 1) likes acceptor 2 the most, followed by acceptor 1 and acceptor 3. If ordered=FALSE, this would mean that proposer 1 (row 1) likes acceptor 3 the most (since the value at [1,3] is the row max), followed by acceptor 1 (2.0 preference value) and acceptor 2 (1.0 preference value).
Acceptors.mtx:
3.0,1.0,2.0
2.0,1.0,3.0
3.0,2.0,1.0
Since ordered=TRUE, this means that acceptor 1 (row 1) likes proposer 3 the most, followed by proposer 1 and proposer 2. If ordered=FALSE, this would mean that acceptor 1 (row 1) likes proposer 1 the most (since the value at [1,1] is the row max), followed by proposer 3 (2.0 preference value) and proposer 2 (1.0 preference value).
Output.mtx (assuming ordered=TRUE):
0.0,0.0,3.0
0.0,3.0,0.0
1.0,0.0,0.0
Acceptor 1 has matched with proposer 3 (since [1,3] is non-zero) at a preference level of 3.0. Acceptor 2 has matched with proposer 2 (since [2,2] is non-zero) at a preference level of 3.0. Acceptor 3 has matched with proposer 1 (since [3,1] is non-zero) at a preference level of 1.0.
- Parameters:
P – proposer matrix P. It must be a square matrix with no zeros.
A – acceptor matrix A. It must be a square matrix with no zeros.
ordered – If true, P and A are assumed to be ordered, i.e. the leftmost value in a row is the most preferred partner’s index. If false, the leftmost value in a row of P is the preference value for the acceptor with index 1, and vice versa (higher is better).
verbose – if the algorithm should print verbosely
- Returns:
Result Matrix
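The matching in the example above can be reproduced with a short plain-Python sketch of the classic proposer-driven deferred-acceptance (Gale-Shapley) procedure for the ordered=TRUE case. This is an illustration only and may differ from how the SystemDS script works internally; the name stable_marriage_ordered is hypothetical, and the sketch returns just the matched proposer per acceptor rather than the preference-valued result matrix.

```python
def stable_marriage_ordered(P, A):
    """P[i] and A[j] list 1-based partner indices, most preferred first.

    Returns a list m where m[a] is the 0-based proposer matched to acceptor a.
    """
    n = len(P)
    # rank[a][p]: position of proposer p in acceptor a's list (lower is better).
    rank = [{int(p) - 1: pos for pos, p in enumerate(row)} for row in A]
    next_choice = [0] * n          # next list position each proposer will try
    match_of_acceptor = [None] * n
    free = list(range(n))
    while free:
        p = free.pop()
        a = int(P[p][next_choice[p]]) - 1
        next_choice[p] += 1
        current = match_of_acceptor[a]
        if current is None:
            match_of_acceptor[a] = p           # acceptor was unmatched
        elif rank[a][p] < rank[a][current]:
            match_of_acceptor[a] = p           # acceptor prefers the newcomer
            free.append(current)
        else:
            free.append(p)                     # proposal rejected
    return match_of_acceptor

P = [[2.0, 1.0, 3.0], [1.0, 2.0, 3.0], [1.0, 3.0, 2.0]]   # Proposers.mtx
A = [[3.0, 1.0, 2.0], [2.0, 1.0, 3.0], [3.0, 2.0, 1.0]]   # Acceptors.mtx
matches = stable_marriage_ordered(P, A)
```

With 0-based indices, matches pairs acceptor 1 with proposer 3, acceptor 2 with proposer 2, and acceptor 3 with proposer 1, which corresponds to the non-zero cells of Output.mtx.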
- systemds.operator.algorithm.statsNA(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The statsNA-function prints summary stats about the distribution of missing values in a univariate time series.
- result matrix contains the following:
1. Length of time series (including NAs)
2. Number of Missing Values (NAs)
3. Percentage of Missing Values (#2/#1)
4. Number of Gaps (consisting of one or more consecutive NAs)
5. Average Gap Size - Average size of consecutive NAs for the NA gaps
6. Longest NA gap - Longest series of consecutive missing values
7. Most frequent gap size - Most frequently occurring gap size
8. Gap size accounting for most NAs
- Parameters:
X – Numeric Vector (‘vector’) object containing NAs
bins – Split number for bin stats. Number of bins the time series gets divided into. For each bin information about amount/percentage of missing values is printed.
verbose – Print detailed information. For print_only = TRUE, the missing value stats are printed with more information (“Stats for Bins” and “overview NA series”).
- Returns:
Column vector where each row corresponds to one of the described values
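The eight summary values can be sketched in plain Python as follows. This is an illustration under assumptions, not the SystemDS implementation: the name stats_na is hypothetical, missing values are represented as float NaN, the "percentage" is returned as the fraction #2/#1, and ties are broken in favor of the larger gap size.

```python
import math
from collections import Counter

def stats_na(x):
    """Summary statistics about NaN gaps in a univariate series."""
    n = len(x)
    gaps, run = [], 0
    for v in x:
        if isinstance(v, float) and math.isnan(v):
            run += 1                      # extend the current gap
        elif run:
            gaps.append(run)              # a gap just ended
            run = 0
    if run:
        gaps.append(run)                  # series ends inside a gap
    num_na = sum(gaps)
    counts = Counter(gaps)
    # Assumption: ties are resolved toward the larger gap size.
    most_frequent = max(counts, key=lambda s: (counts[s], s)) if gaps else 0
    most_nas = max(counts, key=lambda s: (s * counts[s], s)) if gaps else 0
    return [
        n,                                       # 1. length of the series
        num_na,                                  # 2. number of missing values
        num_na / n,                              # 3. fraction of missing values
        len(gaps),                               # 4. number of gaps
        num_na / len(gaps) if gaps else 0.0,     # 5. average gap size
        max(gaps, default=0),                    # 6. longest gap
        most_frequent,                           # 7. most frequent gap size
        most_nas,                                # 8. gap size with most NAs
    ]

nan = float("nan")
series = [1.0, nan, 3.0, nan, nan, 6.0, nan, 8.0, nan, nan, nan, 12.0]
stats = stats_na(series)
```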
- systemds.operator.algorithm.steplm(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The steplm-function (stepwise linear regression) implements a classical forward feature selection method. This method iteratively runs what-if scenarios and greedily selects the next best feature until the Akaike information criterion (AIC) does not improve anymore. Each configuration trains a regression model via lm, which in turn calls either the closed form lmDS or iterative lmGC.
return: Matrix of regression parameters (the betas); its size depends on the icpt input value:
icpt=0: output size ncol(X) x 1; contents: betas for X only; to predict Y from X and B: Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
icpt=1: output size ncol(X)+1 x 1; contents: betas for X and intercept; to predict Y from X and B: Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
icpt=2: output size ncol(X)+1 x 2; contents: Col.1: betas for X & intercept, Col.2: betas for shifted/rescaled X and intercept; to predict Y from X and B: Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
In addition, in the last run of linear regression some statistics are provided in CSV format, one comma-separated name-value pair per each line, as follows:
- Parameters:
X – Matrix X of feature vectors
y – 1-column matrix y of response values
icpt – Intercept presence, shifting and rescaling the columns of X: 0 = no intercept, no shifting, no rescaling; 1 = add intercept, but neither shift nor rescale X; 2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
reg – Regularization constant (lambda) for L2-regularization
tol – Tolerance threshold to train until achieved
maxi – maximum iterations 0 means until tolerance is reached
verbose – If the algorithm should be verbose
- Returns:
Matrix of regression parameters (the betas) and its size depend on icpt input value.
- Returns:
Matrix of selected features ordered as computed by the algorithm.
- systemds.operator.algorithm.stratstats(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The stratstats.dml script computes common bivariate statistics, such as correlation, slope, and their p-value, in parallel for many pairs of input variables in the presence of a confounding categorical variable.
Output contains: (1st covariate, 2nd covariate) 40 columns containing the following information:
Col 01: 1st covariate X-column number
Col 02: 1st covariate global presence count
Col 03: 1st covariate global mean
Col 04: 1st covariate global standard deviation
Col 05: 1st covariate stratified standard deviation
Col 06: R-squared, 1st covariate vs. strata
Col 07: adjusted R-squared, 1st covariate vs. strata
Col 08: P-value, 1st covariate vs. strata
Col 09-10: Reserved
Col 11: 2nd covariate Y-column number
Col 12: 2nd covariate global presence count
Col 13: 2nd covariate global mean
Col 14: 2nd covariate global standard deviation
Col 15: 2nd covariate stratified standard deviation
Col 16: R-squared, 2nd covariate vs. strata
Col 17: adjusted R-squared, 2nd covariate vs. strata
Col 18: P-value, 2nd covariate vs. strata
Col 19-20: Reserved
Col 21: Global 1st & 2nd covariate presence count
Col 22: Global regression slope (2nd vs. 1st covariate)
Col 23: Global regression slope standard deviation
Col 24: Global correlation = +/- sqrt(R-squared)
Col 25: Global residual standard deviation
Col 26: Global R-squared
Col 27: Global adjusted R-squared
Col 28: Global P-value for hypothesis “slope = 0”
Col 29-30: Reserved
Col 31: Stratified 1st & 2nd covariate presence count
Col 32: Stratified regression slope (2nd vs. 1st covariate)
Col 33: Stratified regression slope standard deviation
Col 34: Stratified correlation = +/- sqrt(R-squared)
Col 35: Stratified residual standard deviation
Col 36: Stratified R-squared
Col 37: Stratified adjusted R-squared
Col 38: Stratified P-value for hypothesis “slope = 0”
Col 39: Number of strata with at least two counted points
Col 40: Reserved
- Parameters:
X – Matrix X that has all 1st covariates
Y – Matrix Y that has all 2nd covariates; the default value empty means “use X in place of Y”
S – Matrix S that has the stratum column; the default value empty means “use X in place of S”
Xcid – 1st covariate X-column indices; the default value empty means “use columns 1 : ncol(X)”
Ycid – 2nd covariate Y-column indices; the default value empty means “use columns 1 : ncol(Y)”
Scid – Column index of the stratum column in S
- Returns:
Output matrix, one row per each distinct pair
- systemds.operator.algorithm.symmetricDifference(X: Matrix, Y: Matrix)
Builtin function that implements symmetric difference set-operation on vectors
- Parameters:
X – input vector
Y – input vector
- Returns:
vector with all elements in X and Y but not in both
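The semantics of the symmetric difference (and of setdiff, on which it can be built) can be sketched in plain Python. The helper names are hypothetical, and returning sorted, de-duplicated values is an assumption about the output layout.

```python
def setdiff(x, y):
    """Elements that are present in x but not in y."""
    y_set = set(y)
    return sorted({v for v in x if v not in y_set})

def symmetric_difference(x, y):
    """Elements that are in x or in y, but not in both."""
    return sorted(set(setdiff(x, y)) | set(setdiff(y, x)))

x = [1, 2, 3, 4]
y = [3, 4, 5]
only_in_x = setdiff(x, y)
sym_diff = symmetric_difference(x, y)
```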
- systemds.operator.algorithm.tSNE(X: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function performs dimensionality reduction using the t-SNE algorithm, based on the paper: Visualizing Data using t-SNE, Maaten et al.
- Parameters:
X – Data Matrix of shape (number of data points, input dimensionality)
reduced_dims – Output dimensionality
perplexity – Perplexity Parameter
lr – Learning rate
momentum – Momentum Parameter
max_iter – Number of iterations
seed – The seed used for initial values. If set to -1 random seeds are selected.
is_verbose – Print debug information
- Returns:
Data Matrix of shape (number of data points, reduced_dims)
- systemds.operator.algorithm.toOneHot(X: Matrix, numClasses: int)
The toOneHot-function encodes an unordered categorical vector as multiple binary vectors.
- Parameters:
X – Vector with N integer entries between 1 and numClasses
numClasses – Number of columns, must be greater than or equal to the largest value in X
- Returns:
One-hot-encoded matrix with shape (N, numClasses)
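As a small illustration of the encoding, the following plain-Python sketch maps each 1-based class value to a row with a single 1. The name to_one_hot is hypothetical, and raising an error for out-of-range values is a choice made for this sketch.

```python
def to_one_hot(x, num_classes):
    """Encode a vector of 1-based class ids as an (N, num_classes) matrix."""
    out = []
    for v in x:
        k = int(v)
        if not 1 <= k <= num_classes:
            raise ValueError(f"class {k} is outside 1..{num_classes}")
        row = [0.0] * num_classes
        row[k - 1] = 1.0          # class k maps to column k (1-based)
        out.append(row)
    return out

encoded = to_one_hot([1, 3, 2, 3], num_classes=4)
```

Because num_classes is larger than the largest value here, the last column stays all zeros.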
- systemds.operator.algorithm.tomeklink(X: Matrix, y: Matrix)
The tomekLink-function performs under-sampling by removing Tomek’s links for imbalanced multi-class problems. It computes Tomek links and drops them from the data matrix and label vector. Only the majority label and the corresponding point of each Tomek link are dropped.
- Parameters:
X – Data Matrix (nxm)
y – Label Matrix (nx1), greater than zero
- Returns:
Data Matrix without Tomek links
- Returns:
Labels corresponding to under sampled data
- Returns:
Indices of dropped rows/labels wrt input
- systemds.operator.algorithm.topk_cleaning(dataTrain: Frame, primitives: Frame, parameters: Frame, evaluationFunc: str, evalFunHp: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
This function cleans top-K items (where K is given as input) for a given list of users. metaData[3, ncol(X)]: metaData[1] stores mask, metaData[2] stores schema, metaData[3] stores FD mask
- systemds.operator.algorithm.underSampling(X: Matrix, Y: Matrix, ratio: float)
Builtin to perform random under sampling on data.
- Parameters:
X – X data to sample from
Y – Y data to sample from; the same rows are sampled as from X.
ratio – The ratio to sample
- Returns:
The under-sampled data X
- Returns:
The under-sampled data Y
- systemds.operator.algorithm.union(X: Matrix, Y: Matrix)
Builtin function that implements union operation on vectors
- Parameters:
X – input vector
Y – input vector
- Returns:
matrix with all unique rows existing in X and Y
- systemds.operator.algorithm.univar(X: Matrix, types: Matrix)
Computes univariate statistics for all attributes in a given data set
- Parameters:
X – Input matrix of the shape (N, D)
types – Matrix of the shape (1, D) with feature types: 1 for scale, 2 for nominal, 3 for ordinal
- Returns:
univariate statistics for all attributes
- systemds.operator.algorithm.vectorToCsv(mask: Matrix)
This builtin function converts a vector into a CSV string of the non-zero indexes, e.g. [1 0 0 1 1 0 1] = “1,4,5,7”. Related to [SYSTEMDS-2662]; dependency function for cleaning pipelines.
- Parameters:
mask – Data vector (having 0 for excluded indexes)
- Returns:
indexes
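The conversion in the example above can be sketched in one line of plain Python. The name vector_to_csv is hypothetical; indexes are 1-based, as in the example.

```python
def vector_to_csv(mask):
    """Join the 1-based positions of all non-zero entries with commas."""
    return ",".join(str(i) for i, v in enumerate(mask, start=1) if v != 0)

csv_indexes = vector_to_csv([1, 0, 0, 1, 1, 0, 1])
```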
- systemds.operator.algorithm.winsorize(X: Matrix, verbose: bool, **kwargs: Dict[str, DAGNode | str | int | float | bool])
The winsorize-function removes outliers from the data. It does so by computing the upper and lower quartile range of the given data; it then replaces any value that falls outside this range (less than the lower quartile range or more than the upper quartile range).
- Parameters:
X – Input feature matrix
verbose – To print output on screen
- Returns:
Matrix without outlier values
- systemds.operator.algorithm.winsorizeApply(X: Matrix, qLower: Matrix, qUpper: Matrix)
winsorizeApply takes the upper and lower quantile values per column, and removes outliers by replacing them with these upper and lower bound values.
- Parameters:
X – Input feature matrix
qLower – row vector of lower bounds per column
qUpper – row vector of upper bounds per column
- Returns:
Matrix without outlier values
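The replacement step described for winsorizeApply amounts to clamping every value to its column's bounds. A minimal plain-Python sketch, with the hypothetical name winsorize_apply and lists of rows as matrices:

```python
def winsorize_apply(X, q_lower, q_upper):
    """Clamp each value to the [lower, upper] bounds of its column."""
    return [
        [min(max(v, lo), hi) for v, lo, hi in zip(row, q_lower, q_upper)]
        for row in X
    ]

X = [[1.0, 100.0], [5.0, 50.0], [9.0, -20.0]]
X_w = winsorize_apply(X, q_lower=[2.0, 0.0], q_upper=[8.0, 60.0])
```

Values already inside the bounds are left unchanged; only the outliers are moved to the nearest bound.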
- systemds.operator.algorithm.xdummy1(X: Matrix)
This builtin function is here for debugging purposes
- Parameters:
X – test input
- Returns:
test result
- systemds.operator.algorithm.xdummy2(X: Matrix)
This builtin function is here for debugging purposes
- Parameters:
X – Debug input
- Returns:
—
- Returns:
—
- systemds.operator.algorithm.xgboost(X: Matrix, y: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. This xgboost implementation supports classification and regression and is capable of working with categorical and scalar features.
Output explained: (the first node is the init prediction) and each row contains the following information:
M[1,j]: id of node j (in a complete binary tree)
M[2,j]: tree id to which node j belongs
M[3,j]: Offset (no. of columns) to left child of j if j is an internal node, otherwise 0
M[4,j]: Feature index of the feature (scale feature id if the feature is scale or categorical feature id if the feature is categorical) that node j looks at if j is an internal node, otherwise 0
M[5,j]: Type of the feature that node j looks at if j is an internal node. if leaf = 0, if scalar = 1, if categorical = 2
M[6:,j]: If j is an internal node: the threshold the example’s feature value is compared to is stored at M[6,j] if the feature chosen for j is scale; otherwise, if the feature chosen for j is categorical, rows 6,7,… depict the value subset chosen for j. If j is a leaf node: 1 if j is impure and the number of samples at j > threshold, otherwise 0
- Parameters:
X – Feature matrix X; note that X needs to be both recoded and dummy coded
y – Label matrix y; note that y needs to be both recoded and dummy coded
R – Matrix R; 1xn vector which, for each feature in X, contains its type: 1 for a scalar feature, 2 for a categorical feature. For example, R[,1]: 1 and R[,2]: 2 mean that feature 1 is a scalar feature and feature 2 is a categorical feature. If R is not provided, all variables are assumed to be scale (1) by default
sml_type – Supervised machine learning type: 1 = Regression(default), 2 = Classification
num_trees – Number of trees to be created in the xgboost model
learning_rate – Alias: eta. After each boosting step the learning rate controls the weights of the new predictions
max_depth – Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit
lambda – L2 regularization term on weights. Increasing this value will make the model more conservative and reduce the number of leaves of a tree
- Returns:
Matrix M where each column corresponds to a node in the learned tree
- systemds.operator.algorithm.xgboostPredictClassification(X: Matrix, M: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. This xgboost implementation supports classification and is capable of working with categorical features.
- Parameters:
X – Matrix of feature vectors we want to predict (X_test)
M – The model created by xgboost
learning_rate – The learning rate used in the model
- Returns:
The predictions of the samples using the given xgboost model. (y_prediction)
- systemds.operator.algorithm.xgboostPredictRegression(X: Matrix, M: Matrix, **kwargs: Dict[str, DAGNode | str | int | float | bool])
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses gradient boosting. This xgboost implementation supports regression.
- Parameters:
X – Matrix of feature vectors we want to predict (X_test)
M – The model created by xgboost
learning_rate – The learning rate used in the model
- Returns:
The predictions of the samples using the given xgboost model. (y_prediction)