1.0 | 2.0 | 3.0 |
0 | 0 | 0 |
7.0 | 8.0 | 9.0 |
0 | 0 | 0 |
| | OMIT | MVI | RCD | BIN | DCD | SCL |
---|---|---|---|---|---|---|
OMIT | - | x | * | * | * | * |
MVI | x | - | * | * | * | * |
RCD | * | * | - | x | * | x |
BIN | * | * | x | - | * | x |
DCD | * | * | * | * | - | x |
SCL | * | * | x | x | x | - |
Key | Meaning |
---|---|
OMIT | Missing value handling by omitting |
MVI | Missing value handling by imputation |
RCD | Recoding |
BIN | Binning |
DCD | Dummycoding |
SCL | Scaling |
Key | Meaning |
---|---|
* | Combination is allowed |
x | Combination is invalid |
- | Combination is not applicable |
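
As a reading aid, the compatibility table above can be transcribed into a small lookup structure. The Python sketch below is only an illustration of how to read the table; the names `KEYS`, `ROWS`, `combination_status`, and `is_allowed` are hypothetical and not part of DML.

```python
# Compatibility matrix transcribed row by row from the table above.
# "*" = allowed, "x" = invalid, "-" = not applicable.
KEYS = ["OMIT", "MVI", "RCD", "BIN", "DCD", "SCL"]
ROWS = {
    "OMIT": "-x****",
    "MVI":  "x-****",
    "RCD":  "**-x*x",
    "BIN":  "**x-*x",
    "DCD":  "****-x",
    "SCL":  "**xxx-",
}

def combination_status(a, b):
    """Return "*", "x", or "-" for combining transformations a and b."""
    return ROWS[a][KEYS.index(b)]

def is_allowed(a, b):
    """True only if the table marks the combination as allowed."""
    return combination_status(a, b) == "*"

print(is_allowed("RCD", "DCD"))   # recoding with dummy coding
print(is_allowed("OMIT", "MVI"))  # omitting with imputation
```

The table is symmetric, so the lookup gives the same answer regardless of argument order.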
To create a `frame`, specify `data_type="frame"` when reading data from a file. Input formats csv, text, and binary are supported.
A = read("fileA", data_type="frame", rows=10, cols=8);
B = read("dataB", data_type="frame", rows=3, cols=3, format="csv");
A schema can be specified when creating a `frame`, where the schema is a string containing one value type per column. The supported value types for a schema are `string`, `double`, `int`, and `boolean`. Note that `schema=""` resolves to a string schema, and if no schema is specified, the default is `""`.
This example creates a frame with `schema="string,double,int,boolean"` since the data has four columns (one of each supported value type).
tableSchema = "string,double,int,boolean";
C = read("tableC", data_type="frame", schema=tableSchema, rows=1600, cols=4, format="csv");
*Note: the header line in frame CSV files is sensitive to white spaces.* For example, CSV1 with header `ID,FirstName,LastName` results in three columns with tokens between separators. In contrast, CSV2 with header `ID, FirstName,LastName` also results in three columns, but the second column has a space preceding `FirstName`. This extra space is significant when referencing the second column by name in transform specifications as described in [Transforming Frames](dml-language-reference.html#transforming-frames).
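
The effect of the extra space can be illustrated with a short Python sketch. It assumes the header is split on the separator without trimming, which is consistent with the behavior described above but is not SystemML's actual parser; `split_header` is a hypothetical helper.

```python
def split_header(line):
    # Split on the separator only; tokens are not trimmed, so spaces survive.
    return line.split(",")

csv1 = split_header("ID,FirstName,LastName")
csv2 = split_header("ID, FirstName,LastName")

print(csv1)  # ['ID', 'FirstName', 'LastName']
print(csv2)  # ['ID', ' FirstName', 'LastName']

# A lookup by the untrimmed name fails for CSV2 unless the space is included.
print("FirstName" in csv2)   # False
print(" FirstName" in csv2)  # True
```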
`cbind()` and `rbind()` are supported for frames to add columns or rows to an existing frame.
**Table F1**: Frame Append Built-In Functions
Function | Description | Parameters | Example
-------- | ----------- | ---------- | -------
cbind() | Column-wise frame concatenation. Concatenates the second frame as additional columns to the first frame. | Input: (X <frame>, Y <frame>) |

Frames can be cast to and from matrices and scalars using `as.frame()`, `as.matrix()`, and `as.scalar()`. Casting a frame to a matrix is a best-effort operation that tries to parse doubles. If there are strings that cannot be parsed, the `as.matrix()` operation produces errors. For example, a `java.lang.NumberFormatException` may occur for invalid data since Java's `Double.parseDouble()` is used internally for parsing.
**Table F2**: Casting Built-In Functions
Function | Description | Parameters | Example
-------- | ----------- | ---------- | -------
as.frame(<matrix>) | Matrix is cast to frame. | Input: (<matrix>) |

*Note: `as.frame(matrix)` produces a double schema, and `as.scalar(frame)` produces a scalar of the value type given by the frame schema.*
### Transforming Frames
Frames support additional [Data Pre-Processing Built-In Functions](dml-language-reference.html#data-pre-processing-built-in-functions) as shown below.
Function | Description | Parameters | Example
-------- | ----------- | ---------- | -------
transformencode() | Transforms a frame into a matrix using specification. |

The following table summarizes the transformations supported by `transformencode()`, `transformdecode()`, and `transformapply()`. Note that only recoding, dummy coding, and pass-through are reversible, i.e., subject to `transformdecode()`, whereas binning, missing value imputation, and omit are not.
**Table F3**: Frame data transformation types.
| | encode | decode | apply |
---|---|---|---|
RCD | * | * | * |
DCD | * | * | * |
BIN | * | x | * |
MVI | * | x | * |
OMIT | * | x | * |
Key | Meaning |
---|---|
RCD | Recoding |
DCD | Dummycoding |
BIN | Binning |
MVI | Missing value handling by imputation |
OMIT | Missing value handling by omitting |
Key | Meaning |
---|---|
* | Supported |
x | Not supported |
#### transformencode

The `transformencode()` function takes a `frame` and outputs a `matrix` based on a defined transformation specification. In addition, the corresponding metadata is output as a `frame`.
*Note: the metadata output is simply a frame so all frame operations (including read/write) can also be applied to the metadata.*
This example replaces values in specific columns to create a recoded matrix with associated frame identifying the mapping between original and substituted values. An example transformation specification file [`homes.tfspec_recode2.json`](files/dml-language-reference/homes.tfspec_recode2.json) is given below:
{
"recode": [ "zipcode", "district", "view" ]
}
The following DML utilizes the `transformencode()` function.
F1 = read("/user/ml/homes.csv", data_type="frame", format="csv");
jspec = read("/user/ml/homes.tfspec_recode2.json", data_type="scalar", value_type="string");
[X, M] = transformencode(target=F1, spec=jspec);
print(toString(X));
while(FALSE){}  # no-op loop; acts as a statement-block boundary between the two prints
print(toString(M));
The transformed matrix X and the metadata frame M are as follows.
1.000 1.000 1373.000 7.000 1.000 3.000 1.000 695.000 698.000
2.000 2.000 3261.000 6.000 2.000 2.000 1.000 902.000 906.000
3.000 3.000 1835.000 3.000 3.000 3.000 2.000 888.000 892.000
1.000 4.000 2833.000 6.000 2.500 2.000 2.000 927.000 932.000
4.000 2.000 2742.000 6.000 2.500 2.000 1.000 872.000 876.000
4.000 3.000 2195.000 5.000 2.500 2.000 1.000 799.000 803.000
5.000 3.000 3469.000 7.000 2.500 2.000 1.000 958.000 963.000
4.000 1.000 1685.000 7.000 1.500 2.000 2.000 757.000 760.000
1.000 1.000 2238.000 4.000 3.000 3.000 1.000 894.000 899.000
2.000 1.000 1245.000 4.000 1.000 1.000 1.000 547.000 549.000
5.000 2.000 3702.000 7.000 3.000 1.000 1.000 959.000 964.000
5.000 3.000 1865.000 7.000 1.000 2.000 2.000 742.000 745.000
3.000 3.000 3837.000 3.000 1.000 1.000 1.000 839.000 842.000
2.000 1.000 2139.000 3.000 1.000 3.000 2.000 820.000 824.000
1.000 3.000 3824.000 4.000 3.000 1.000 1.000 954.000 958.000
5.000 4.000 2858.000 5.000 1.500 1.000 1.000 759.000 762.000
2.000 2.000 1827.000 7.000 3.000 1.000 1.000 735.000 738.000
2.000 2.000 3557.000 2.000 2.500 1.000 1.000 888.000 892.000
2.000 2.000 2553.000 2.000 2.500 2.000 2.000 884.000 889.000
4.000 1.000 1682.000 3.000 1.500 1.000 1.000 625.000 628.000
# FRAME: nrow = 5, ncol = 9
# zipcode district sqft numbedrooms numbathrooms floors view saleprice askingprice
# STRING STRING STRING STRING STRING STRING STRING STRING STRING
96334·4 south·2 FALSE·1
95141·1 east·4 TRUE·2
98755·5 north·3
94555·3 west·1
91312·2
Note that the extra space before the `district` column in CSV2 impacts the transform specification. More specifically, transform spec1 does not match the header in CSV2. To match, either remove the extra space before `district` in CSV2 or use spec2, which quotes the `district` token name to include the extra space.
#### transformdecode

The `transformdecode()` function can be used to transform a `matrix` back into a `frame`. Only recoding, dummy coding, and pass-through transformations are reversible and can be used with `transformdecode()`. The transformations binning, missing value imputation, and omit are not reversible and cannot be used with `transformdecode()`.
The next example takes the outputs from the [transformencode](dml-language-reference.html#transformencode) example and reconstructs the original data using the same transformation specification.
F1 = read("/user/ml/homes.csv", data_type="frame", format="csv");
jspec = read("/user/ml/homes.tfspec_recode2.json", data_type="scalar", value_type="string");
[X, M] = transformencode(target=F1, spec=jspec);
F2 = transformdecode(target=X, spec=jspec, meta=M);
print(toString(F2));
# FRAME: nrow = 20, ncol = 9
# C1 C2 C3 C4 C5 C6 C7 C8 C9
# STRING STRING DOUBLE DOUBLE DOUBLE DOUBLE STRING DOUBLE DOUBLE
95141 west 1373.000 7.000 1.000 3.000 FALSE 695.000 698.000
91312 south 3261.000 6.000 2.000 2.000 FALSE 902.000 906.000
94555 north 1835.000 3.000 3.000 3.000 TRUE 888.000 892.000
95141 east 2833.000 6.000 2.500 2.000 TRUE 927.000 932.000
96334 south 2742.000 6.000 2.500 2.000 FALSE 872.000 876.000
96334 north 2195.000 5.000 2.500 2.000 FALSE 799.000 803.000
98755 north 3469.000 7.000 2.500 2.000 FALSE 958.000 963.000
96334 west 1685.000 7.000 1.500 2.000 TRUE 757.000 760.000
95141 west 2238.000 4.000 3.000 3.000 FALSE 894.000 899.000
91312 west 1245.000 4.000 1.000 1.000 FALSE 547.000 549.000
98755 south 3702.000 7.000 3.000 1.000 FALSE 959.000 964.000
98755 north 1865.000 7.000 1.000 2.000 TRUE 742.000 745.000
94555 north 3837.000 3.000 1.000 1.000 FALSE 839.000 842.000
91312 west 2139.000 3.000 1.000 3.000 TRUE 820.000 824.000
95141 north 3824.000 4.000 3.000 1.000 FALSE 954.000 958.000
98755 east 2858.000 5.000 1.500 1.000 FALSE 759.000 762.000
91312 south 1827.000 7.000 3.000 1.000 FALSE 735.000 738.000
91312 south 3557.000 2.000 2.500 1.000 FALSE 888.000 892.000
91312 south 2553.000 2.000 2.500 2.000 TRUE 884.000 889.000
96334 west 1682.000 3.000 1.500 1.000 FALSE 625.000 628.000
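
To see why recoding is reversible, the sketch below decodes the first five `zipcode` codes of X using the recode map printed in M above. It is a plain Python illustration of the idea, not how `transformdecode()` is implemented; `zipcode_meta` and `decode_column` are hypothetical names.

```python
# Recode map for the zipcode column, copied from the metadata frame M above
# (each entry is printed there as token followed by its code).
zipcode_meta = {"96334": 4, "95141": 1, "98755": 5, "94555": 3, "91312": 2}

def decode_column(codes, token_to_code):
    # Invert the token -> code map; this works because recode maps are one-to-one.
    code_to_token = {code: token for token, code in token_to_code.items()}
    return [code_to_token[int(c)] for c in codes]

# First column of the first five rows of the encoded matrix X.
x_zip = [1.0, 2.0, 3.0, 1.0, 4.0]
decoded = decode_column(x_zip, zipcode_meta)
print(decoded)  # ['95141', '91312', '94555', '95141', '96334']
```

The decoded values agree with the first column of the first five rows of F2 shown above.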
#### transformapply
In contrast to `transformencode()`, which creates and applies frame metadata (transformencode := build+apply), `transformapply()` applies *existing* metadata (transformapply := apply).
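
The build/apply split can be sketched in Python for a single recoded column. This is a conceptual illustration only: `build_meta`, `apply_meta`, and `encode` are hypothetical helpers, the sample data is invented, and assigning codes in order of first appearance is an assumption of the sketch rather than documented SystemML behavior.

```python
def build_meta(column):
    # Assign codes 1..k in order of first appearance (assumption of this sketch).
    meta = {}
    for token in column:
        meta.setdefault(token, len(meta) + 1)
    return meta

def apply_meta(column, meta):
    # Apply an existing token -> code map without modifying it.
    return [meta[token] for token in column]

def encode(column):
    # encode := build + apply
    meta = build_meta(column)
    return apply_meta(column, meta), meta

train = ["south", "east", "north", "south"]
codes, meta = encode(train)
print(codes)  # [1, 2, 3, 1]

# Later data reuses the existing metadata, so codes stay consistent.
new_codes = apply_meta(["north", "south"], meta)
print(new_codes)  # [3, 1]
```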
The following example uses `transformapply()` with the input frame and the second output (i.e., existing frame metadata built with `transformencode()`) from the [transformencode](dml-language-reference.html#transformencode) example for the [`homes.tfspec_bin2.json`](files/dml-language-reference/homes.tfspec_bin2.json) transformation specification.
{
"recode": [ "zipcode", "district", "view" ], "bin": [
{ "name": "saleprice" , "method": "equi-width", "numbins": 3 }
,{ "name": "sqft", "method": "equi-width", "numbins": 4 }]
}
F1 = read("/user/ml/homes.csv", data_type="frame", format="csv");
jspec = read("/user/ml/homes.tfspec_bin2.json", data_type="scalar", value_type="string");
[X, M] = transformencode(target=F1, spec=jspec);
X2 = transformapply(target=F1, spec=jspec, meta=M);
print(toString(X2));
1.000 1.000 1.000 7.000 1.000 3.000 1.000 1.000 698.000
2.000 2.000 1.000 6.000 2.000 2.000 1.000 1.000 906.000
3.000 3.000 1.000 3.000 3.000 3.000 2.000 1.000 892.000
1.000 4.000 1.000 6.000 2.500 2.000 2.000 1.000 932.000
4.000 2.000 1.000 6.000 2.500 2.000 1.000 1.000 876.000
4.000 3.000 1.000 5.000 2.500 2.000 1.000 1.000 803.000
5.000 3.000 1.000 7.000 2.500 2.000 1.000 1.000 963.000
4.000 1.000 1.000 7.000 1.500 2.000 2.000 1.000 760.000
1.000 1.000 1.000 4.000 3.000 3.000 1.000 1.000 899.000
2.000 1.000 1.000 4.000 1.000 1.000 1.000 1.000 549.000
5.000 2.000 1.000 7.000 3.000 1.000 1.000 1.000 964.000
5.000 3.000 1.000 7.000 1.000 2.000 2.000 1.000 745.000
3.000 3.000 1.000 3.000 1.000 1.000 1.000 1.000 842.000
2.000 1.000 1.000 3.000 1.000 3.000 2.000 1.000 824.000
1.000 3.000 1.000 4.000 3.000 1.000 1.000 1.000 958.000
5.000 4.000 1.000 5.000 1.500 1.000 1.000 1.000 762.000
2.000 2.000 1.000 7.000 3.000 1.000 1.000 1.000 738.000
2.000 2.000 1.000 2.000 2.500 1.000 1.000 1.000 892.000
2.000 2.000 1.000 2.000 2.500 2.000 2.000 1.000 889.000
4.000 1.000 1.000 3.000 1.500 1.000 1.000 1.000 628.000
* * *
## Modules
A module is a collection of UDF declarations. To call a module, use `source(...)` and `setwd(...)` to read and use its source file.
### Syntax
setwd(