public class RDDConverterUtils extends Object
| Modifier and Type | Class and Description |
|---|---|
static class |
RDDConverterUtils.DataFrameExtractIDFunction |
| Modifier and Type | Field and Description |
|---|---|
static String |
DF_ID_COLUMN |
| Constructor and Description |
|---|
RDDConverterUtils() |
| Modifier and Type | Method and Description |
|---|---|
static org.apache.spark.api.java.JavaRDD<String> |
binaryBlockToCsv(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
MatrixCharacteristics mcIn,
CSVFileFormatProperties props,
boolean strict) |
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
binaryBlockToDataFrame(org.apache.spark.sql.SparkSession sparkSession,
org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
MatrixCharacteristics mc,
boolean toVector) |
static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> |
binaryBlockToDataFrame(org.apache.spark.sql.SQLContext sqlContext,
org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
MatrixCharacteristics mc,
boolean toVector)
Deprecated.
|
static org.apache.spark.api.java.JavaRDD<org.apache.spark.ml.feature.LabeledPoint> |
binaryBlockToLabeledPoints(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in)
Converter from binary block rdd to rdd of labeled points.
|
static org.apache.spark.api.java.JavaRDD<String> |
binaryBlockToTextCell(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
MatrixCharacteristics mc) |
static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> |
binaryCellToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixCell> input,
MatrixCharacteristics mcOut,
boolean outputEmptyBlocks) |
static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> |
csvToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> input,
MatrixCharacteristics mc,
boolean hasHeader,
String delim,
boolean fill,
double fillValue) |
static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> |
csvToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
org.apache.spark.api.java.JavaRDD<String> input,
MatrixCharacteristics mcOut,
boolean hasHeader,
String delim,
boolean fill,
double fillValue)
Example usage:
|
static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> |
dataFrameToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df,
MatrixCharacteristics mc,
boolean containsID,
boolean isVector) |
static void |
libsvmToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
String pathIn,
String pathX,
String pathY,
MatrixCharacteristics mcOutX)
Converts a libsvm text input file into two binary block matrices for features
and labels, and saves these to the specified output files.
|
static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> |
stringToSerializableText(org.apache.spark.api.java.JavaPairRDD<Long,String> in) |
static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> |
textCellToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> input,
MatrixCharacteristics mcOut,
boolean outputEmptyBlocks) |
public static final String DF_ID_COLUMN
public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> textCellToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc, org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> input, MatrixCharacteristics mcOut, boolean outputEmptyBlocks) throws DMLRuntimeException
Throws: DMLRuntimeException
public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> binaryCellToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc, org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixCell> input, MatrixCharacteristics mcOut, boolean outputEmptyBlocks) throws DMLRuntimeException
Throws: DMLRuntimeException
public static org.apache.spark.api.java.JavaRDD<org.apache.spark.ml.feature.LabeledPoint> binaryBlockToLabeledPoints(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in)
Parameters:
in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
public static org.apache.spark.api.java.JavaRDD<String> binaryBlockToTextCell(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in, MatrixCharacteristics mc)
public static org.apache.spark.api.java.JavaRDD<String> binaryBlockToCsv(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in, MatrixCharacteristics mcIn, CSVFileFormatProperties props, boolean strict)
public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> csvToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc, org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.Text> input, MatrixCharacteristics mc, boolean hasHeader, String delim, boolean fill, double fillValue) throws DMLRuntimeException
Throws: DMLRuntimeException
public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> csvToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc, org.apache.spark.api.java.JavaRDD<String> input, MatrixCharacteristics mcOut, boolean hasHeader, String delim, boolean fill, double fillValue) throws DMLRuntimeException
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
import org.apache.sysml.runtime.matrix.MatrixCharacteristics
import org.apache.spark.api.java.JavaSparkContext
val A = sc.textFile("ranA.csv")
val Amc = new MatrixCharacteristics
val Abin = RDDConverterUtils.csvToBinaryBlock(new JavaSparkContext(sc), A, Amc, false, ",", false, 0)
Parameters:
sc - java spark context
input - rdd of strings
mcOut - matrix characteristics
hasHeader - if true, has header
delim - delimiter as a string
fill - if true, fill in empty values with fillValue
fillValue - fill value used to fill empty values
Returns: JavaPairRDD<MatrixIndexes, MatrixBlock>
Throws: DMLRuntimeException - if DMLRuntimeException occurs
public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> dataFrameToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc, org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df, MatrixCharacteristics mc, boolean containsID, boolean isVector)
public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> binaryBlockToDataFrame(org.apache.spark.sql.SparkSession sparkSession,
org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in,
MatrixCharacteristics mc,
boolean toVector)
@Deprecated public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> binaryBlockToDataFrame(org.apache.spark.sql.SQLContext sqlContext, org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,MatrixBlock> in, MatrixCharacteristics mc, boolean toVector)
public static void libsvmToBinaryBlock(org.apache.spark.api.java.JavaSparkContext sc,
String pathIn,
String pathX,
String pathY,
MatrixCharacteristics mcOutX)
throws DMLRuntimeException
Note: We use org.apache.spark.mllib.util.MLUtils.loadLibSVMFile for parsing
the libsvm input files in order to ensure consistency with Spark.
Parameters:
sc - java spark context
pathIn - path to libsvm input file
pathX - path to binary block output file of features
pathY - path to binary block output file of labels
mcOutX - matrix characteristics of output matrix X
Throws: DMLRuntimeException - if output path not writable or conversion failure

Copyright © 2017 The Apache Software Foundation. All rights reserved.