Class RDDConverterUtils


  • public class RDDConverterUtils
    extends Object
    • Constructor Detail

      • RDDConverterUtils

        public RDDConverterUtils()
    • Method Detail

      • textCellToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> textCellToBinaryBlock(
                org.apache.spark.api.java.JavaSparkContext sc,
                org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text> input,
                DataCharacteristics mcOut,
                boolean outputEmptyBlocks,
                FileFormatPropertiesMM mmProps)
      • binaryBlockToLabeledPoints

        public static org.apache.spark.api.java.JavaRDD<org.apache.spark.ml.feature.LabeledPoint> binaryBlockToLabeledPoints(
                org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> in)
        Converts a binary block RDD into an RDD of labeled points. The input must first be reblocked to satisfy the 'clen <= blen' constraint.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        Returns:
        JavaRDD of labeled points
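
        The 'clen <= blen' constraint can be illustrated with a small standalone check. This is only a sketch: it assumes that clen denotes the number of matrix columns and blen the block size, so that the constraint means every row fits into a single column block. The class and method names are hypothetical and not part of RDDConverterUtils.

```java
/** Illustrative check of the 'clen <= blen' constraint; not part of the documented API. */
final class RowBlockConstraintSketch {

    /** Number of column blocks needed for clen columns at block size blen (ceiling division). */
    static long numColumnBlocks(long clen, int blen) {
        if (clen < 0 || blen <= 0) {
            throw new IllegalArgumentException("clen must be >= 0 and blen must be > 0");
        }
        return (clen + blen - 1) / blen;
    }

    /** True if each row is contained in a single block, i.e. clen <= blen (assumed meaning). */
    static boolean rowsFitInOneBlock(long clen, int blen) {
        return numColumnBlocks(clen, blen) <= 1;
    }

    public static void main(String[] args) {
        System.out.println(rowsFitInOneBlock(784, 1000));   // prints "true"
        System.out.println(rowsFitInOneBlock(2500, 1000));  // prints "false"
        System.out.println(numColumnBlocks(2500, 1000));    // prints "3"
    }
}
```

        Under this reading, a matrix with 2500 columns and a block size of 1000 would span three column blocks per row and would need to be reblocked before conversion.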
      • csvToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> csvToBinaryBlock(
                org.apache.spark.api.java.JavaSparkContext sc,
                org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text> input,
                DataCharacteristics mc,
                boolean hasHeader,
                String delim,
                boolean fill,
                double fillValue,
                Set<String> naStrings)
      • csvToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> csvToBinaryBlock(
                org.apache.spark.api.java.JavaSparkContext sc,
                org.apache.spark.api.java.JavaRDD<String> input,
                DataCharacteristics mcOut,
                boolean hasHeader,
                String delim,
                boolean fill,
                double fillValue,
                Set<String> naStrings)
      • dataFrameToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> dataFrameToBinaryBlock(
                org.apache.spark.api.java.JavaSparkContext sc,
                org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df,
                DataCharacteristics mc,
                boolean containsID,
                boolean isVector)
      • binaryBlockToDataFrame

        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> binaryBlockToDataFrame(
                org.apache.spark.sql.SparkSession sparkSession,
                org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> in,
                DataCharacteristics mc,
                boolean toVector)
      • binaryBlockToDataFrame

        @Deprecated
        public static org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> binaryBlockToDataFrame(
                org.apache.spark.sql.SQLContext sqlContext,
                org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> in,
                DataCharacteristics mc,
                boolean toVector)
        Deprecated.
      • libsvmToBinaryBlock

        public static void libsvmToBinaryBlock(
                org.apache.spark.api.java.JavaSparkContext sc,
                String pathIn,
                String pathX,
                String pathY,
                DataCharacteristics mcOutX)
        Converts a libsvm text input file into two binary block matrices, one for features and one for labels, and saves them to the specified output files. This call also deletes any existing files at the specified output locations, and determines and writes the metadata files of both output matrices.

        Note: We use org.apache.spark.mllib.util.MLUtils.loadLibSVMFile for parsing the libsvm input files in order to ensure consistency with Spark.

        Parameters:
        sc - Java Spark context
        pathIn - path to libsvm input file
        pathX - path to binary block output file of features
        pathY - path to binary block output file of labels
        mcOutX - matrix characteristics of output matrix X
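
        To make the split into a feature matrix X and a label vector Y concrete, the following standalone sketch parses a single line in the libsvm text format (a label followed by index:value pairs with 1-based indices). It is not the parsing logic of MLUtils.loadLibSVMFile or of this class. The class name, the parseLine helper, and the numFeatures parameter are hypothetical, and comments, blank lines, and malformed input are not handled.

```java
import java.util.Arrays;

/** Illustrative parser for one libsvm-formatted line; not the documented implementation. */
final class LibsvmLineSketch {

    /** A parsed row: one label (destined for Y) and a dense feature vector (destined for X). */
    static final class Parsed {
        final double label;
        final double[] features;

        Parsed(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Parses "label idx:val idx:val ..." into a label and a dense vector of length numFeatures. */
    static Parsed parseLine(String line, int numFeatures) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        double[] features = new double[numFeatures]; // unspecified entries stay 0.0
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            int index = Integer.parseInt(pair[0]) - 1; // libsvm indices are 1-based
            features[index] = Double.parseDouble(pair[1]);
        }
        return new Parsed(label, features);
    }

    public static void main(String[] args) {
        Parsed p = parseLine("1 1:0.5 3:2.0", 4);
        // prints "1.0 [0.5, 0.0, 2.0, 0.0]"
        System.out.println(p.label + " " + Arrays.toString(p.features));
    }
}
```

        In this sketch the label would end up in one row of Y and the feature vector in the corresponding row of X.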
      • stringToSerializableText

        public static org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text> stringToSerializableText(
                org.apache.spark.api.java.JavaPairRDD<Long, String> in)
      • libsvmToBinaryBlock

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes, MatrixBlock> libsvmToBinaryBlock(
                org.apache.spark.api.java.JavaSparkContext sc,
                org.apache.spark.api.java.JavaPairRDD<org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text> input,
                DataCharacteristics mc,
                String delim,
                String indexDelim)