Class RDDAggregateUtils


  • public class RDDAggregateUtils
    extends Object
    Collection of utility methods for aggregating binary block rdds. As a general policy always call stable algorithms which maintain corrections over blocks per key. The performance overhead over a simple reducebykey is roughly 7-10% and with that acceptable.
    • Constructor Detail

      • RDDAggregateUtils

        public RDDAggregateUtils()
    • Method Detail

      • sumCellsByKeyStable

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​Double> sumCellsByKeyStable​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​Double> in)
      • sumCellsByKeyStable

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​Double> sumCellsByKeyStable​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​Double> in,
                                                                                                            int numParts)
      • aggStable

        public static MatrixBlock aggStable​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> in,
                                            AggregateOperator aop)
        Single block aggregation over pair rdds with corrections for numerical stability.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        aop - aggregate operator
        Returns:
        matrix block
      • aggStable

        public static MatrixBlock aggStable​(org.apache.spark.api.java.JavaRDD<MatrixBlock> in,
                                            AggregateOperator aop)
        Single block aggregation over rdds with corrections for numerical stability.
        Parameters:
        in - matrix as JavaRDD<MatrixBlock>
        aop - aggregate operator
        Returns:
        matrix block
      • aggStableTensor

        public static TensorBlock aggStableTensor​(org.apache.spark.api.java.JavaPairRDD<TensorIndexes,​TensorBlock> in,
                                                  AggregateOperator aop)
        Single block aggregation over pair rdds with corrections for numerical stability.
        Parameters:
        in - tensor as JavaPairRDD<TensorIndexes, TensorBlock>
        aop - aggregate operator
        Returns:
        tensor block
      • aggStableTensor

        public static TensorBlock aggStableTensor​(org.apache.spark.api.java.JavaRDD<TensorBlock> in,
                                                  AggregateOperator aop)
        Single block aggregation over rdds with corrections for numerical stability.
        Parameters:
        in - tensor as JavaRDD<TensorBlock>
        aop - aggregate operator
        Returns:
        tensor block
      • mergeByKey

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> mergeByKey​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> in)
        Merges disjoint data of all blocks per key. Note: The behavior of this method is undefined for both sparse and dense data if the assumption of disjoint data is violated.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        Returns:
        matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
      • mergeByKey

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> mergeByKey​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> in,
                                                                                                        boolean deepCopyCombiner)
        Merges disjoint data of all blocks per key. Note: The behavior of this method is undefined for both sparse and dense data if the assumption of disjoint data is violated.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        deepCopyCombiner - indicator if the createCombiner functions needs to deep copy the input block
        Returns:
        matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
      • mergeByKey

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> mergeByKey​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> in,
                                                                                                        int numPartitions,
                                                                                                        boolean deepCopyCombiner)
        Merges disjoint data of all blocks per key. Note: The behavior of this method is undefined for both sparse and dense data if the assumption of disjoint data is violated.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
        numPartitions - number of output partitions
        deepCopyCombiner - indicator if the createCombiner functions needs to deep copy the input block
        Returns:
        matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>
      • mergeRowsByKey

        public static org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​MatrixBlock> mergeRowsByKey​(org.apache.spark.api.java.JavaPairRDD<MatrixIndexes,​RowMatrixBlock> in)
        Merges disjoint data of all blocks per key. Note: The behavior of this method is undefined for both sparse and dense data if the assumption of disjoint data is violated.
        Parameters:
        in - matrix as JavaPairRDD<MatrixIndexes, RowMatrixBlock>
        Returns:
        matrix as JavaPairRDD<MatrixIndexes, MatrixBlock>