Class IOCostUtils


  • public class IOCostUtils
    extends Object
    • Constructor Detail

      • IOCostUtils

        public IOCostUtils()
    • Method Detail

      • getMemReadTime

        public static double getMemReadTime​(VarStats stats,
                                            IOCostUtils.IOMetrics metrics)
        Estimate time to scan an object in memory in CP.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
      • getMemReadTime

        public static double getMemReadTime​(RDDStats stats,
                                            IOCostUtils.IOMetrics metrics)
        Estimate time to scan a distributed data set in memory on Spark. It integrates a mechanism to account for scanning spilled-over data sets on the local disk.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
      • getMemWriteTime

        public static double getMemWriteTime​(VarStats stats,
                                             IOCostUtils.IOMetrics metrics)
        Estimate time to write an object to memory in CP.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
      • getMemWriteTime

        public static double getMemWriteTime​(RDDStats stats,
                                             IOCostUtils.IOMetrics metrics)
        Estimate time to write a distributed data set to memory in CP. It does NOT integrate a mechanism to account for spill-overs.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
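
The memory estimates above reduce to a size-over-bandwidth model, with the spill-over variant adding a disk-speed term for the spilled fraction. A minimal sketch of that idea; all names, bandwidth values, and the spill-fraction parameter are illustrative assumptions, not the actual SystemDS implementation:

```java
public class MemScanSketch {
    // Hypothetical metrics: bandwidths in MB/s
    static double memBandwidthMBps = 32_000;   // main-memory scan bandwidth
    static double diskBandwidthMBps = 600;     // local-disk read bandwidth

    /** Scan time for an in-memory object: size / memory bandwidth. */
    static double memReadTime(double sizeMB) {
        return sizeMB / memBandwidthMBps;
    }

    /**
     * Scan time for a distributed data set where a fraction has spilled
     * to local disk: the spilled part is read at disk speed instead.
     */
    static double memReadTimeWithSpill(double sizeMB, double spillFraction) {
        double inMem = sizeMB * (1 - spillFraction);
        double spilled = sizeMB * spillFraction;
        return inMem / memBandwidthMBps + spilled / diskBandwidthMBps;
    }

    public static void main(String[] args) {
        System.out.println(memReadTime(16_000));                // 16 GB, fully in memory: 0.5 s
        System.out.println(memReadTimeWithSpill(16_000, 0.25)); // 25% spilled: disk term dominates
    }
}
```

The write-time variants follow the same shape with write bandwidths; only the read path models spill-overs.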
      • getFileSystemReadTime

        public static double getFileSystemReadTime​(VarStats stats,
                                                   IOCostUtils.IOMetrics metrics)
        Estimates the read time for a file on HDFS or S3 by the Control Program.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the driver node
        Returns:
        estimated time in seconds
      • getHadoopReadTime

        public static double getHadoopReadTime​(VarStats stats,
                                               IOCostUtils.IOMetrics metrics)
        Estimates the read time for a file on HDFS or S3 by the Spark cluster. It does not calculate the execution time directly from the object size, but from the assumption of full executor utilization and the maximum block size to be read by an executor core (the HDFS block size). The estimated time for such "fully utilized" reading is then multiplied by the number of slot execution rounds, since even a not fully utilized last round should take approximately as long as if all slots were assigned to an active reading task. This function cannot rely on the RDDStats, since they would not yet be initialized for the input object.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the executor node
        Returns:
        estimated time in seconds
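
The utilization-and-rounds logic described above can be sketched as follows; the cluster parameters, per-slot bandwidth, and the exact round formula are illustrative assumptions:

```java
public class HadoopReadSketch {
    /**
     * Estimate read time assuming each executor core (slot) reads at most
     * one HDFS block per round, and the last, possibly underutilized round
     * is counted as a full round.
     */
    static double hadoopReadTime(double sizeMB, double hdfsBlockMB,
                                 int numExecutors, int coresPerExecutor,
                                 double diskBandwidthMBps) {
        int totalSlots = numExecutors * coresPerExecutor;
        long numBlocks = (long) Math.ceil(sizeMB / hdfsBlockMB);
        long rounds = (long) Math.ceil((double) numBlocks / totalSlots);
        double timePerRound = hdfsBlockMB / diskBandwidthMBps; // one block per slot
        return rounds * timePerRound;
    }

    public static void main(String[] args) {
        // 10 GB input, 128 MB blocks, 4 executors x 4 cores, 200 MB/s per slot:
        // 80 blocks / 16 slots = 5 rounds * 0.64 s = 3.2 s
        System.out.println(hadoopReadTime(10_240, 128, 4, 4, 200));
    }
}
```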
      • getFileSystemWriteTime

        public static double getFileSystemWriteTime​(VarStats stats,
                                                    IOCostUtils.IOMetrics metrics)
        Estimates the time for writing a file to HDFS or S3.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the driver node
        Returns:
        estimated time in seconds
      • getHadoopWriteTime

        public static double getHadoopWriteTime​(VarStats stats,
                                                IOCostUtils.IOMetrics metrics)
        Estimates the write time for a file on HDFS or S3 by the Spark cluster. Follows the same logic as getHadoopReadTime, but here it can rely on the RDDStats, since the input object should have been initialized by the prior instruction.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the executor node
        Returns:
        estimated time in seconds
      • getSparkParallelizeTime

        public static double getSparkParallelizeTime​(RDDStats output,
                                                     IOCostUtils.IOMetrics driverMetrics,
                                                     IOCostUtils.IOMetrics executorMetrics)
        Estimates the time to parallelize a local object to Spark.
        Parameters:
        output - RDD statistics for the object to be collected/transferred.
        driverMetrics - I/O metrics for the receiver - driver node
        executorMetrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkCollectTime

        public static double getSparkCollectTime​(RDDStats output,
                                                 IOCostUtils.IOMetrics driverMetrics,
                                                 IOCostUtils.IOMetrics executorMetrics)
        Estimates the time for collecting a Spark job's output; the output RDD is transferred to the Spark driver at the end of each ResultStage; time = transfer time (which overlaps and dominates the read and deserialization times).
        Parameters:
        output - RDD statistics for the object to be collected/transferred.
        driverMetrics - I/O metrics for the receiver - driver node
        executorMetrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
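
Since the transfer overlaps and dominates the other costs, the collect estimate can be sketched as a pure network-transfer model in which all partitions funnel into the single driver; the bandwidth value and method name are assumptions for illustration:

```java
public class CollectSketch {
    /**
     * Collect time: the whole RDD is transferred to the driver, so the
     * driver's inbound network bandwidth is the bottleneck; read and
     * deserialization overlap with the transfer and are ignored.
     */
    static double collectTime(double rddSizeMB, double driverNetworkMBps) {
        return rddSizeMB / driverNetworkMBps;
    }

    public static void main(String[] args) {
        System.out.println(collectTime(2_000, 1_000)); // 2 GB over a ~1 GB/s link: 2.0 s
    }
}
```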
      • getSparkShuffleReadTime

        public static double getSparkShuffleReadTime​(RDDStats input,
                                                     IOCostUtils.IOMetrics metrics)
        Estimates the time for reading distributed RDD input at the beginning of a Stage; time = transfer time (which overlaps and dominates the read and deserialization times). For simplification it is assumed that the whole dataset is shuffled.
        Parameters:
        input - RDD statistics for the object to be shuffled at the beginning of a Stage.
        metrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkShuffleReadStaticTime

        public static double getSparkShuffleReadStaticTime​(RDDStats input,
                                                           IOCostUtils.IOMetrics metrics)
        Estimates the time for reading distributed RDD input at the beginning of a Stage when a wide transformation is partition preserving: only local disk reads occur.
        Parameters:
        input - RDD statistics for the object to be shuffled (read) at the beginning of a Stage.
        metrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkShuffleWriteTime

        public static double getSparkShuffleWriteTime​(RDDStats output,
                                                      IOCostUtils.IOMetrics metrics)
        Estimates the time for writing the RDD output to the local file system at the end of a ShuffleMapStage; time = disk write time (which overlaps and dominates the serialization time). The whole data set is written to shuffle files even if only one executor is utilized.
        Parameters:
        output - RDD statistics for the output of each ShuffleMapStage
        metrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkShuffleTime

        public static double getSparkShuffleTime​(RDDStats output,
                                                 IOCostUtils.IOMetrics metrics,
                                                 boolean withDistribution)
        Combines the shuffle write and read times, since these are typically added together to the general data-transmission time for instruction estimation.
        Parameters:
        output - RDD statistics for the output of each ShuffleMapStage
        metrics - I/O metrics for the executor nodes
        withDistribution - true if the data is actually reshuffled (the default case); false for a co-partitioned wide transformation
        Returns:
        estimated time in seconds
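
The combined shuffle estimate can be sketched as the write time plus a read time that is network-bound only when data is actually redistributed; the co-partitioned case pays only local disk I/O on the read side. All parameters and bandwidth figures below are illustrative assumptions:

```java
public class ShuffleSketch {
    static double shuffleTime(double sizeMB, double diskWriteMBps,
                              double diskReadMBps, double networkMBps,
                              boolean withDistribution) {
        // Shuffle files are always written to local disk at stage end.
        double writeTime = sizeMB / diskWriteMBps;
        // With real redistribution the read is network-bound; when the
        // wide transformation is co-partitioned, only local reads occur.
        double readTime = withDistribution
                ? sizeMB / networkMBps
                : sizeMB / diskReadMBps;
        return writeTime + readTime;
    }

    public static void main(String[] args) {
        // 1 GB shuffled: write 2.0 s + network read 4.0 s = 6.0 s
        System.out.println(shuffleTime(1_000, 500, 800, 250, true));
        // co-partitioned: write 2.0 s + local read 1.25 s = 3.25 s
        System.out.println(shuffleTime(1_000, 500, 800, 250, false));
    }
}
```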
      • getDataSource

        public static String getDataSource​(String fileName)
        Extracts the data source for a given file name, e.g. "hdfs" or "s3".
        Parameters:
        fileName - filename to parse
        Returns:
        data source type
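
A scheme-parsing helper in this spirit could look like the following; the normalization rules, the s3-variant handling, and the default scheme are assumptions for illustration, not the actual implementation:

```java
import java.net.URI;

public class DataSourceSketch {
    /**
     * Extract the scheme of a file name, e.g. "hdfs" or "s3".
     * Bare paths without a scheme default to "hdfs" (assumed).
     */
    static String getDataSource(String fileName) {
        String scheme = URI.create(fileName).getScheme();
        if (scheme == null) return "hdfs"; // assumed default for bare paths
        // Normalize s3 variants (s3, s3a, s3n) to plain "s3"
        String lower = scheme.toLowerCase();
        return lower.startsWith("s3") ? "s3" : lower;
    }

    public static void main(String[] args) {
        System.out.println(getDataSource("hdfs://nn:9000/data/X.csv")); // hdfs
        System.out.println(getDataSource("s3a://bucket/data/X.csv"));   // s3
    }
}
```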
      • isInvalidDataSource

        public static boolean isInvalidDataSource​(String identifier)