Package org.apache.sysds.resource.cost
Class IOCostUtils

java.lang.Object
    org.apache.sysds.resource.cost.IOCostUtils

public class IOCostUtils extends Object
Nested Class Summary
static class  IOCostUtils.IOMetrics
Field Summary
static long  DEFAULT_FLOPS
Constructor Summary
IOCostUtils()
Method Summary
All Methods  Static Methods  Concrete Methods

static String getDataSource(String fileName)
    Extracts the data source for a given file name, e.g. "hdfs" or "s3".

static double getFileSystemReadTime(VarStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the read time for a file on HDFS or S3 by the Control Program.

static double getFileSystemWriteTime(VarStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the time for writing a file to HDFS or S3.

static double getHadoopReadTime(VarStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the read time for a file on HDFS or S3 by the Spark cluster.

static double getHadoopWriteTime(VarStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the write time for a file on HDFS or S3 by the Spark cluster.

static double getMemReadTime(RDDStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the time to scan distributed data sets in memory on Spark.

static double getMemReadTime(VarStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the time to scan an object in memory in CP.

static double getMemWriteTime(RDDStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the time to write a distributed data set to memory in CP.

static double getMemWriteTime(VarStats stats, IOCostUtils.IOMetrics metrics)
    Estimates the time to write an object to memory in CP.

static double getSparkCollectTime(RDDStats output, IOCostUtils.IOMetrics driverMetrics, IOCostUtils.IOMetrics executorMetrics)
    Estimates the time for collecting Spark job output; the output RDD is transferred to the Spark driver at the end of each ResultStage; time = transfer time (overlaps and dominates the read and deserialization times).

static double getSparkParallelizeTime(RDDStats output, IOCostUtils.IOMetrics driverMetrics, IOCostUtils.IOMetrics executorMetrics)
    Estimates the time to parallelize a local object to Spark.

static double getSparkShuffleReadStaticTime(RDDStats input, IOCostUtils.IOMetrics metrics)
    Estimates the time for reading distributed RDD input at the beginning of a Stage when a wide transformation is partition-preserving: only local disk reads.

static double getSparkShuffleReadTime(RDDStats input, IOCostUtils.IOMetrics metrics)
    Estimates the time for reading distributed RDD input at the beginning of a Stage; time = transfer time (overlaps and dominates the read and deserialization times); for simplification it is assumed that the whole data set is shuffled.

static double getSparkShuffleTime(RDDStats output, IOCostUtils.IOMetrics metrics, boolean withDistribution)
    Combines the shuffle write and read times, since these are typically added together to the general data transmission for instruction estimation.

static double getSparkShuffleWriteTime(RDDStats output, IOCostUtils.IOMetrics metrics)
    Estimates the time for writing the RDD output to the local file system at the end of a ShuffleMapStage; time = disk write time (overlaps and dominates the serialization time); the whole data set is written to shuffle files even if only one executor is utilized.

static boolean isInvalidDataSource(String identifier)
Field Detail

DEFAULT_FLOPS
public static final long DEFAULT_FLOPS
See Also:
    Constant Field Values
Method Detail
getMemReadTime
public static double getMemReadTime(VarStats stats, IOCostUtils.IOMetrics metrics)
Estimates the time to scan an object in memory in CP.
Parameters:
    stats - object statistics
    metrics - CP node's metrics
Returns:
    estimated time in seconds
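Conceptually, this kind of estimate reduces to data volume over effective memory bandwidth. Below is a minimal sketch of that model in plain Java; the size and bandwidth values are illustrative assumptions supplied by the caller, not fields read from VarStats or IOMetrics:

    public class MemScanModel {
        // Simplified scan-time model: time = bytes / bandwidth. The real
        // IOCostUtils derives both inputs from VarStats and IOMetrics.
        static double scanTimeSeconds(long sizeBytes, double memBandwidthBytesPerSec) {
            return sizeBytes / memBandwidthBytesPerSec;
        }

        public static void main(String[] args) {
            long size = 8L * 1024 * 1024 * 1024;  // 8 GB object (assumed)
            double bandwidth = 10e9;              // 10 GB/s memory bandwidth (assumed)
            System.out.printf("estimated scan time: %.2f s%n",
                    scanTimeSeconds(size, bandwidth));
        }
    }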
getMemReadTime
public static double getMemReadTime(RDDStats stats, IOCostUtils.IOMetrics metrics)
Estimates the time to scan distributed data sets in memory on Spark. It integrates a mechanism to account for scanning spilled-over data sets on the local disk.
Parameters:
    stats - object statistics
    metrics - CP node's metrics
Returns:
    estimated time in seconds
getMemWriteTime
public static double getMemWriteTime(VarStats stats, IOCostUtils.IOMetrics metrics)
Estimates the time to write an object to memory in CP.
Parameters:
    stats - object statistics
    metrics - CP node's metrics
Returns:
    estimated time in seconds
getMemWriteTime
public static double getMemWriteTime(RDDStats stats, IOCostUtils.IOMetrics metrics)
Estimates the time to write a distributed data set to memory in CP. It does NOT integrate a mechanism to account for spill-overs.
Parameters:
    stats - object statistics
    metrics - CP node's metrics
Returns:
    estimated time in seconds
getFileSystemReadTime
public static double getFileSystemReadTime(VarStats stats, IOCostUtils.IOMetrics metrics)
Estimates the read time for a file on HDFS or S3 by the Control Program.
Parameters:
    stats - stats for the input matrix/object
    metrics - I/O metrics for the driver node
Returns:
    estimated time in seconds
getHadoopReadTime
public static double getHadoopReadTime(VarStats stats, IOCostUtils.IOMetrics metrics)
Estimates the read time for a file on HDFS or S3 by the Spark cluster. It does not derive the execution time directly from the object size, but from full executor utilization and the maximum block size to be read by an executor core (the HDFS block size). The estimated time for a "fully utilized" read is then multiplied by the number of slot execution rounds, since even when the last round is not fully utilized, it should take approximately the same time as if all slots were assigned an active reading task. This function cannot rely on the RDDStats since they would not be initialized for the input object.
Parameters:
    stats - stats for the input matrix/object
    metrics - I/O metrics for the executor node
Returns:
    estimated time in seconds
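A minimal sketch of the round-based model described above; the block size, slot count, and per-slot bandwidth are illustrative assumptions, not the values IOCostUtils actually derives from VarStats and IOMetrics:

    public class HadoopReadModel {
        // Round-based read model: every executor core (slot) reads one HDFS
        // block per round, and a partially filled last round is assumed to
        // cost as much as a full one.
        static double readTimeSeconds(long fileSizeBytes, long hdfsBlockSizeBytes,
                                      int totalSlots, double perSlotReadBytesPerSec) {
            long numBlocks = (fileSizeBytes + hdfsBlockSizeBytes - 1) / hdfsBlockSizeBytes;
            long rounds = (numBlocks + totalSlots - 1) / totalSlots;  // ceiling division
            double timePerRound = hdfsBlockSizeBytes / perSlotReadBytesPerSec;
            return rounds * timePerRound;
        }

        public static void main(String[] args) {
            // 10 GB file, 128 MB blocks, 4 executors x 8 cores, 200 MB/s per slot (all assumed)
            double t = readTimeSeconds(10L << 30, 128L << 20, 32, 200e6);
            System.out.printf("estimated Hadoop read time: %.2f s%n", t);
        }
    }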
getFileSystemWriteTime
public static double getFileSystemWriteTime(VarStats stats, IOCostUtils.IOMetrics metrics)
Estimates the time for writing a file to HDFS or S3.
Parameters:
    stats - stats for the input matrix/object
    metrics - I/O metrics for the driver node
Returns:
    estimated time in seconds
getHadoopWriteTime
public static double getHadoopWriteTime(VarStats stats, IOCostUtils.IOMetrics metrics)
Estimates the write time for a file on HDFS or S3 by the Spark cluster. Follows the same logic as getHadoopReadTime, but here it can rely on the RDDStats since the input object should have been initialized by the prior instruction.
Parameters:
    stats - stats for the input matrix/object
    metrics - I/O metrics for the executor node
Returns:
    estimated time in seconds
getSparkParallelizeTime
public static double getSparkParallelizeTime(RDDStats output, IOCostUtils.IOMetrics driverMetrics, IOCostUtils.IOMetrics executorMetrics)
Estimates the time to parallelize a local object to Spark.
Parameters:
    output - RDD statistics for the object to be parallelized/transferred
    driverMetrics - I/O metrics for the driver node
    executorMetrics - I/O metrics for the executor nodes
Returns:
    estimated time in seconds
getSparkCollectTime
public static double getSparkCollectTime(RDDStats output, IOCostUtils.IOMetrics driverMetrics, IOCostUtils.IOMetrics executorMetrics)
Estimates the time for collecting Spark job output. The output RDD is transferred to the Spark driver at the end of each ResultStage; time = transfer time, which overlaps and dominates the read and deserialization times.
Parameters:
    output - RDD statistics for the object to be collected/transferred
    driverMetrics - I/O metrics for the receiver, the driver node
    executorMetrics - I/O metrics for the executor nodes
Returns:
    estimated time in seconds
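One way to read "time = transfer time" is as a bandwidth pipeline between the executors and the driver. The sketch below is an illustration under that assumption, not the exact formula IOCostUtils uses:

    public class CollectModel {
        // The driver's download bandwidth and the executors' aggregate upload
        // bandwidth form a pipeline, so the slower side bounds the transfer.
        static double collectTimeSeconds(long rddSizeBytes, double driverNetBytesPerSec,
                                         double executorNetBytesPerSec, int numExecutors) {
            double effective = Math.min(driverNetBytesPerSec,
                                        executorNetBytesPerSec * numExecutors);
            return rddSizeBytes / effective;
        }

        public static void main(String[] args) {
            // 2 GB output, 1.25 GB/s driver NIC, 4 executors at 1.25 GB/s each (assumed)
            System.out.printf("estimated collect time: %.2f s%n",
                    collectTimeSeconds(2L << 30, 1.25e9, 1.25e9, 4));
        }
    }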
getSparkShuffleReadTime
public static double getSparkShuffleReadTime(RDDStats input, IOCostUtils.IOMetrics metrics)
Estimates the time for reading distributed RDD input at the beginning of a Stage; time = transfer time, which overlaps and dominates the read and deserialization times. For simplification it is assumed that the whole data set is shuffled.
Parameters:
    input - RDD statistics for the object to be shuffled at the beginning of a Stage
    metrics - I/O metrics for the executor nodes
Returns:
    estimated time in seconds
getSparkShuffleReadStaticTime
public static double getSparkShuffleReadStaticTime(RDDStats input, IOCostUtils.IOMetrics metrics)
Estimates the time for reading distributed RDD input at the beginning of a Stage when a wide transformation is partition-preserving: only local disk reads.
Parameters:
    input - RDD statistics for the object to be shuffled (read) at the beginning of a Stage
    metrics - I/O metrics for the executor nodes
Returns:
    estimated time in seconds
getSparkShuffleWriteTime
public static double getSparkShuffleWriteTime(RDDStats output, IOCostUtils.IOMetrics metrics)
Estimates the time for writing the RDD output to the local file system at the end of a ShuffleMapStage; time = disk write time, which overlaps and dominates the serialization time. The whole data set is written to shuffle files even if only one executor is utilized.
Parameters:
    output - RDD statistics for the output of each ShuffleMapStage
    metrics - I/O metrics for the executor nodes
Returns:
    estimated time in seconds
getSparkShuffleTime
public static double getSparkShuffleTime(RDDStats output, IOCostUtils.IOMetrics metrics, boolean withDistribution)
Combines the shuffle write and read times, since these are typically added together to the general data transmission for instruction estimation.
Parameters:
    output - RDD statistics for the output of each ShuffleMapStage
    metrics - I/O metrics for the executor nodes
    withDistribution - true if the data is actually reshuffled (the default case); false in case of a co-partitioned wide transformation
Returns:
    estimated time in seconds
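A sketch of how this combination could be composed from the two building blocks above; the bandwidth-based helpers are standalone stand-ins for the IOCostUtils methods (constructing real RDDStats and IOMetrics objects is out of scope here):

    public class ShuffleModel {
        // Assumed stand-ins for getSparkShuffleWriteTime, getSparkShuffleReadTime
        // and getSparkShuffleReadStaticTime, reduced to simple bandwidth models.
        static double shuffleWriteTime(long bytes, double diskWriteBps) { return bytes / diskWriteBps; }
        static double shuffleReadTime(long bytes, double networkBps) { return bytes / networkBps; }
        static double shuffleReadStaticTime(long bytes, double diskReadBps) { return bytes / diskReadBps; }

        // Mirrors the withDistribution switch: full reshuffle vs. local-only reads.
        static double shuffleTime(long bytes, double diskWriteBps, double diskReadBps,
                                  double networkBps, boolean withDistribution) {
            double write = shuffleWriteTime(bytes, diskWriteBps);
            double read = withDistribution
                    ? shuffleReadTime(bytes, networkBps)         // data crosses the network
                    : shuffleReadStaticTime(bytes, diskReadBps); // co-partitioned: local disk only
            return write + read;
        }

        public static void main(String[] args) {
            // 4 GB shuffled; 500 MB/s disk write, 1 GB/s disk read, 1.25 GB/s network (assumed)
            System.out.printf("reshuffled: %.2f s%n", shuffleTime(4L << 30, 500e6, 1e9, 1.25e9, true));
            System.out.printf("co-partitioned: %.2f s%n", shuffleTime(4L << 30, 500e6, 1e9, 1.25e9, false));
        }
    }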
getDataSource
public static String getDataSource(String fileName)
Extracts the data source for a given file name, e.g. "hdfs" or "s3".
Parameters:
    fileName - file name to parse
Returns:
    data source type
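A rough sketch of such an extraction based on the URI scheme; this illustrates the idea under the assumption that data sources are encoded as URI schemes, and is not the actual IOCostUtils implementation:

    import java.net.URI;
    import java.net.URISyntaxException;

    public class DataSourceSketch {
        // Returns the URI scheme (e.g. "hdfs", "s3"), or "local" when absent.
        static String dataSourceOf(String fileName) {
            try {
                String scheme = new URI(fileName).getScheme();
                return scheme == null ? "local" : scheme.toLowerCase();
            } catch (URISyntaxException e) {
                return "local"; // plain paths carry no scheme
            }
        }

        public static void main(String[] args) {
            System.out.println(dataSourceOf("hdfs://namenode:9000/data/X.csv")); // hdfs
            System.out.println(dataSourceOf("s3://bucket/data/X.csv"));          // s3
            System.out.println(dataSourceOf("/tmp/X.csv"));                      // local
        }
    }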
isInvalidDataSource
public static boolean isInvalidDataSource(String identifier)