Class IOCostUtils


  • public class IOCostUtils
    extends Object
    • Constructor Detail

      • IOCostUtils

        public IOCostUtils()
    • Method Detail

      • getMemReadTime

        public static double getMemReadTime​(VarStats stats,
                                            IOCostUtils.IOMetrics metrics)
        Estimate time to scan an object in memory in CP.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
      • getMemReadTime

        public static double getMemReadTime​(RDDStats stats,
                                            IOCostUtils.IOMetrics metrics)
        Estimate time to scan a distributed data set in memory on Spark. It integrates a mechanism to account for scanning spilled-over data sets on the local disk.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
      • getMemWriteTime

        public static double getMemWriteTime​(VarStats stats,
                                             IOCostUtils.IOMetrics metrics)
        Estimate time to write an object to memory in CP.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
      • getMemWriteTime

        public static double getMemWriteTime​(RDDStats stats,
                                             IOCostUtils.IOMetrics metrics)
        Estimate time to write a distributed data set to memory in CP. It does NOT integrate a mechanism to account for spill-overs.
        Parameters:
        stats - object statistics
        metrics - CP node's metrics
        Returns:
        estimated time in seconds
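
The memory estimates above reduce to a size-over-bandwidth model, with the spill-over variant adding a disk-speed term for the spilled fraction. A minimal sketch of that idea; all names, bandwidth values, and the spill-fraction parameter are illustrative assumptions, not the actual SystemDS implementation:

```java
public class MemScanSketch {
    // Hypothetical metrics: bandwidths in MB/s
    static double memBandwidthMBps = 32_000;   // main-memory scan bandwidth
    static double diskBandwidthMBps = 600;     // local-disk read bandwidth

    /** Scan time for an in-memory object: size / memory bandwidth. */
    static double memReadTime(double sizeMB) {
        return sizeMB / memBandwidthMBps;
    }

    /**
     * Scan time for a distributed data set where a fraction has spilled
     * to local disk: the spilled part is read at disk speed instead.
     */
    static double memReadTimeWithSpill(double sizeMB, double spillFraction) {
        double inMem = sizeMB * (1 - spillFraction);
        double spilled = sizeMB * spillFraction;
        return inMem / memBandwidthMBps + spilled / diskBandwidthMBps;
    }

    public static void main(String[] args) {
        System.out.println(memReadTime(16_000));                // 16 GB, fully in memory: 0.5 s
        System.out.println(memReadTimeWithSpill(16_000, 0.25)); // 25% spilled: disk term dominates
    }
}
```

The write-time variants follow the same shape with write bandwidths; only the read path models spill-overs.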
      • getFileSystemReadTime

        public static double getFileSystemReadTime​(VarStats stats,
                                                   IOCostUtils.IOMetrics metrics)
        Estimates the read time for a file on HDFS or S3 by the Control Program.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the driver node
        Returns:
        estimated time in seconds
      • getHadoopReadTime

        public static double getHadoopReadTime​(VarStats stats,
                                               IOCostUtils.IOMetrics metrics)
        Estimates the read time for a file on HDFS or S3 by the Spark cluster. It does not calculate the execution time directly from the object size, but from the assumption of full executor utilization and the maximum block size to be read by an executor core (the HDFS block size). The estimated time for such "fully utilized" reading is then multiplied by the number of slot execution rounds, since even a not fully utilized last round should take approximately as long as if all slots were assigned to an active reading task. This function cannot rely on the RDDStats, since they would not yet be initialized for the input object.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the executor node
        Returns:
        estimated time in seconds
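
The utilization-and-rounds logic described above can be sketched as follows; the cluster parameters, per-slot bandwidth, and the exact round formula are illustrative assumptions:

```java
public class HadoopReadSketch {
    /**
     * Estimate read time assuming each executor core (slot) reads at most
     * one HDFS block per round, and the last, possibly underutilized round
     * is counted as a full round.
     */
    static double hadoopReadTime(double sizeMB, double hdfsBlockMB,
                                 int numExecutors, int coresPerExecutor,
                                 double diskBandwidthMBps) {
        int totalSlots = numExecutors * coresPerExecutor;
        long numBlocks = (long) Math.ceil(sizeMB / hdfsBlockMB);
        long rounds = (long) Math.ceil((double) numBlocks / totalSlots);
        double timePerRound = hdfsBlockMB / diskBandwidthMBps; // one block per slot
        return rounds * timePerRound;
    }

    public static void main(String[] args) {
        // 10 GB input, 128 MB blocks, 4 executors x 4 cores, 200 MB/s per slot:
        // 80 blocks / 16 slots = 5 rounds * 0.64 s = 3.2 s
        System.out.println(hadoopReadTime(10_240, 128, 4, 4, 200));
    }
}
```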
      • getFileSystemWriteTime

        public static double getFileSystemWriteTime​(VarStats stats,
                                                    IOCostUtils.IOMetrics metrics)
        Estimates the time for writing a file to HDFS or S3.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the driver node
        Returns:
        estimated time in seconds
      • getHadoopWriteTime

        public static double getHadoopWriteTime​(VarStats stats,
                                                IOCostUtils.IOMetrics metrics)
        Estimates the write time for a file on HDFS or S3 by the Spark cluster. Follows the same logic as getHadoopReadTime, but here it can rely on the RDDStats, since the input object should have been initialized by the prior instruction.
        Parameters:
        stats - stats for the input matrix/object
        metrics - I/O metrics for the executor node
        Returns:
        estimated time in seconds
      • getSparkParallelizeTime

        public static double getSparkParallelizeTime​(RDDStats output,
                                                     IOCostUtils.IOMetrics driverMetrics,
                                                     IOCostUtils.IOMetrics executorMetrics)
        Estimates the time to parallelize a local object to Spark.
        Parameters:
        output - RDD statistics for the object to be collected/transferred.
        driverMetrics - I/O metrics for the receiver - driver node
        executorMetrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkCollectTime

        public static double getSparkCollectTime​(RDDStats output,
                                                 IOCostUtils.IOMetrics driverMetrics,
                                                 IOCostUtils.IOMetrics executorMetrics)
        Estimates the time for collecting a Spark job's output; the output RDD is transferred to the Spark driver at the end of each ResultStage; time = transfer time (which overlaps and dominates the read and deserialization times).
        Parameters:
        output - RDD statistics for the object to be collected/transferred.
        driverMetrics - I/O metrics for the receiver - driver node
        executorMetrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
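
Since the transfer overlaps and dominates the other costs, the collect estimate can be sketched as a pure network-transfer model in which all partitions funnel into the single driver; the bandwidth value and method name are assumptions for illustration:

```java
public class CollectSketch {
    /**
     * Collect time: the whole RDD is transferred to the driver, so the
     * driver's inbound network bandwidth is the bottleneck; read and
     * deserialization overlap with the transfer and are ignored.
     */
    static double collectTime(double rddSizeMB, double driverNetworkMBps) {
        return rddSizeMB / driverNetworkMBps;
    }

    public static void main(String[] args) {
        System.out.println(collectTime(2_000, 1_000)); // 2 GB over a ~1 GB/s link: 2.0 s
    }
}
```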
      • getSparkShuffleReadTime

        public static double getSparkShuffleReadTime​(RDDStats input,
                                                     IOCostUtils.IOMetrics metrics)
        Estimates the time for reading distributed RDD input at the beginning of a Stage; time = transfer time (which overlaps and dominates the read and deserialization times). For simplification it is assumed that the whole dataset is shuffled.
        Parameters:
        input - RDD statistics for the object to be shuffled at the beginning of a Stage.
        metrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkShuffleReadStaticTime

        public static double getSparkShuffleReadStaticTime​(RDDStats input,
                                                           IOCostUtils.IOMetrics metrics)
        Estimates the time for reading distributed RDD input at the beginning of a Stage when a wide transformation is partition preserving: only local disk reads occur.
        Parameters:
        input - RDD statistics for the object to be shuffled (read) at the beginning of a Stage.
        metrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkShuffleWriteTime

        public static double getSparkShuffleWriteTime​(RDDStats output,
                                                      IOCostUtils.IOMetrics metrics)
        Estimates the time for writing the RDD output to the local file system at the end of a ShuffleMapStage; time = disk write time (which overlaps and dominates the serialization time). The whole data set is written to shuffle files even if only one executor is utilized.
        Parameters:
        output - RDD statistics for the output of each ShuffleMapStage
        metrics - I/O metrics for the executor nodes
        Returns:
        estimated time in seconds
      • getSparkShuffleTime

        public static double getSparkShuffleTime​(RDDStats output,
                                                 IOCostUtils.IOMetrics metrics,
                                                 boolean withDistribution)
        Combines the shuffle write and read times, since these are typically added together to the general data-transmission time for instruction estimation.
        Parameters:
        output - RDD statistics for the output of each ShuffleMapStage
        metrics - I/O metrics for the executor nodes
        withDistribution - true if the data is actually reshuffled (the default case); false for a co-partitioned wide transformation
        Returns:
        estimated time in seconds
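
The combined shuffle estimate can be sketched as the write time plus a read time that is network-bound only when data is actually redistributed; the co-partitioned case pays only local disk I/O on the read side. All parameters and bandwidth figures below are illustrative assumptions:

```java
public class ShuffleSketch {
    static double shuffleTime(double sizeMB, double diskWriteMBps,
                              double diskReadMBps, double networkMBps,
                              boolean withDistribution) {
        // Shuffle files are always written to local disk at stage end.
        double writeTime = sizeMB / diskWriteMBps;
        // With real redistribution the read is network-bound; when the
        // wide transformation is co-partitioned, only local reads occur.
        double readTime = withDistribution
                ? sizeMB / networkMBps
                : sizeMB / diskReadMBps;
        return writeTime + readTime;
    }

    public static void main(String[] args) {
        // 1 GB shuffled: write 2.0 s + network read 4.0 s = 6.0 s
        System.out.println(shuffleTime(1_000, 500, 800, 250, true));
        // co-partitioned: write 2.0 s + local read 1.25 s = 3.25 s
        System.out.println(shuffleTime(1_000, 500, 800, 250, false));
    }
}
```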
      • getDataSource

        public static String getDataSource​(String fileName)
        Extracts the data source for a given file name, e.g. "hdfs" or "s3".
        Parameters:
        fileName - filename to parse
        Returns:
        data source type
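
A scheme-parsing helper in this spirit could look like the following; the normalization rules, the s3-variant handling, and the default scheme are assumptions for illustration, not the actual implementation:

```java
import java.net.URI;

public class DataSourceSketch {
    /**
     * Extract the scheme of a file name, e.g. "hdfs" or "s3".
     * Bare paths without a scheme default to "hdfs" (assumed).
     */
    static String getDataSource(String fileName) {
        String scheme = URI.create(fileName).getScheme();
        if (scheme == null) return "hdfs"; // assumed default for bare paths
        // Normalize s3 variants (s3, s3a, s3n) to plain "s3"
        String lower = scheme.toLowerCase();
        return lower.startsWith("s3") ? "s3" : lower;
    }

    public static void main(String[] args) {
        System.out.println(getDataSource("hdfs://nn:9000/data/X.csv")); // hdfs
        System.out.println(getDataSource("s3a://bucket/data/X.csv"));   // s3
    }
}
```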
      • isInvalidDataSource

        public static boolean isInvalidDataSource​(String identifier)