SystemDS Architecture
Algorithms in Apache SystemDS are written in a high-level, R-like language called Declarative Machine Learning Language (DML)
or a high-level, Python-like language called PyDML.
SystemDS compiles and optimizes these algorithms into hybrid runtime
plans of multi-threaded, in-memory operations on a single node (scale-up) and distributed Spark operations on
a cluster of nodes (scale-out). SystemDS's high-level architecture consists of the following components:
Language
DML (with either R- or Python-like syntax) provides linear algebra primitives, a rich set of statistical
functions and matrix manipulations,
as well as user-defined and external functions, control structures including parfor loops, and recursion.
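A minimal DML sketch illustrating these primitives (the variable names and values are illustrative, not from the source):

```
# linear algebra primitives and aggregates on random input data
X = rand(rows=1000, cols=100, sparsity=0.5);
y = rand(rows=1000, cols=1);

s = rowSums(X);     # unary aggregate
G = t(X) %*% X;     # matrix multiplication
Z = X + X;          # cell-wise binary operation

# control structure: a simple loop accumulating a scalar
acc = 0.0;
for (i in 1:10) {
  acc = acc + sum(X * i);
}
print("result: " + acc);
```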
The user provides the DML script through one of the following APIs:
- Command-line interface (`DMLScript`)
- Convenient programmatic interface for Spark users (`MLContext`)
- Java Machine Learning Connector API (`Connection`)
`ParserWrapper` performs syntactic validation and parses the input DML script using ANTLR into a hierarchy of
`StatementBlock` and `Statement` as defined by control structures.
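For intuition, each control structure delimits its own statement block; a minimal DML sketch (the block boundaries indicated in the comments are illustrative):

```
A = rand(rows=10, cols=10);   # statement block 1: straight-line code
B = A %*% A;                  # still statement block 1

if (sum(B) > 100) {           # the if statement forms its own block,
  B = B / sum(B);             # with the body as a nested statement block
}

for (i in 1:3) {              # likewise for loops: predicate plus
  B = B + A;                  # nested body block(s)
}

print(sum(B));                # trailing statement block
```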
Another important class of the language component is `DMLTranslator`, which performs live variable analysis and semantic validation.
During that process we also retrieve input data characteristics -- i.e., format,
number of rows, columns, and non-zero values -- as well as
infrastructure characteristics, which are used for subsequent
optimizations. Finally, we construct directed acyclic graphs (DAGs)
of high-level operators (`Hop`) per statement block.
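A hedged sketch of the HOP DAG for a single statement block (the operator descriptions in the comments follow the categories above; exact internal operator names may differ across versions):

```
X = rand(rows=1000, cols=100);
y = rand(rows=100, cols=1);
# this single statement block yields one HOP DAG, roughly:
#   read(X), read(y) -> matrix-multiply HOP -> full-aggregate sum HOP -> write s
s = sum(X %*% y);
print(s);
```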
Optimizer
The SystemDS optimizer works over programs of HOP DAGs, where HOPs are operators on
matrices or scalars, and are categorized according to their
access patterns. Examples are matrix multiplications, unary
aggregates like rowSums(), binary operations like cell-wise
matrix additions, reorganization operations like transpose or
sort, and more specific operations. We perform various optimizations
on these HOP DAGs, including algebraic simplification rewrites (`ProgramRewriter`),
intra- and inter-procedural analysis (`InterProceduralAnalysis`)
for statistics propagation into functions and over entire programs, and
operator ordering of matrix multiplication chains. We compute
memory estimates for all HOPs, reflecting the memory
requirements of in-memory single-node operations and
intermediates. Each HOP DAG is compiled to a DAG of
low-level operators (`Lop`) such as grouping and aggregate,
which are backend-specific physical operators. Operator selection
picks the best physical operators for a given HOP
based on memory estimates, data, and cluster characteristics.
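For example, a single matrix-multiplication HOP can be realized by different physical operators; a sketch in DML (the operator names in the comments, such as tsmm and mapmm, follow those commonly documented for SystemML/SystemDS and may differ across versions):

```
X = rand(rows=100000, cols=100);
# For t(X) %*% X, operator selection can pick a dedicated single-node
# transpose-self matrix multiply (tsmm) when inputs and the 100x100 output
# fit the memory budget; for larger data, distributed Spark operators
# (e.g., broadcast-based mapmm vs. shuffle-based cpmm/rmm) are chosen
# based on memory estimates and cluster characteristics.
C = t(X) %*% X;
```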
Individual LOPs have corresponding runtime implementations,
called instructions, and the optimizer generates
an executable runtime program of instructions.
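Similarly, the matrix-multiplication chain ordering mentioned above can drastically reduce intermediate sizes; a small DML sketch (matrix sizes are illustrative):

```
X = rand(rows=100000, cols=1000);
Y = rand(rows=1000, cols=1000);
v = rand(rows=1000, cols=1);
# Evaluated left to right, (X %*% Y) %*% v creates a dense 100000x1000
# intermediate; the optimizer reorders the chain to X %*% (Y %*% v),
# whose largest intermediate is only 1000x1.
p = X %*% Y %*% v;
```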
Runtime
We execute the generated runtime program locally
in CP (control program), i.e., within a driver process.
This driver handles recompilation, runs in-memory single-node
instructions (`CPInstruction`, some of which are multi-threaded),
maintains an in-memory buffer pool, and launches Spark jobs if the runtime plan contains distributed computations
in the form of Spark instructions (`SPInstruction`).
For the Spark backend, we rely on Spark's lazy evaluation and stage construction.
CP instructions may also be backed by GPU kernels (`GPUInstruction`).
The multi-level buffer pool caches local matrices in memory,
evicts them if necessary, and handles data exchange between
local and distributed runtime backends.
The core of SystemDS's runtime instructions is an adaptive matrix block library,
which is sparsity-aware and operates on the entire matrix in CP, or blocks of a matrix in a distributed setting. Further
key features include parallel for loops (parfor) for task-parallel
computations, and dynamic recompilation for adapting runtime plans when data characteristics are initially unknown.
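A short parfor sketch of such a task-parallel computation (column-wise aggregation; whether tasks run as local threads or distributed jobs is decided by the runtime plan):

```
X = rand(rows=10000, cols=100);
R = matrix(0, rows=1, cols=ncol(X));

# the iterations are independent, so they can execute as parallel tasks
parfor (i in 1:ncol(X)) {
  R[1, i] = sum(X[, i] ^ 2);
}
print(sum(R));
```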