There are various forms of parallelism. Figure 1 shows four simple kinds of parallelism graphically. A few key ideas can be derived from applying these parallelism structures to problems in intensive data processing:
Figure 1. Types of parallelism
- Dividing problems is the essence of parallelism. Division into independent subproblems gives independent parallelism, while dividing into incremental computations gives pipeline parallelism. Set mappings naturally adapt to independent parallelism (the same instruction is independently applied to each element of a set) while stream mappings adapt to pipeline parallelism(some instructions are successively applied to each element of a stream).Thus, sets and streams suggest a divide-and-conquer format for specifying mappings which is implicitly also a format for specifying parallelism.
- Divide-and-conquer computations can be represented by combining these types of parallelism. “Dividing” a problem is represented by fan-out nodes in the graph, while conquering gathers results into a set (with independent parallelism), a stream (with pipeline parallelism), and/or an aggregate (with fan-in parallelism). Thus, divide-and-conquer solutions of problems naturally capture these kinds of parallelism.
- Relational algebra operators can often be naturally expressed as divide-and-conquer computations.
These ideas raise hope for a parallel data processing system that rests upon divide-and-conquer techniques. However, such a system must deal with several technical issues to be viable. A first problem is that the relational model offers no way to talk about order among data (e.g., sorted relations, or ordered tuples). Relational languages are therefore inadequate for specifying “stream processing” in which ordered sequences of data are processed sequentially. Hence, streams cannot be exploited to specify pipeline parallelism following a data-parallelism paradigm.Pipeline parallelism is generally used, transparently to the user, in lower-level languages implementing relational algebra (e.g., PLERA or PFAD).
A second problem is that parallel data processing requires effective data partitioning capabilities. Typically, a relational query (select-project-join expression)is translated into a low-level form of relational algebra with explicit (low-level)parallel constructs. Data partitioning is used to spread the computation of relational algebra operators among parallel processors. This partitioning is typically defined during the physical database design and then exploited by a compiler. Most of the time, a partitioned computation requires that processors exchange intermediate results in order to compute the final result.
Ideally, data partitioning must be expressible by the programmer or the compiler within a parallel database language. This is essential to automatically extract parallelism and lead to efficient implementations on parallel database systems.Specifying parallel computations over relations often requires specifying how data partitioning (fan-out parallelism) will be done and how distributed results will be collected (fan-in parallelism). Database models have been developed before that permit expression of both ordering among tuples and data partitioning. For example, the FAD language of Bubba has operators that express various forms of fan-out and fan-in parallelism. FAD is a strongly typed set-oriented database language based on functional programming and relational algebra. It provides a fixed set of higher-order functions to aggregate functions, like the pump parametrized aggregate operator and the grouping operator. The pump operator applies a unary function to each element of a set, producing an intermediate set which is then “reduced” to a single datum using a binary function that combines the intermediate set elements. Indeed, pump naturally expresses a special case of fan-out and fan-in parallelism. At the same time, the group operator permits set partitioning.
The SVP model goes one step further in allowing sets, streams and parallelism to be captured in a unified framework formalizing divide-and-conquer mappings. SVP models collections, whose specialization leads to sets or streams, as series-parallel graphs which ease expressing parallel data processing. An important class of queries called transducers generalizes aggregate operations and set or stream operations. They lead to high-level specification of independent and pipeline parallelism. Thus, SVP is a possible formal foundation for further research in parallel data programming languages.