Posts Tagged ‘Parallel database

Parallel database languages

There are various forms of parallelism. Figure 1 shows four simple kinds of parallelism graphically. A few key ideas can be derived from applying these parallelism structures to problems in intensive data processing:

Figure 1. Types of parallelism

  1. Dividing problems is the essence of parallelism. Division into independent subproblems gives independent parallelism, while dividing into incremental computations gives pipeline parallelism. Set mappings naturally adapt to independent parallelism (the same instruction is independently applied to each element of a set) while stream mappings adapt to pipeline parallelism(some instructions are successively applied to each element of a stream).Thus, sets and streams suggest a divide-and-conquer format for specifying mappings which is implicitly also a format for specifying parallelism.
  2. Divide-and-conquer computations can be represented by combining these types of parallelism. “Dividing” a problem is represented by fan-out nodes in the graph, while conquering gathers results into a set (with independent parallelism), a stream (with pipeline parallelism), and/or an aggregate (with fan-in parallelism). Thus, divide-and-conquer solutions of problems naturally capture these kinds of parallelism.
  3. Relational algebra operators can often be naturally expressed as divide-and-conquer computations.

These ideas raise hope for a parallel data processing system that rests upon divide-and-conquer techniques. However, such a system must deal with several technical issues to be viable. A first problem is that the relational model offers no way to talk about order among data (e.g., sorted relations, or ordered tuples). Relational languages are therefore inadequate for specifying “stream processing” in which ordered sequences of data are processed sequentially. Hence, streams cannot be exploited to specify pipeline parallelism following a data-parallelism paradigm.Pipeline parallelism is generally used, transparently to the user, in lower-level languages implementing relational algebra (e.g., PLERA  or PFAD).

A second problem is that parallel data processing requires effective data partitioning capabilities. Typically, a relational query (select-project-join expression)is translated into a low-level form of relational algebra with explicit (low-level)parallel constructs. Data partitioning is used to spread the computation of relational algebra operators among parallel processors. This partitioning is typically defined during the physical database design and then exploited by a compiler. Most of the time, a partitioned computation requires that processors exchange intermediate results in order to compute the final result.

Ideally, data partitioning must be expressible by the programmer or the compiler within a parallel database language. This is essential to automatically extract parallelism and lead to efficient implementations on parallel database systems.Specifying parallel computations over relations often requires specifying how data partitioning (fan-out parallelism) will be done and how distributed results will be collected (fan-in parallelism). Database models have been developed before that permit expression of both ordering among tuples and data partitioning. For example, the FAD language of Bubba has operators that express various forms of fan-out and fan-in parallelism. FAD is a strongly typed set-oriented database language based on functional programming and relational algebra. It provides a fixed set of higher-order functions to aggregate functions, like the pump parametrized aggregate operator and the grouping operator. The pump operator applies a unary function to each element of a set, producing an intermediate set which is then “reduced” to a single datum using a binary function that combines the intermediate set elements. Indeed, pump naturally expresses a special case of fan-out and fan-in parallelism. At the same time, the group operator permits set partitioning.

The SVP model goes one step further in allowing sets, streams and parallelism to be captured in a unified framework formalizing divide-and-conquer mappings. SVP models collections, whose specialization leads to sets or streams, as series-parallel graphs which ease expressing parallel data processing. An important class of queries called transducers generalizes aggregate operations and set or stream operations. They lead to high-level specification of independent and pipeline parallelism. Thus, SVP is a possible formal foundation for further research in parallel data programming languages.

Tags : , , , , , , , , , , ,

Parallel database system solution

Today, in a competitive world, enterprises of all kinds use and depend on timely available, up-to-date information. Information volumes are growing 25-35% per year and the traditional transaction rate has been forecast to grow by a factor of 10 over the next five years-twice the current trend in mainframe growth. In addition, there is already an increasing number of transactions arising from computer systems in business-to-business interworking and by intelligent terminals in the home, office or factory.

The profile of the transaction load is also changing as decision-support queries, typically complex, are added to the existing simpler, largely clerical workloads. Thus, complex queries such as those macro-generated by decision support systems or system-generated as in production control will increase to demand significant throughput with acceptable response times. In addition, very complex queries onvery large databases, generated by skilled staff workers or expert systems, may hurt throughput while demanding good response times.

From a database point of view, the problem is to come up with database servers that support all these types of queries efficiently on possibly very largeon-line databases. However, the impressive silicon technology improvements alone cannot keep pace with these increasing requirements. Microprocessor performance is now increasing 50% per year, and memory chips are increasing in capacity by a factor of 16 every six years. RISC processors today can deliver between 50 and 100 MIPS (the new 64 bit DEC Alpha processor is predicted to deliver 200 MIPS at cruise speed!) at a much lower price/MIPS than mainframe processors. This is in contrast to much slower progress in disk technology which has been improving by a factor of 2 in response time and throughput over the last 10 years. With such progress, the I/O bottleneck worsens with time.

The solution is therefore to use large-scale parallelism to magnify the raw power of individual components by integrating these in a complete system along with the appropriate parallel database software. Using standard hardware components is essential to exploit the continuing technology improvements with minimal delay. Then, the database software can exploit the three forms of parallelism inherent in data-intensive application workloads. Interquery parallelism enables the parallel execution of multiple queries generated by concurrent transactions. Intraquery parallelism makes the parallel execution of multiple, independent operations (e.g.,select operations) possible within the same query. Both interquery and intraquery parallelism can be obtained by using data partitioning. Finally, with intraoperation parallelism, the same operation can be executed as many sub-operations using function partitioning in addition to data partitioning. The set-oriented mode of database languages (e.g., SQL) provides many opportunities for intraoperation parallelism. For example, the performance of the join operation can be increased significantly by parallelism.

Tags : , , , , , , , , , ,