Posts Tagged ‘skew

Specialized Parallel Relational Operators

Some algorithms for relational operators are especially appropriate for parallel execution, either because they minimize data flow, or because they better tolerate data and execution skew. Improved algorithms have been found for most of the relational operators. The evolution of join operator algorithms is sketched here as an example of these improved algorithms.

Recall that the join operator combines two relations, A and B, to produce a third relation containing all tuple pairs from A and B with matching attribute values. The conventional way of computing the join is to sort both A and B into new relations ordered by the join attribute. These two intermediate relations are then compared in sorted order, and matching tuples are inserted in the output stream. This algorithm is called sort-merge join.

Many optimizations of sort-merger join are possible, but since sort has execution cost nlog(n), sort-merge join has an nlog(n) execution cost. Sort-merge join works well in a parallel dataflow environment unless there is data skew. In case of data skew, some sort partitions may be much larger than others. This in turn creates execution skew and limits speedup and scaleup. These skew problems do not appear in centralized sort-merge joins.

Hash-join is an alternative to sort-merge join. It has linear execution cost rather than nlog(n) execution cost, and it is more resistant to data skew. It is superior to sort-merge join unless the input streams are already in sorted order. Hash join works as follows. Each of the relations A and B are first hash partitioned on the join attribute. A hash partition of relation A is hashed into memory. The corresponding partition of table relation B is scanned, and each tuple is compared against the main-memory hash table for the A partition. If there is a match, the pair of tuples are sent to the output stream. Each pair of hash partitions is compared in this way.

The hash join algorithm breaks a big join into many little joins. If the hash function is good and if the data skew is not too bad, then there will be little variance in the hash bucket size. In these cases hash-join is a linear-time join algorithm with linear speedup and scaleup. Many optimizations of the parallel hash-join algorithm have been discovered over the last decade. In pathological skew cases, when many or all tuples have the same attribute value, one bucket may contain all the tuples. In these cases no algorithm is known to speedup or scaleup.

Tags : , , , , , , , , , , , , , , ,

Parallel Dataflow Approach to SQL Software

Terabyte online databases, consisting of billions of records, are becoming common as the price of online storage decreases. These databases are often represented and manipulated using the SQL relational model. A relational database consists of relations (files in COBOL terminology) that in turn contain tuples (records in COBOL terminology). All the tuples in a relation have the same set of attributes (fields in COBOL terminology).

Relations are created, updated, and queried by writing SQL statements. These statements are syntactic sugar for a simple set of operators chosen from the relational algebra. Select project, here called scan, is the simplest and most common operator – it produces a row-and column subset of a relational table. A scan of relation R using predicate P and attribute list L produces a relational data stream as output. The scan reads each tuple, t, of R and applies the predicate P to it. If P(t) is true, the scan discards any attributes of t not in L and inserts the resulting tuple in the scan output stream. Expressed in SQL, a scan of a telephone book relation to find the phone numbers of all people named Smith would be written:

SELECT  telephone_number  /* the output attribute(s) */

FROM  telephone_book  /* the input relation */

WHERE  last_name = ‘Smith’;  /* the predicate */

A scan’s output stream can be sent to another relational operator, returned to an application, displayed on a terminal, or printed in a report. Therein lies the beauty and utility of the relational model. The uniformity of the data and operators allow them to be arbitrarily composed into dataflow graphs. The output of a scan may be sent to a sort operator that will reorder the tuples based onan attribute sort criteria, optionally eliminating duplicates. SQL defines several aggregate operators to summarize attributes into a single value, for example, taking the sum, min, or max of an attribute, or counting the number of distinct values of the attribute. The insert operator adds tuples from a stream to an existing relation. The update and delete operators alter and delete tuples in a relation matching a scan stream.

The relational model defines several operators to combine and compare two or more relations. It provides the usual set operators union, intersection, difference, and some more exoticones like join and division. Discussion here will focus on the equi-join operator (here called join). The join operator composes two relations, A and B, on some attribute to produce a third relation. For each tuple, ta, in A, the join finds all tuples, tb, in B with attribute value equal to that of ta. For each matching pair of tuples, the join operator inserts into the output steam a tuple built by concatenating the pair. Codd, in a classic paper, showed that the relational data model can represent any form of data, and that these operators are complete. Today, SQL applications are typically a combination of conventional programs and SQL statements. The programs interact with clients, perform data display, and provide high-level direction of the SQL dataflow. The SQL data model was originally proposed to improve programmer productivity by offering a non-procedural database language. Data independence was and additional benefit; since the programs do not specify how the query is to be executed, SQL programs continue to operate as the logical and physical database schema evolves.

Parallelism is an unanticipated benefit of the relational model. Since relational queries are really just relational operators applied to very large collections of data, they offer many opportunities for parallelism. Since the queries are presented in a non-procedural language, they offer considerable latitude in executing the queries. Relational queries can be executed as a dataflow graph. As mentioned in the introduction, these graphs can use both pipelined parallelism and partitioned parallelism. If one operator sends its output to another, the two operators can execute in parallel giving potential speedup of two.

The benefits of pipeline parallelism are limited because of three factors: (1) Relational pipelines are rarely very long – a chain of length ten is unusual. (2) Some relational operators donot emit their first output until they have consumed all their inputs. Aggregate and sort operators have this property. One cannot pipeline these operators. (3) Often, the execution cost of one operator is much greater than the others (this is an example of skew). In such cases, the speedup obtained by pipelining will be very limited. Partitioned execution offers much better opportunities for speedup and scaleup. By taking the large relational operators and partitioning their inputs and outputs, it is possible to use divide-and-conquer to turn one big  job into many independent little ones. This is an ideal situation for speedup and scaleup. Partitioned data is the key to partitioned execution.

Tags : , , , , , , , , , , , , , ,