Phase | T1 | T2 | T3 |
---|---|---|---|
A | 7 6 8 4 |
9 5 |
|
B | 7 6 |
9 8 5 4 |
|
C | 9 8 5 4 |
7 6 |
|
D | 9 8 7 6 5 4 |
Several interesting details have been omitted from the previous illustration. For example, how were the initial runs created? And, did you notice that they merged perfectly, with no extra runs on any tapes? Before I explain the method used for constructing initial runs, let me digress for a bit.
In 1202, Leonardo Fibonacci presented the following exercise in his Liber Abbaci (Book of the Abacus): "How many pairs of rabbits can be produced from a single pair in a year's time?" We may assume that each pair produces a new pair of offspring every month, each pair becomes fertile at the age of one month, and that rabbits never die. After one month, there will be 2 pairs of rabbits; after two months there will be 3; the following month the original pair and the pair born during the first month will both usher in a new pair, and there will be 5 in all; and so on. This series, where each number is the sum of the two preceeding numbers, is known as the Fibonacci sequence:
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... .Curiously, the Fibonacci series has found wide-spread application to everything from the arrangement of flowers on plants to studying the efficiency of Euclid's algorithm. There's even a Fibonacci Quarterly journal. And, as you might suspect, the Fibonacci series has something to do with establishing initial runs for external sorts.
Recall that we initially had one run on tape T2, and 2 runs on tape T1. Note that the numbers {1,2} are two sequential numbers in the Fibonacci series. After our first merge, we had one run on T1 and one run on T2. Note that the numbers {1,1} are two sequential numbers in the Fibonacci series, only one notch down. We could predict, in fact, that if we had 13 runs on T2, and 21 runs on T1 {13,21}, we would be left with 8 runs on T1 and 13 runs on T3 {8,13} after one pass. Successive passes would result in run counts of {5,8}, {3,5}, {2,3}, {1,1}, and {0,1}, for a total of 7 passes. This arrangement is ideal, and will result in the minimum number of passes. Should data actually be on tape, this is a big savings, as tapes must be mounted and rewound for each pass. For more than 2 tapes, higher-order Fibonacci numbers are used.
Initially, all the data is on one tape. The tape is read, and runs are distributed to other tapes in the system. After the initial runs are created, they are merged as described above. One method we could use to create initial runs is to read a batch of records into memory, sort the records, and write them out. This process would continue until we had exhausted the input tape. An alternative algorithm, replacement selection, allows for longer runs. A buffer is allocated in memory to act as a holding place for several records. Initially, the buffer is filled. Then, the following steps are repeated until the input is exhausted:
Figure 4-2 illustrates replacement selection for a small file. The beginning of the file is to the right of each frame. To keep things simple, I've allocated a 2-record buffer. Typically, such a buffer would hold thousands of records. We load the buffer in step B, and write the record with the smallest key (6) in step C. This is replaced with the next record (key 8). We select the smallest key >= 6 in step D. This is key 7. After writing key 7, we replace it with key 4. This process repeats until step F, where our last key written was 8, and all keys are less than 8. At this point, we terminate the run, and start another.
Step | Input | Buffer | Output |
---|---|---|---|
A | 5-3-4-8-6-7 | ||
B | 5-3-4-8 | 6-7 | |
C | 5-3-4 | 8-7 | 6 |
D | 5-3 | 8-4 | 7-6 |
E | 5 | 3-4 | 8-7-6 |
F | 5-4 | 3 | 8-7-6 | |
G | 5 | 4-3 | 8-7-6 | |
H | 5-4-3 | 8-7-6 |
When selecting the next output record, we need to find the smallest key >= the last key written. One way to do this is to scan the entire list, searching for the appropriate key. However, when the buffer holds thousands of records, execution time becomes prohibitive. An alternative method is to use a binary tree structure, so that we only compare lg n items.