CPS222 Lecture: Sorting                                  Last Revised 3/13/2015

Objectives:

1. To introduce basic concepts (what do we mean by sorting? internal and
   external sorts; stability)
2. To introduce common internal sorting algorithms.
3. To introduce the basic external merge sort algorithm.
4. To prove that sorting by comparison is omega(n log n)

Materials: 

1. Copy of Knuth volume 3 to show
2. Projectable of various internal sorting algorithms
3. Handout with above code
4. Projectable of bubble sort tree for three items that are permutations of ABC

I. Introduction
-  ------------

   A. The topic of sorting is a very important one in the area of data
      structures and algorithms, because many computer applications make use
      of sorting in some form or fashion.  As a result, this area has been
      studied extensively, and numerous algorithms with varying performance
      strengths have been developed.

   B. The basic goal of sorting is fairly intuitive - arrange a group of
      items in ascending (or descending) order.  But there are a few
      nuances we need to consider.

      1. Sometimes, we are sorting items such as numbers or names, where
         the entire entity we are sorting also serves as the basis for the
         sort.  At other times, we are sorting complex structures based on
         one piece of information - traditionally called the SORT KEY - or
         just the key for short.  
         
         a. Sometimes, the same list of items may even be sorted using different
            sort keys at different times.

            Example: Suppose we create a class Student with instance variables
                     like the following:

                     id (an integer)
                     last name (a string)
                     first name (a string)
                     major (a string)
                     class year (an integer)
                     gpa (a real)

            If we have a list of Student objects, it is easy to imagine
            different circumstances under which we would want to use any of 
            these instance variables as a sort key - or perhaps even use last 
            name and first name together as a COMPOSITE KEY.  (Sort based on 
            last name, use first name to break ties.)
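
            A hypothetical sketch (not from the handout) of a C++ operator <
            implementing this composite key might look like:

                #include <string>

                struct Student {
                    int id;
                    std::string lastName;
                    std::string firstName;
                    std::string major;
                    int classYear;
                    double gpa;
                };

                // Order by last name; break ties on first name.
                bool operator<(const Student & a, const Student & b) {
                    if (a.lastName != b.lastName)
                        return a.lastName < b.lastName;   // primary key
                    return a.firstName < b.firstName;     // tie-breaker
                }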
            
         b. For simplicity, we will discuss algorithms for sorting a list that
            is "all key" - but the same principles could be used for sorting a
            list of objects where one field is the sort key.

      2. It turns out that sorting algorithms are generic in the sense that
         (in almost every case) a given algorithm will work with any type of 
         sort key that is comparable.

         a. Basically, a C++ type or class is comparable if it defines an 
            operator <.  (This includes numeric types, strings, and any
            class for which the class author defines operator <).

         b. Java has a similar notion with an interface  called Comparable.  
            To implement Comparable, a class must have a method called compareTo 
            which, when applied to another object of the same class, 
            returns a negative value if it is less than the other object, 0 if 
            it is equal, and a positive value if it is greater than the other 
            object.

            Note that in Java sorting algorithms are implemented slightly
            differently for sorting objects of primitive type (like ints) - 
            where the built-in operator < is used - and for sorting objects of 
            class type (including String) - where the class must implement 
            Comparable and the compareTo() method is used in the algorithm.

         c. As we develop our algorithms, we won't worry about how the sort
            key is actually defined for the objects we are sorting - we only
            require that the objects to which our sorting algorithms are
            applied somehow define a comparison operation (<) that is
            meaningful for objects of that type.  We'll use < as the comparison
            operator without regard to language-specific nuances.

         d. Note, too, that we will discuss algorithms in terms of sorting
            into ascending order.  The same algorithms can be used for
            sorting into descending order, except that we reverse the order of
            comparison.

      3. In general, we can consider implementing a sorting operation for
         almost any sort of collection: an array, a vector, a linked list, 
         etc. Traditionally, sort algorithms have been formulated on
         the assumption that the items being sorted are in an array whose
         elements are accessible by operator [].  We'll use array terminology,
         recognizing that the collection we are sorting is not
         necessarily an array.

      4. We are now prepared to define what we mean when we say that an
         array is sorted: we say that an array of entities x[0 .. n-1] is
         sorted if x[0] <= x[1] <= x[2] ... <= x[n-1] - or, more simply,
         x[i] <= x[j] whenever 0 <= i <= j <= n-1.

      5. We can also define what we mean by sorting an array: we sort an
         array by permuting its elements in such a way as to produce a
         sorted array.

         Note that we explicitly require that the sorted array be a permutation
         of the original array.  This precludes the use of a simplistic "sort"
         algorithm like the following:

         for (int i = 0; i < n; i ++)
             x[i] = x[0];

         This results in a sorted array by our definition, but we don't
         consider this a proper sorting algorithm because the result is not -
         in general - a permutation of what we started with!

   C. We will study a variety of sorting algorithms, because there is no one
      best algorithm for all cases.

      1. Different algorithms work best for different SIZE arrays.

         a. For all but very large arrays, we will typically use an INTERNAL
            SORT, in which the items to be sorted are kept in main memory.
            For very large arrays, we will have to use an EXTERNAL SORT in
            which the items to be sorted reside in secondary storage (disk
            or tape) and are brought into main memory to be sorted a few at
            a time.

            i. Internal sorts are much faster than external sorts, because the
               access time of external storage devices is orders of magnitude
               greater than that for main memory.  

           ii. However, internal sorts are limited as to the amount of data
               they can handle by available main memory, while external sorts
               are limited by available external storage (which is generally
               orders of magnitude bigger.)  (Note, too, that, with virtual 
               memory, main memory appears almost boundless; but if the amount
               of memory in use becomes too great, then paging begins to occur,
               and the performance of the internal sort begins to deteriorate.)
               
               As memory sizes have grown, external sorting has become
               unnecessary for many applications, but it is still important
               for applications that deal with big data.

         iii. Often, an external sorting algorithm will make use of an internal
              sort, done on a portion of the data at one time, to give it a
              "head start".

              (We will focus on internal sorts first, and then will talk about
               external sorts.)

         b. Among internal sorts, there are several algorithms with theta(n^2)
            behavior, and several with theta(n log n) behavior.  Interestingly,
            for sufficiently small arrays, a theta(n^2) algorithm may be
            faster than a theta(n log n) algorithm, due to a smaller constant
            of proportionality.  Moreover, when implementing a recursive "divide
            and conquer" theta(n log n) algorithm, it is common to switch to
            using a theta(n^2) algorithm when the pieces become sufficiently
            small.

      2. Some algorithms are quite sensitive to the presence of some
         initial order in the items being sorted.  Some algorithms do
         better when this is the case; others actually do worse (they
         work best on totally random data.)

      3. Some algorithms require significant additional space beyond that
         needed to store the actual data to be sorted; others require very
         little additional space.  Extra space required can range from
         theta(1) to theta(n) extra space.

      4. In some cases, STABILITY of the algorithm is an important
         consideration.

         a. The issue of stability arises if we are sorting an array where
            duplicate keys are allowed - i.e. two (or more) entries may
            legally have the same key.  (Example: sorting a list of people
            by last name.)

         b. A sort is said to be STABLE if two records having the same key
            value are guaranteed to be in the same relative order in the
            output as they were in the input.

         c. Example: Suppose we were sorting bank transactions, each
            consisting of an account id, transaction code, and amount - e.g.

                5437    D       100.00
                1234    D        50.00
                5437    W        50.00
                1234    W        20.00

            (where the sort key is just the account number.)

            i. A stable sorting algorithm would be guaranteed to produce:

                1234    D        50.00
                1234    W        20.00
                5437    D       100.00
                5437    W        50.00

           ii. While an unstable one might produce the above, or any of
               the following instead:

                1234    W        20.00
                1234    D        50.00
                5437    D       100.00
                5437    W        50.00

                1234    D        50.00
                1234    W        20.00
                5437    W        50.00
                5437    D       100.00

          or    1234    W        20.00
                1234    D        50.00
                5437    W        50.00
                5437    D       100.00

               Here, the stable sort might be necessary to ensure correctness
               if one of the withdrawal transactions, in fact, represents a
               withdrawal against the funds deposited earlier - i.e. there was
               not enough money in the account to cover the withdrawal before
               the deposit was made.

         d. Stability is never an issue if the sort keys are guaranteed to
            be unique - i.e. no two items can have the same value of the key.

   D. A classic work on sorting is Donald Knuth: The Art of Computer
      Programming volume 3: Sorting and Searching.

II. Approaches to Internal Sorting 
--  ---------- -- -------- -------

   We will begin by considering sorting algorithms that are primarily used
   for internal sorts.  There are a number of basic approaches to sorting, 
   including the following (classification from Knuth volume 3).  (For
   consistency, I will illustrate each with sample code that sorts an array
   of strings - but the algorithm is the same regardless of what one is
   sorting)
   
   DISTRIBUTE HANDOUT

   A. Sorting by insertion:

        for (int i = 1; i < n; i ++)
           insert the ith entry from the original array into a
              sorted subtable composed of entries 0..i-1

      1. Demonstrate with class

      2. Many texts have an algorithm for a straight insertion sort.

         a. Example Code: PROJECT/HANDOUT
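
            Since the handout isn't reproduced in these notes, here is a
            minimal sketch (not the actual handout code) of straight
            insertion sort on an array of strings:

                #include <string>

                // On pass i, x[0..i-1] is already sorted; slide x[i]
                // left until it is in place.
                void insertionSort(std::string x[], int n) {
                    for (int i = 1; i < n; i ++) {
                        std::string toInsert = x[i];
                        int j = i;
                        while (j > 0 && toInsert < x[j-1]) {
                            x[j] = x[j-1];    // shift larger items right
                            j --;
                        }
                        x[j] = toInsert;
                    }
                }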

         b. Analysis:

            ASK

            theta(n^2)

         c. What will happen if used on already sorted data?

            ASK

            Time becomes theta(n), because the inner loop terminates
            immediately on each pass through the outer loop.  This is a
            peculiar characteristic of this algorithm which makes it
            advantageous in cases where there is a significant probability
            that the data will already be in order.

      3. The Shell sort is an insertion sort with behavior approximately 
         O(n^1.26) - closed form analysis being very difficult.  (We won't
         discuss)

      4. Another variant of insertion sort is address calculation sort.

         a. This builds on the idea that if we are manually sorting a pile
            of papers, and we see a paper with a lastname beginning with
            'B', we automatically start looking for its place near the beginning
            of the pile; if it begins with 'T', we look near the end;
            and if it begins with 'M' we look near the middle.

         b. One approach is to conduct insertion sort with several lists,
            instead of one, each corresponding to a certain range of
            key values (e.g. A-C, D-F ...).  An item is inserted using
            the methods of insertion sort into the appropriate list, and
            then they are all combined at the end.
            
         c. We won't develop further

      5. Simple insertion sort and address calculation sort are stable,
         but Shell sort is not.

   B. Sorting by exchanging:

        scan the table repeatedly (by some scheme), looking for
         items that are not in the correct sequence vis-a-vis each
         other, and exchange them.

      1. Almost every intro computer science text discusses the bubble sort,
         which is an exchange sort.

         a. Example Code: PROJECT / HANDOUT
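
            A minimal sketch (not the actual handout code) of bubble sort
            on an array of strings:

                #include <string>
                #include <utility>   // std::swap

                // Each pass through the inner loop bubbles the largest
                // remaining item into position i.
                void bubbleSort(std::string x[], int n) {
                    for (int i = n - 1; i > 0; i --)
                        for (int j = 0; j < i; j ++)
                            if (x[j+1] < x[j])
                                std::swap(x[j], x[j+1]);
                }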

         b. Analysis?

            ASK

            theta(n^2) - but with a larger constant of proportionality than
            insertion sort, because multiple exchanges can be done on each
            pass through the outer loop, while insertion sort does simple
            data movements rather than exchanges.

         c. What will happen if used on already sorted data?

            ASK

            In this case, there is no asymptotic gain (though no exchanges
            are done, so the overall time is better by a constant factor.)

            There are improvements to the algorithm that terminate early if
            no exchanges are done on some pass, yielding potentially theta(n)
            behavior on sorted data.

         d. The chief reason for this sort being so widely known is that
            the code is so simple.

      2. Quicksort 

         a. The basic idea is this:

            i. Choose an arbitrary element of the list as the pivot element.

           ii. Rearrange the list as follows:

                keys <= pivot   pivot   keys >= pivot

                (Note that a key that is equal to the pivot can end up in
                 either half)

          iii. Sort the two partitions recursively

         b. CODE - PROJECT Goodrich/Tamassia Code Fragment 11.5

            i. This version makes the arbitrary choice of using the last 
               element as the pivot.

           ii. Note that we consider this sort to be an exchange sort because
               of the method used to do the partitioning.
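
            A sketch in the same spirit (not the textbook's exact code),
            using the last element as the pivot:

                #include <string>
                #include <utility>

                // Sorts x[lo..hi] in place.  Initial call:
                // quickSort(x, 0, n-1).
                void quickSort(std::string x[], int lo, int hi) {
                    if (lo >= hi) return;          // 0 or 1 items - done
                    std::string pivot = x[hi];     // last element as pivot
                    int left = lo, right = hi - 1;
                    while (left <= right) {
                        // scan from both ends for out-of-place items
                        while (left <= right && x[left] < pivot) left ++;
                        while (left <= right && pivot < x[right]) right --;
                        if (left <= right)
                            std::swap(x[left ++], x[right --]);
                    }
                    std::swap(x[left], x[hi]);     // pivot to final place
                    quickSort(x, lo, left - 1);
                    quickSort(x, left + 1, hi);
                }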

         c. Analysis: We consider average case and worst case separately:

            i. Average case - we expect each partition to divide the list 
               roughly in half.  We can thus picture the partitioning process 
               by a tree like the following:

                                       n items

                               n/2              n/2

                       n/4      n/4             n/4     n/4

                       ....................................

               1 1 1 ......................................... 1 1 1

                - At each "level", we must examine all n items to create
                  the next level of partitions.  There are log n levels -
                  therefore QuickSort is O(n log n), average case.

           ii. In the worst case, QuickSort is not so good, however.
               Consider the behavior for a list that is exactly backward.

               - The first partition produces sublists of 0 and n-1 items.
               - The second produces sublists of 0 and n-2
               ...
               - Therefore, there are n levels of partitioning, each
                 examining theta(n) items - therefore the worst case for
                 QuickSort is theta(n^2).

               What about the case where the list is already sorted to
               begin with? Paradoxically, this too turns out to be theta(n^2).

          d. We can reduce the likelihood of worst case behavior by improving
             the way we select the pivot element.

             i. Ideally, the key we use as the pivot should be the median
                of the items in the list.  In practice, this involves
                either sorting the list, or using a rather complex theta(n)
                algorithm which we won't discuss.

            ii. One simple improvement is as follows: instead of choosing
                the item in a fixed position (such as the last) as the basis
                for partitioning, choose the median of the (physically)
                first, (physically) middle, and (physically) last.  (Worst
                case behavior can still occur, but not with the case of a
                backward or an already ordered list.)  A sketch appears
                after this list.

           iii. If our major concern is with avoiding the worst-case behavior
                that comes when the data is already sorted or reverse-sorted,
                we can also select a pivot randomly from among all the items
                being worked on - which may be somewhat simpler to implement.
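
             A hypothetical sketch of the median-of-three improvement,
             assuming the last-element partitioning scheme sketched earlier:

                 #include <string>
                 #include <utility>

                 // Reorder x[lo], x[mid], x[hi] so that their median lands
                 // in x[hi], ready to serve as the pivot.
                 void medianOfThreeToEnd(std::string x[], int lo, int hi) {
                     int mid = lo + (hi - lo) / 2;
                     if (x[mid] < x[lo]) std::swap(x[lo], x[mid]);
                     if (x[hi] < x[lo])  std::swap(x[lo], x[hi]);
                     if (x[hi] < x[mid]) std::swap(x[mid], x[hi]);
                     // now x[lo] <= x[mid] <= x[hi]
                     std::swap(x[mid], x[hi]);
                 }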

         e. In practice, QuickSort is often improved by switching to another
            method (e.g. insertion sort) when the size of the sublist to
            be sorted falls below some threshold.  That is, the recursive
            calls might be coded as follows:

               Present code:

                    if (size <= 1)
                        ; // Do nothing
                    else
                    {
                        ... Quick sort code

               Modified code:

                    if (size <= 1)
                        ; // Do nothing
                    else if (size < threshold)
                    {
                        ... Insertion sort code
                    }
                    else
                    {
                        ... Quick sort code

         f. One other point to note about quicksort is that, due to the
            recursion, it does require additional memory for the stack.

            i. The amount of additional memory needed will vary from O(log n)
               [if each partitioning roughly divides the list in two] to
               O(n) [in the pathological cases where each partitioning
               produces one sublist that is smaller by just 1 item than
               the list that was partitioned.]

           ii. The stack growth can be kept to O(log n) in all cases as
               follows: always sort the smaller of the two sublists first,
               and use tail recursion optimization on the second call in
               each case.

      3. The bubble sort is stable, but quicksort is not.

   C. Sorting by selection:

        for (i = 0;  i < n; i ++)
           select the smallest (largest) item from those still under
              consideration, put it in the right place, and remove it from 
              consideration on further passes

      1. Demonstrate with class

      2. Many texts give an algorithm for a straight selection sort.

         a. Example Code: PROJECT / HANDOUT
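
            A minimal sketch (not the actual handout code) of straight
            selection sort on an array of strings:

                #include <string>
                #include <utility>

                // On pass i, find the smallest remaining item and swap
                // it into slot i - one data movement per pass.
                void selectionSort(std::string x[], int n) {
                    for (int i = 0; i < n - 1; i ++) {
                        int smallest = i;
                        for (int j = i + 1; j < n; j ++)
                            if (x[j] < x[smallest])
                                smallest = j;
                        std::swap(x[i], x[smallest]);
                    }
                }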

         b. Analysis:

            ASK

            theta(n^2).  Constant of proportionality tends to be better than
            insertion sort, because there is only one data movement done
            per pass through the outer loop.

         c. What will happen if used on already sorted data?

            ASK - Nothing is gained or lost

      3. Heapsort is a selection sort method
      
         a. The text discussed heapsort in conjunction with its discussion of
            heaps, though I postponed the reading of this material until now 

            1. We have already seen that it is possible to convert an array
               to a heap en masse in theta(n) time.  Suppose we were to build a 
               maxheap (largest item is on top of the heap.)  Clearly that item 
               belongs at the _end_ of a sorted version of the original array.  

            2. We have also seen that it is possible to remove the top item from
               a heap and replace it by its appropriate successor in 
               theta(log n) time.  

         b. This leads to the following approach to sorting:

            Convert the array into a maxheap

            for (i = 0; i < n; i ++)
                remove the top item from the heap and put it i slots from
                  the end of the sorted array; then readjust the heap

         c. Example code: PROJECT / HANDOUT

            Demonstrate phase 2 (after heap built) using student names
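
             A minimal sketch (not the actual handout code), using the
             standard library's heap operations:

                 #include <string>
                 #include <algorithm>

                 // make_heap builds a maxheap in theta(n); each pop_heap
                 // moves the top item to the end of the shrinking heap
                 // region in theta(log n).
                 void heapSort(std::string x[], int n) {
                     std::make_heap(x, x + n);     // phase 1: build maxheap
                     for (int i = n; i > 1; i --)
                         std::pop_heap(x, x + i);  // phase 2: top -> x[i-1]
                 }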

         d. Analysis

            ASK

            Since the first step takes theta(n) time and the loop does a
            theta(log n) operation n times, the total time will be
            theta(n) + n * theta(log n) = theta(n log n) 

      4. Neither simple selection sort nor heapsort is stable, though simple
         selection can be made stable at the cost of both extra time and space.

   D. Sorting by merging

      1. Suppose we have two sorted lists.  It is easy to merge them
         into a single sorted list in theta(n) time

         for (i = 0; i < n; i ++)
            choose the smaller item from the fronts of the two lists,
              and add it to the sorted list.  (If one list is empty,
              always take from the other list)

         a. Demonstration: merge two sorted lists of student names

         b. This leads to a recursive sorting strategy:

            - Split the data in half
            - Sort each half recursively
            - Merge the two sorted halves

         c. Example code: PROJECT / HANDOUT
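
             A minimal sketch (not the actual handout code) of merge sort
             on an array of strings:

                 #include <string>
                 #include <vector>

                 // Sorts x[lo..hi].  Initial call: mergeSort(x, 0, n-1).
                 void mergeSort(std::string x[], int lo, int hi) {
                     if (lo >= hi) return;
                     int mid = (lo + hi) / 2;
                     mergeSort(x, lo, mid);
                     mergeSort(x, mid + 1, hi);
                     std::vector<std::string> merged;  // theta(n) extra space
                     int i = lo, j = mid + 1;
                     while (i <= mid && j <= hi)
                         // <= breaks ties in favor of the left half,
                         // which preserves stability
                         merged.push_back(x[i] <= x[j] ? x[i ++] : x[j ++]);
                     while (i <= mid) merged.push_back(x[i ++]);
                     while (j <= hi)  merged.push_back(x[j ++]);
                     for (int k = lo; k <= hi; k ++)
                         x[k] = merged[k - lo];
                 }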

         d. Analysis:

            ASK

            Guaranteed theta(n log n) - by similar reasoning used to show
            that quick sort is theta(n log n) - but this time, we can
            guarantee perfect partitioning, so this asymptotic bound holds
            for all cases

          e. Moreover, if we break ties by always choosing from the list that
            came from nearer the start of the original list, we can guarantee
            that merge sort is stable.

         f. Unfortunately, we require theta(n) extra space to store the merged
            list - or we can use linked lists, which require theta(n) extra
            space for the links!

      2. We will see shortly that merge sorting is the basis for all external
         sorting strategies - though sometimes we sacrifice stability for 
         extra speed.

   E. Sorting by distribution:

      1. This works with a key of m "digits", using d "pockets" where
         d is the number of possible values a key digit may assume
         (e.g. 10 for a decimal key; 26 for an alphabetic key etc.)

            for (i = 0; i <  m;  i ++)
                distribute the file into d pockets based on the ith
                  key digit from the right
                reconstruct the file by appending the pockets to one another.

            Example: Assume we are sorting strings of three letters, drawn
                     from the alphabet ABCDE [so we need 5 pockets]

            Initial data:               CBD
                                        ADE
                                        CAD
                                        ADA
                                        BAD
                                        ACE
                                        BEE
                                        BED

            First distribution - on rightmost character:

                ADA     (empty) (empty) CBD     ADE
                                        CAD     ACE
                                        BAD     BEE
                                        BED

            Pickup left-to-right:       ADA
                                        CBD
                                        CAD
                                        BAD
                                        BED
                                        ADE
                                        ACE
                                        BEE

            Second distribution:

                CAD     CBD     ACE     ADA     BED
                BAD                     ADE     BEE

            Pick up:                    CAD
                                        BAD
                                        CBD
                                        ACE
                                        ADA
                                        ADE
                                        BED
                                        BEE

            Third distribution:

                ACE     BAD     CAD
                ADA     BED     CBD
                ADE     BEE

            Final pickup:               ACE
                                        ADA
                                        ADE
                                        BAD
                                        BED
                                        BEE
                                        CAD
                                        CBD

      2. Time complexity appears to be order(n*m) - but note that for n
         distinct keys the minimum value of m is log_d n - therefore, it
         is in fact theta(n log n), since log_2 n and log_d n have a
         constant ratio.

      3. Unfortunately, distribution sorting requires extra space; though the
         extra space requirements can be kept down by careful coding.

         a. If the "pockets" were represented by arrays, then we would
            need one array for each possible value of a digit - e.g.
            26 pockets if sorting based on letters of the alphabet.  Further,
            each pocket would need to be big enough to possibly hold
            all the data if, in fact, all the keys had the same value in
            one position.  Thus, we would need O(n) extra space, where the
            constant of proportionality would be huge.

          b. The extra space can be greatly reduced - though it is still
             O(n) - by representing the "pockets" as linked lists, using
             a table of links as in the previous example.  This is really
             the only practical way to go.

      4. Distribution sorting is always stable; in fact, it relies on
         the stability of later passes to preserve the work done on
         earlier ones.
         
      5. (No demo code for this one - but the book discusses briefly under
          the name "bucket sort")

   F. Sorting by enumeration:

                For each record, determine how many records have keys less
                  than its key.  We will call this value for the ith 
                  record count[i].

                Clearly, the record currently in position i actually belongs
                  in position count[i] + 1, so as a final step we put it there.

      1. Observe that this strategy is theta(n^2), and is stable [because
         if two records have equal keys, we increase the count of the one
         occurring physically later.]

      2. An interesting variant is possible if the set of possible keys is
         small (i.e. many items have the same key.)

         a. Example: Sort the students by academic class - using two arrays:

                 count[i] and position[i] (1 <= i <= 4) 

            i. We make one pass through all the students to calculate count[].

           ii. position[1] is set to 0

          iii. position[i] (2 <= i <= 4) is set to position[i-1] + count[i-1]

           iv. Make a second pass through all the students and place each
               one according to the current value of position[] for his/her
               class, then increment that position.  (A sketch follows.)
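
            A hypothetical sketch of this special case (assuming class
            years 1..4, as above; the struct name is illustrative only):

                #include <vector>

                struct Stu { int classYear; /* other fields */ };

                std::vector<Stu> sortByClass(const std::vector<Stu> & in) {
                    int count[5] = {0}, position[5] = {0};
                    for (const Stu & s : in)    // pass 1: count each class
                        count[s.classYear] ++;
                    position[1] = 0;
                    for (int i = 2; i <= 4; i ++)
                        position[i] = position[i-1] + count[i-1];
                    std::vector<Stu> out(in.size());
                    for (const Stu & s : in)    // pass 2: place, increment
                        out[position[s.classYear] ++] = s;
                    return out;                 // stable: ties keep order
                }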

         b. Analysis:

            ASK

            O(n)

            - but special case!
            
      3. (No demo code for this one)

   G. Summary 

      We can compare the internal sorting strategies we have looked at
      thus far by considering several attributes:

      1. Asymptotic complexity

      2. Behavior with already sorted data.

      3. Need for additional storage.

      4. Stability

        Algorithm       Asymptotic      Impact of       Extra           Stable?
                        Complexity      Sorted Data     Storage
        ---------       ----------      -----------     -------         ------

        Simple          theta(n^2)      becomes         minimal         yes
        Insertion                       theta(n).

        Bubble          theta(n^2)      can become      minimal         yes
                                        theta(n) 
                                        w/suitable coding

        Quicksort       theta(n log n)  can degenerate  theta(log n)    no
                        average -       to theta(n^2)   stack
                        can degenerate  unless avoided  for
                        to theta(n^2)   by coding       recursion

        Simple          theta(n^2)      little change   minimal         no
        Selection

        Heapsort        theta(n log n)  little change   minimal         no
                        always

        Merge Sort      theta(n log n)  little change   theta(n) for    yes
                                                        extra array
                                                        or at least
                                                        links

        Distribution    If keys are     little change   theta(n)        yes
        Sort            unique, ends
                        up theta (n log n)

        Enumeration     theta(n^2)      little change   theta(n) for    yes
        Sort                                            counts
                        -- can be theta(n)
                        for special case where potential
                        key values form a small set.

      The result of this analysis shows that there is no one algorithm
      that's best on all counts.  In particular, there is no known sorting
      algorithm that has all of the following characteristics: theta(n log n)
      asymptotic complexity, theta(1) extra space, and stability.  We can get
      any two of the three, but not all three!

III. How Fast Can We Sort?
---  --- ---- --- -- ----

   A. One observation one can make from the table we just considered is that
      the best general-purpose sorting algorithms have asymptotic complexity 
      theta(n log n).  Is this as good as we can do, or is it possible to find 
      an algorithm whose average case asymptotic complexity is less than 
      n log n?

   B. In the case of sorts based on binary comparisons, the answer to our
      question is no.  We will now prove the following theorem:

      Theorem: Any sort BASED ON BINARY COMPARISONS must perform at least
               on the order of n log n comparisons - i.e. it is omega(n log n).

   C. Proof: 

      1. Any sorting algorithm for sorting n items must be prepared to
         deal with all n! possible permutations of those items, and must
         deal with each permutation differently.

      2. Each comparison in the sort serves to partition the permutations
         into two classes - those passing the test, and those failing the
         test.

         Example: bubble sort of three items must deal with 6 permutations:

                  ABC ACB BAC BCA CAB CBA

         The first comparison checks to see if item[0] is <= item[1].  Three
         permutations pass this test: ABC, ACB, and BCA.  The other three
         (BAC, CAB, CBA) fail the test, necessitating an exchange.

      3. Each subsequent comparison partitions each of these classes
         further.

         Example: the second comparison checks to see if item[1] <= item[2].
         Of the three permutations passing the first test, one passes the
         second (ABC) and the other two do not.  Of the three permutations
         failing the first test - and after the exchange - only one passes
         the second test (BAC altered to ABC).
      4. After c comparisons, then, we have 2^c classes - some of which may
         be empty.

      5. At the completion of the sort, we must have at least n! classes -
         since each original permutation must be handled differently.

         Example: complete classification tree for bubble sort of 3 items:

                        item[0] <= item[1]
                        / no            \ yes   
        (BAC, CAB, CBA)                 (ABC, ACB, BCA)
               | become                        |
        (ABC, ACB, BCA)                        |
               |                               |
          item[1] <= item[2]              item[1] <= item[2]
         / no            \ yes            / no             \ yes
      (ACB, BCA)        (ABC)           (ACB, BCA)        (ABC)
          | become        |               | become          |
      (ABC, BAC)          |             (ABC, BAC)          |
          |               |               |                 |
  item[0] <= item[1] item[0] <= item[1] item[0] <= item[1] item[0] <= item[1]
     / no      \ yes    / no      \ yes  / no        \ yes  / no        \ yes
   (BAC)      (ABC)   (empty)   (ABC)   (BAC)      (ABC)  (empty)      (ABC)
     | becomes                            | becomes
   (ABC)                                (ABC)

         PROJECT
         
          After 3 comparisons, we have eight classes - 6 of which contain
          one item (corresponding to each of the 3! original permutations)
          and 2 of which are empty.

      6. Thus, we have 2^c >= n!, or c >= log(n!)

      7. However, by Stirling's approximation, n! ~ sqrt(2 pi n) * (n/e)^n

         so, taking logs base 2,

            log(n!) ~ 0.5(1 + log(pi) + log(n)) + n (log(n) - log(e))

                    = n log n - n log e + 0.5 log n + O(1)

                    = theta(n log n)

         so c >= log(n!) = omega(n log n)  - QED

      8. Note: our text argues that log(n!) is omega(n log n) in a different
         way - same conclusion, just a different way of getting there.

IV. External Sorting
--- -------- --------

   A. We have seen that the algorithms we use for searching tables stored on
      disk are quite different from those used for searching tables stored in
      main memory, because the disk access time dominates the processing time.

   B. For much the same reason, we use different algorithms for sorting
      information stored on disk than for sorting information in main memory.

      1. We call an algorithm that sorts data contained in main memory an
         INTERNAL SORTING algorithm, while one that sorts data on disk is
         called an EXTERNAL SORTING algorithm.

      2. In the simplest case - if all the data fits in main memory - we
         can simply read the data from disk into main memory, sort it using
         an internal sort, and then write it back out.

      3. The more interesting case - and the one we consider here - arises
         when the file to be sorted does not all fit in main memory.

      4. Historically, external sorting algorithms were developed in the context
         of systems that used magnetic tapes for file storage, and the 
         literature still uses the term "tape", even though files are most often
         kept on some form of disk.  It turns out, though, that the storage
         medium being used doesn't really matter because the algorithms we will
         consider all read/write data sequentially.

   C. Most external sorting algorithms are variants of a basic algorithm
      known as EXTERNAL MERGE sort.  Note that there is also an internal
      version of merge sort that we have considered.  External merging
      reads data one record at a time from each of two or more files, and
      writes records to one or more output files.  As was the case with
      internal merging, external merging is theta(n log n) for time, but 
      theta(n) for extra space, and (if done carefully) it is stable.

   D. First, though, we need to review some definitions:

      1. A RUN is a sequence of records that are in the correct relative order.

      2. A STEPDOWN normally occurs at the boundary between runs.  Instead
         of the key value increasing from one record to the next, it decreases.

         Example: In the following file: B D E C F A G H

                  - we have three runs (B D E, C F, A G H)

                  - we have two stepdowns (E C, F A)

      3. Observe that an unsorted file can have up to n runs, and up
         to n-1 stepdowns.  In general (unless the file is exactly
         backwards) there will be fewer runs and stepdowns than this,
         due to pre-existing order in the file.

      4. Observe that a sorted file consists of one run, and has no
         stepdowns.

   E. We begin with a variant of external merge sort that one would not use
      directly, but which serves as the foundation on which all the other
      variants build.  

      1. In the simplest merge sort algorithm, we start out by regarding
         the file as composed of n runs, each of length 1.  (We ignore any
         runs which may already be present in the file.)  On each pass, we 
         merge pairs of runs to produce runs of double length.

         a. After pass 1, we have n/2 runs of length 2.

         b. After pass 2, we have n/4 runs of length 4.

         c. The total number of passes will be ceil(log n).  [ Where ceil
            is the ceiling function - smallest integer greater than or equal
            to.]  After the last pass, we have 1 run of length n, as desired.

         d. Of course, unless our original file length is a power of 2, there
            will be some irregularities in this pattern.  In particular, we
            let the last run in the file be smaller than all the rest -
            possibly even of length zero.

            Example: To sort a file of 6 records:

            Initially:          6 runs of length 1
            After pass 1:       3 runs of length 2 + 1 "dummy" run of length 0
            After pass 2:       1 run of length 4 + 1 run of length 2
            After pass 3:       1 run of length 6

      2. We will use a total of three scratch files to accomplish the sort.

         a. Initially, we distribute the input data over two files, so that
            half the runs go on each.  We do this alternately - i.e. first
            we write a run to one file, then to the other - in order to
            ensure stability.

         b. After the initial distribution, each pass entails merging runs
            from two of the scratch files and writing the generated runs on
            the third.  At the end of the pass, if we are not finished, we
            redistribute the runs from the third file alternately back to the
            first two.

         Example: original file:        B D E C F A G H

                  initial distribution: B E F G         (File SCRATCH1)
                                        D C A H         (File SCRATCH2)

                  (remember we ignore runs existing in the raw data)

                  --------------------------------------------------
                  after first merge:    BD CE AF GH     (File SCRATCH3)
                                                                         PASS 1
                  redistribution:       BD AF           (File SCRATCH1)
                                        CE GH           (File SCRATCH2)

                  --------------------------------------------------
                  after second merge:   BCDE AFGH       (File SCRATCH3)
                                                                         PASS 2
                  redistribution:       BCDE            (File SCRATCH1)
                                        AFGH            (File SCRATCH2)

                  --------------------------------------------------
                  after third merge:    ABCDEFGH        (File SCRATCH3)  PASS 3

                  (no redistribution)

      3. Analysis of the basic merge sort

         a. Space: three files, one of length n and two of length n/2.  We
                   can use the output file as one of the scratch files, so
                   the total additional space is two files of length n/2

                   = total scratch space for n records

            In addition, we need internal memory for three buffers - one
            for each of the three files.  In general, each buffer needs to
            be big enough to hold an entire block of data (based on the
            blocksize of the device), rather than a single record.

         b. Time: 

            - Initial distribution involves n reads

            - Each pass except the last involves 2n reads due to merging
              followed by redistribution.  The last pass involves just n reads.

            - Total reads = 2 n ceil(log n), so total IO operations =
              4n ceil(log n)

   F. A significant improvement arises from the observation that our original
      algorithm started out assuming that the input file consists of
      n runs of length 1 - the worst possible case (a totally backward
      file.)  In general, the file will contain many runs longer than one
      just as a consequence of the randomness of the data, and we can use
      these to reduce the number of passes.

      1. Example: The sample file we have been using contains 3 runs, so
                  we could do our initial distribution as follows:

              initial distribution:    BDE AGH         (File SCRATCH1)
                                       CF              (File SCRATCH2)

              after first pass:        BCDEF           (File SCRATCH3)
                                       AGH             (File SCRATCH4)

              after second pass:       ABCDEFGH        (File SCRATCH1)

              (Note: we have assumed the use of a balanced merge; but
               a non-balanced merge could also have been used.)

      2. This algorithm is called a NATURAL MERGE.  The term "natural" 
         reflects the fact that it relies on runs naturally occurring in 
         the data.

      3. However, this algorithm has a quirk we need to consider.

         a. Since we merge one run at a time, we need to know where one run 
            ends and another run begins.  In the case of the previous 
            algorithms, this was not a problem, since we knew the size of
            each run.  Here, though, the size will vary from run to run.

            In the code we just looked at, the solution to this problem
            involved recognizing that the boundary between runs is marked by
            a stepdown.  Thus, each time we read a new record from an input
            file, we will keep track of the last key processed from that 
            file; and if our newly read key is smaller than that key, then we
            know that we have finished processing one run from that file.

            Example: in the initial distribution above, we placed two
                     runs in the first scratch file.  The space between
                     them would not be present in the file; what we
                     would have is actually BDEAGH.  But the run boundary
                     would be apparent because of the stepdown from E to A.

         b. However, if stability is important to us, we need to be very
            careful at this point.  In some cases, the stepdown between
            two runs could disappear, and an unstable sort could result.
            Consider the following file:

                F E D C B A M1 Z N M2           (where records M1 and M2 have
                                                 identical keys.)

                                                       ___ No stepdown here, so
                                                       |   2 runs look like one:
                                                       v
                Initial distribution:   F | D | B      | N
                                        E | C | A M1 Z | M2

                                        F | D | B N
                                        E | C | A M1 Z | M2

                First pass:             E F | A B M1 N Z
                                        C D | M2
                                            ^
                                            |___ No stepdown here, so two runs
                                                 look like one:

                                        E F    | A B M1 N Z
                                        C D M2 |

                Second pass:            C D E F M2
                                        A B M1 N Z

                Third pass:             A B C D E F M2 M1 N Z
                                                      ^
                                                      |
                                        In the case of equal keys, we take
                                        record from first scratch file before
                                        record from second, since first
                                        scratch file should contain records
                                        from earlier in original file.

         c. If stability is a concern, we can prevent this from occurring
            by writing a special run-separator record between runs in our
            scratch files.  This might, for example, be a record whose
            key is some impossibly big value like maxint or '~~~~~'.
            Of course, processing these records takes extra overhead
            that reduces the advantage gained by using the natural runs.

         d. Analysis:

            i. Space is the same as an ordinary merge if no run separator
               records are used.  However, in the worst case of a totally
               backward input file, we would need n run separator records
               on our initial distribution, thus potentially doubling the
               scratch space needed.

           ii. The time will be some fraction of the time needed by an
               ordinary merge, and will depend on the average length of
               the naturally occurring runs.

               - If the naturally occurring runs are of average length 2, then
                 we save 1 pass - in effect we start where we would be on the
                 second pass of ordinary merge.

                - In general, if the naturally occurring runs are of average
                 length m, we save at least floor(log m) passes.  Thus, if we 
                 use a balanced 2-way merge, our time will be

                        n (1 + ceil(log n - log m)) reads =

                        n (1 + ceil(log n/m)) reads or

                        2n (1 + ceil(log n/m)) IO operations

               - Of course, if run separator records are used, then we actually
                 process more than n records on each pass.  This costs
                 additional time for

                        n/m  reads on first pass
                        n/2m reads on second pass
                        n/4m reads on third pass
                        ...

                        = (2n/m - 1) additional reads, 

                        or about 4n/m extra IO operations

                - Obviously, a lot depends on the average run length in the
                  original data (m).  It can be shown that, in totally random
                  data, the average run length is 2 - which translates into
                  a savings of 1 merge pass, or 2n IO operations.  However, if
                  we use separator records, we would need 2n extra IO operations
                  to process them - so we gain nothing!  (We could still gain
                  a little bit by omitting separator records if stability were 
                  not an issue, though.)

                - In many cases, though, the raw data does contain considerable
                  natural order, beyond what is expected randomly.  In this
                  case, natural merging can help us a lot.

   G. Another improvement builds on the idea of the natural merge by using an 
      internal sort during the distribution phase to CREATE runs of some size.

      1. The initial distribution pass now looks like this - assuming we
         have room to sort s records at a time internally:

         while not eof(infile) do
            read up to s records into main memory
            sort them
            write them to one of the scratch files
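
         A hypothetical sketch of run generation (file names and the
         one-record-per-line format are assumptions, not part of the notes):

             #include <algorithm>
             #include <cstddef>
             #include <fstream>
             #include <string>
             #include <vector>

             // Read up to s records at a time, sort them internally, and
             // write each run to alternating scratch files.
             void generateRuns(const char * inName, const char * out1,
                               const char * out2, std::size_t s) {
                 std::ifstream in(inName);
                 std::ofstream scratch[2] = { std::ofstream(out1),
                                              std::ofstream(out2) };
                 int which = 0;
                 std::string record;
                 std::vector<std::string> run;
                 while (in) {
                     run.clear();
                     while (run.size() < s && std::getline(in, record))
                         run.push_back(record);
                     if (run.empty()) break;
                     std::stable_sort(run.begin(), run.end());
                     for (const std::string & r : run)
                         scratch[which] << r << '\n';
                     which = 1 - which;   // alternate files for the merge
                 }
             }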

      2. Clearly, the effect of this is to reduce the merge time to the
         fraction (log (n/s)) / (log n) of what it would otherwise be.  For
         example, if s = sqrt(n), we reduce the merge time by a factor of 2.
         The overall time is not reduced as much, of course, because

         a. The distribution pass still involves the same number of reads.

         b. We must now add time for the internal sorting!

         c. Nonetheless, the IO time saved makes internal run generation
            almost always worthwhile.

         Example: suppose we need to sort 65536 records, and have room
                  to internally sort 1024 at a time.

         - The time for a simple merge sort is

                65536 * (1 + log 65536) reads + the same number of writes

                = 65536 * 17 * 2 = 2,228,224 IO operations

         - The time with internal run generation is

                65536 * (1 + log(65536/1024)) reads + the same number of writes +
                  internal sort time
                = 65536 * 7 * 2 = 917,504 IO operations + 64 1024-record sorts

      3. This process is stable iff the internal sort used is stable.  If
         stability is not a concern, it is common to use an internal sort
         like quicksort.  (Note that a stable internal sort is either O(n^2),
         or it requires O(n) extra space, which cuts down on the size of the
         initial runs that can be created by internal sorting!)

V. Sorting with multiple keys
-  ------- ---- -------- ----

   A. Thus far, we have assumed that each record in the file to be sorted
      contains one key field.  What if the record contains multiple keys -
      e.g. a last name, first name, and middle initial?

      1. We wish the records to be ordered first by the primary key (last
         name).

      2. In the case of duplicate primary keys, we wish ordering on the
         secondary key (first name).

      3. In the case of ties on both keys, we wish ordering on the tertiary
         key (middle initial).

      etc - to any number of keys.

   B. The approach we will discuss here applies to BOTH INTERNAL AND EXTERNAL
      SORTS.

   C. There are two techniques that can be used for cases like this:

      1. We can modify an existing algorithm to consider multiple keys when
         it does comparisons - e.g.

         a. Original algorithm says:

                if (item[i].key < item[j].key)

         b. Revised algorithm says:

                if ( (item[i].primary_key <  item[j].primary_key) ||

                     ((item[i].primary_key == item[j].primary_key) &&
                      (item[i].secondary_key < item[j].secondary_key)) ||

                     ((item[i].primary_key == item[j].primary_key) &&
                      (item[i].secondary_key == item[j].secondary_key) &&
                      (item[i].tertiary_key < item[j].tertiary_key)) )
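
            (As an aside, not in the original notes: in modern C++ the
            same lexicographic test can be written more compactly with
            std::tie, from <tuple>:)

                if (std::tie(item[i].primary_key,
                             item[i].secondary_key,
                             item[i].tertiary_key) <
                    std::tie(item[j].primary_key,
                             item[j].secondary_key,
                             item[j].tertiary_key))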

      2. We can sort the same file several times, USING A STABLE SORT.

         a. First sort is on least significant key.
         b. Second sort is on second least significant key.
         c. Etc.
         d. Final sort is on primary key.
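
         A hypothetical sketch of this technique using the standard
         library's stable sort, with name fields like those in the example
         above:

             #include <algorithm>
             #include <string>
             #include <vector>

             struct Name { std::string last, first; char middle; };

             // Sort least significant key first, primary key last; each
             // pass is stable, so earlier orderings survive as tie-breaks.
             void sortByName(std::vector<Name> & v) {
                 std::stable_sort(v.begin(), v.end(),
                     [](const Name & a, const Name & b)
                         { return a.middle < b.middle; });
                 std::stable_sort(v.begin(), v.end(),
                     [](const Name & a, const Name & b)
                         { return a.first < b.first; });
                 std::stable_sort(v.begin(), v.end(),
                     [](const Name & a, const Name & b)
                         { return a.last < b.last; });
             }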

      3. The first approach is usable when we are embedding a sort in a
         specific application package; the second is more viable when we are
         building a utility sorting routine for general use [but note that we
         are now forced to a stable algorithm.]

VI. Pointer Based Sorting
--  ------- ----- -------

   A. When the items being sorted are large records (perhaps hundreds of
      bytes each), it may be desirable to use a pointer-based approach to
      reduce the time spent moving data.  The following are some variants
      on this theme.

   B. ADDRESS TABLE SORTING: we use an array of pointers P[1]..P[N].  Instead
      of physically rearranging the records (which is costly in terms of data
      movement time), we leave the records in their original place and
      sort the array of pointers so that: P[i]->key <= P[j]->key
      for all i <= j.
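
      A hypothetical sketch: build a table of pointers to the (unmoved)
      records, then sort just the pointers:

          #include <algorithm>
          #include <string>
          #include <vector>

          struct Record { std::string key; char payload[500]; };

          std::vector<Record *> addressTableSort(std::vector<Record> & recs) {
              std::vector<Record *> p;
              for (Record & r : recs)
                  p.push_back(&r);          // pointers, not copies
              std::sort(p.begin(), p.end(),
                  [](const Record * a, const Record * b)
                      { return a->key < b->key; });
              return p;    // p[i]->key <= p[j]->key for all i <= j
          }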

   C. KEY SORTING: if the key is short relative to the whole record, then
      we sort an array  consisting of keys plus pointers to the rest of the
      record, so that we only move keys and pointers, not whole records.
      At the very end of the sort, we may physically rearrange the records
      themselves.

   D. LIST SORTING: we keep the records on a linked list, and rearrange
      links rather than moving records.  (We will use this in several
      of the algorithms below.)  Again, at the very end of the sort, we may
      physically rearrange the records themselves.