CPS222 Lecture: Introduction to Trees and Forests
                                                        Last revised 1/18/2013

Objectives:

1. To define "tree" and "forest"
2. To introduce basic operations on trees (e.g. traversals)
3. To show how trees and forests can be represented as binary trees

Materials:

1. Excerpts from an "array of pointers to children" implementation (to project)
2. Excerpts from an "oldest child/next sibling" representation (to project)

I. Introduction
-  ------------

   A. Our discussion of data structures has focussed on sequential structures
      (arrays, stacks, queues, lists etc.).  Now we want to move to a 
      consideration of branching structures, in which each element of the 
      structure can have more than one "successor".

   B. The most general sort of branching structure is the graph, which we shall
      consider later.  First, though, we want to give considerable attention to
      a particularly useful class of branching structures: trees.

   C. Definition: A tree is a set of nodes, consisting of a special node - called
      the root - and 0 or more disjoint subsets, each of which is a tree.

      1. ex:                    A
                            /   |   \
                        B       C       E
                                |     /   \
                                D     F   G
                                          |
                                          H

        - the set of nodes A .. H is a tree.  A is the root, and the
          subtrees are B, C .. D, and E .. H.

          in the subtree B, B is the root and there are no subtrees.
          in the subtree C..D, C is the root, D is the subtree.  D in turn
            is the root of a tree with no subtrees
          in the subtree E..H, E is the root, F and G are the roots of two
            subtrees, one of which (F) has no subtrees of its own, and the
            other of which (G) has the subtree H.

      2. Note well the insistence that the subtrees be disjoint.  For
         example:
                        A
                      /   \
                     B     C
                      \   /  \
                        D     E

         is not a tree.

      3. This definition differs slightly from the one in the book - though
         it is basically saying the same thing
         
         a. A tree cannot be empty - it must at least have a root node.
         
         b. The first of the two definitions in the book was in terms of
            the parent relationship, rather than subtree.  (But the book also
            gave a second definition like the one above.)
        
   D. Some terminology:

      1. Tree terminology is borrowed from two portions of the natural world:

         a. Wood type trees: we speak of the "root" of a tree and of its
            "leaves".  We have already defined the notion of "root" (but
            notice that we draw it on the top, not on the bottom!)  A leaf of
            a tree is the root of a (sub)tree that has no subtrees of its own.

         b. Genealogical trees (family trees):

            (1) If A is the root of a tree and B is the root of one of its 
                subtrees, then we say that A is the "father" or "parent" of B, 
                and B is the "son" or "child" of A.  In the above:

                - A is the parent of B, C, and E.  B, C, and E are children of A.
                - C is the parent of D, D is the child of C.
                - E is the parent of F and G; F and G are children of E.

            (2) We can carry this further, speaking of A as the grandparent of
                D etc.  In general, we say that A is the "ancestor" of H and
                H is the "descendant" of A if H is in one of the subtrees of A.
                In the example above, B, C, D, E, F, G, and H are all 
                descendants of A.
 
            (3) If two nodes are the children of the same parent, we say that 
                they are "brothers" or "siblings" or (sometimes) "twins".  In 
                the above, B, C, and E are siblings, as are F and G.

            (4) We could go farther and use terms like "uncle" - but we
                seldom do.

      2. Additional terminology:

         a. The leaves of a tree are sometimes also called "external" or
            "terminal" nodes, and the non-leaf nodes can be called "internal"
            or "non-terminal" nodes.

         b. The "degree" of a node is the number of children it has.  (Note that
            we can then define a leaf as a node with degree 0.)  The degree of a
            tree is the maximum degree of any of its nodes.  In the above 
            example, the degree of A is three - and this also happens to be the 
            degree of the whole tree, since the next highest degree is two.  It 
            need not always be the case that the root has the highest degree.
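
            A minimal Python sketch of these two definitions.  The Node class
            and the function names are illustrative assumptions, not the
            course code:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def node_degree(node):
    """Degree of a node = the number of children it has."""
    return len(node.children)


def tree_degree(root):
    """Degree of a tree = the maximum degree of any of its nodes."""
    return max([node_degree(root)] + [tree_degree(c) for c in root.children])


# The example tree A(B, C(D), E(F, G(H)))
e = Node("E", [Node("F"), Node("G", [Node("H")])])
tree = Node("A", [Node("B"), Node("C", [Node("D")]), e])

print(node_degree(e))     # 2
print(tree_degree(tree))  # 3
```

            A leaf such as B has degree 0, so node_degree also serves as a
            leaf test.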

         c. A "path" from the root of a tree to a node is a sequence of nodes
            N_1 .. N_h such that N_1 is the root, N_h is the node in question,
            and N_i is the parent of N_(i+1) for all i, 1 <= i < h.  The length
            of a path is the number of EDGES traversed - i.e. one less than the
            number of nodes on the path.

         d. The "depth" or "level" of a node can be defined as follows:

            - The depth (level) of a node is its distance from the root - the
              length of a path from the root to it.
              
         or - equivalently:
         
            - The depth (level) of the root of a tree is zero.
            - The depth (level) of any other node is 1 + the depth (level) of 
              its parent.
            - In the above: A is at depth 0, B, C, and E at depth 1, D, F, and
              G at depth 2, and H at depth 3.

            But note: Some authors define the depth of the root of a tree to be
            1, not 0.  The effect, in the above example, would be to make each
            value one greater.
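
            The recursive form of the definition can be sketched in Python as
            follows.  The Node class and the depths function are illustrative
            assumptions; the root's depth is a parameter so that either
            convention can be used:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def depths(node, depth=0, table=None):
    """Return {value: depth} for every node, taking `depth` as the root's depth."""
    if table is None:
        table = {}
    table[node.value] = depth
    for child in node.children:
        depths(child, depth + 1, table)  # a child is one level below its parent
    return table


# The example tree A(B, C(D), E(F, G(H)))
tree = Node("A", [Node("B"),
                  Node("C", [Node("D")]),
                  Node("E", [Node("F"), Node("G", [Node("H")])])])

print(depths(tree))            # root at depth 0
print(depths(tree, depth=1))   # the alternative convention: root at depth 1
```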

         e. The "height" of a node is the length of the longest path from
            that node to a leaf.  The length can be measured by counting nodes
            or edges - which leads to two answers that differ by 1.
            
            - If we count edges, then leaf nodes have height 0.  
            - If we count nodes, leaf nodes have a height of 1.
            
            i. In either case, the height of any other node is 1 + the maximum 
               of the heights of its children.  The height of a tree is defined 
               to be the height of the root.

           ii. The book uses the "edges" form of definition, which leads to a
               single node tree (just a root) having a height of 0.  The "nodes"
               form of definition is more intuitive, I think.  For example, a
               single node tree would have a height of 1.
               
          iii. I'll use the latter definition in subsequent lectures.
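
            Both conventions can be sketched with one Python function.  The
            Node class, the height function, and its count_nodes flag are
            illustrative assumptions:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def height(node, count_nodes=True):
    """Height of a node: a leaf has height 1 (counting nodes) or 0 (counting edges)."""
    if not node.children:
        return 1 if count_nodes else 0
    # Any other node: 1 + the maximum of the heights of its children.
    return 1 + max(height(child, count_nodes) for child in node.children)


# The example tree A(B, C(D), E(F, G(H)))
tree = Node("A", [Node("B"),
                  Node("C", [Node("D")]),
                  Node("E", [Node("F"), Node("G", [Node("H")])])])

print(height(tree))                     # 4 (the longest path A, E, G, H has 4 nodes)
print(height(tree, count_nodes=False))  # 3 (the same path has 3 edges)
```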

      3. In drawing our tree examples, there has been an implicit left-to-right
         ordering of the children of a given parent.  In an actual tree, this
         ordering may or may not be important.  An "ordered" tree is one
         in which there is such an ordering imposed on the children of the
         same parent; in an "unordered" tree, no such relationship exists.

         a. Note that any practical scheme for representing a tree imposes an
            order.

         b. In our further discussion, we will work with ordered trees unless
            we explicitly say otherwise - though most of what we say about
            ordered trees applies equally to unordered trees.

         c. Sometimes, when we are thinking of a tree as an ordered tree,
            we will say of two siblings that the first is "older" than
            the second if the first is to the left of the second in our
            drawing.  We can then use the term "oldest child" to refer to
            the leftmost child of a node.  

            Example: In the tree we have been using for examples, B is
                     the oldest child of A, C the next oldest, and E the
                     youngest.

   E. To further generalize, we can define the concept of a "forest" as a
      set of 0 or more disjoint trees.  

      1. Example:
                        B       C       E
                                |     /   \
                                D     F   G
                                          |
                                          H

      2. Observe: we can convert a forest to a tree by adding a single node
         to serve as the root of a tree in which each of the original trees
         is a subtree:

        ex:                     A
                            /   |   \
                        B       C       E
                                |     /   \
                                D     F   G
                                          |
                                          H
 
      3. Conversely, deleting the root from a tree leaves behind a forest
         consisting of its subtrees.  (Obviously, this is how we got our
         forest from our original tree.)

   F. In writing about trees, we can adopt one of several systems of notation:

      1. The graph-like drawings we have been using thus far.

      2. Indentation:

         ex: Our original tree:

                A
                  B
                  C
                    D
                  E
                    F
                    G
                      H

         ex: Our forest:

                B
                C
                  D
                E
                  F
                  G
                    H

      3. Parentheses.  ex: our tree

                A(B, C(D), E(F, G(H)))
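
         The indentation and parenthesis notations can both be generated
         mechanically from a tree.  A minimal Python sketch follows; the Node
         class and the function names to_parens and to_indented are
         illustrative assumptions:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def to_parens(node):
    """Parenthesis notation: the root followed by its subtrees in parentheses."""
    if not node.children:
        return node.value
    return node.value + "(" + ", ".join(to_parens(c) for c in node.children) + ")"


def to_indented(node, level=0, step=2):
    """Indentation notation: one line per node, indented by its depth."""
    lines = [" " * (level * step) + node.value]
    for child in node.children:
        lines.extend(to_indented(child, level + 1, step))
    return lines


tree = Node("A", [Node("B"),
                  Node("C", [Node("D")]),
                  Node("E", [Node("F"), Node("G", [Node("H")])])])

print(to_parens(tree))               # A(B, C(D), E(F, G(H)))
print("\n".join(to_indented(tree)))
```

         Note that both functions visit the nodes in the same order: a node
         first, then each of its subtrees in turn.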

   G. Some uses of trees: Observe that a tree is a fundamentally hierarchical 
      structure.  Thus, a tree is appropriate to model any reality that
      exhibits hierarchy:

      1. File system directories are often tree-structured.

      2. Genealogical trees of all sorts: family relationships among
         individuals, tribes, languages etc.

      3. Classification systems:

         a. Taxonomic classification of plants and animals.

         b. Dewey decimal (or Library of Congress) classification of books.

      4. Breakdown of a manufactured product into subassemblies, each of
         which in turn consists of sub-subassemblies etc. down to the smallest
         components.

      5. Structure of a program - main routine is the root, procedures it
         contains are subtrees, each of which contains nested procedure
         definitions etc.

   H. Trees are also very useful for information storage and retrieval
      situations such as symbol tables, even though hierarchy may not be
      involved.

II. Operations on trees
--  ---------- -- -----

   A. As with any flexible data structure, there are many possible operations
      we could define on trees.  Certainly, we want a create operation - but
      note that there is no such thing as an empty tree!  So when we create
      a tree, we create a tree having at least one node - the root.

   B. The operation of insertion into a tree is certainly important, but
      depends heavily on the principle by which the nodes are organized.
      We defer discussion of insertion and deletion to discussion of various
      special kinds of tree organized on various principles.

   C. One class of operations that can be defined for all kinds of tree is
      traversal.  By "traversal", we mean the act of systematically
      "visiting" all of the nodes to perform some operation on them:

      1. Printing out the contents of all of the nodes, or performing some other
         operation on all the nodes, involves a traversal.

      2. Unless the tree is ordered somehow on the basis of some key,
         searching for a node containing a given value would involve a
         traversal (though in practice trees that are to be searched are
         usually structured in such a way as to avoid this.)

   D. One issue that arises in connection with traversal is the order of
      traversal.  Two orders are of particular importance:

      1. Preorder traversal:    Visit the root of the tree
                                Traverse each subtree in turn in preorder
         Example on the above:  A B C D E F G H

      2. Postorder traversal:   Traverse each subtree in postorder
                                Visit the root
         Example on the above:  B D C F H G E A
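
         The two orders differ only in where the visit happens relative to
         the recursive calls.  A minimal Python sketch, in which the Node
         class and the helper values_in are illustrative assumptions:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def preorder(node, visit):
    visit(node)                     # visit the root first ...
    for child in node.children:     # ... then traverse each subtree in turn
        preorder(child, visit)


def postorder(node, visit):
    for child in node.children:     # traverse each subtree first ...
        postorder(child, visit)
    visit(node)                     # ... then visit the root


def values_in(traversal, root):
    """Helper for the demonstration: the node values in the order visited."""
    seen = []
    traversal(root, lambda node: seen.append(node.value))
    return seen


tree = Node("A", [Node("B"),
                  Node("C", [Node("D")]),
                  Node("E", [Node("F"), Node("G", [Node("H")])])])

print(" ".join(values_in(preorder, tree)))   # A B C D E F G H
print(" ".join(values_in(postorder, tree)))  # B D C F H G E A
```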

   E. Of lesser importance is level order traversal: visit all the nodes
      on level zero, then all on level one etc.
      
        Example on the above:   A B C E D F G H
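
      Level order is not naturally recursive.  One common way to do it, shown
      here as a sketch rather than as the course's method, is to keep a queue
      of nodes waiting to be visited.  The Node class and the function name
      level_order are illustrative assumptions:

```python
from collections import deque


class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def level_order(root):
    """Return the node values level by level, left to right within a level."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()        # the oldest waiting node is visited next
        order.append(node.value)
        queue.extend(node.children)   # its children wait behind everything queued
    return order


tree = Node("A", [Node("B"),
                  Node("C", [Node("D")]),
                  Node("E", [Node("F"), Node("G", [Node("H")])])])

print(" ".join(level_order(tree)))   # A B C E D F G H
```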

   F. The above operations can be defined on a forest by mentally adding a
      root which is ignored when it comes time to visit it.

III. Representing Trees and Forests
---  ------------ ----- --- -------

   A. We have noted that a forest can be converted to a tree by adding a
      root.  Thus we focus on representing trees - to represent a forest,
      simply include a "root" as a header.

   B. One method is to use a linked representation in which each node contains
      pointers to its children.  This means that when we define the data type
      for a node, the degree of the tree determines the number of pointer
      fields needed.  Pointer fields in a given node that are not needed can
      be set to null.

      PROJECT: Array of pointers to children example - class Node

      1. Now, for example, we could implement operations on this tree as follows:

         a. preorder traversal:

            PROJECT: preorder

         b. postorder traversal could be written similarly.  What changes would
            be needed to turn the given preorder code into postorder?

            ASK

           - Change the name of the function!
           - Do the visit AFTER the recursive calls

         c. Reading a tree in from a text file.  Assume that the nodes of a
            tree have been written out, one node to a line, in pre-order.
            Assume each line contains the contents of the node and the number
            of its children.  

            ex:     The tree        A
                                 /  |  \
                                B   C   D
                                        /\
                                        E F
            would be stored as:    A 3
                                   B 0
                                   C 0
                                   D 2
                                   E 0
                                   F 0

            PROJECT readTree code
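
            The projected code is not reproduced in these notes.  As a
            stand-in, here is a minimal Python sketch of the same idea.  The
            Node class and the name read_tree are illustrative assumptions,
            and the sketch assumes the contents of a node contain no spaces:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


def read_tree(lines):
    """Build a tree from preorder lines of the form '<contents> <child count>'."""
    lines = iter(lines)               # a shared iterator: each call consumes its own lines
    value, count = next(lines).split()
    node = Node(value)
    for _ in range(int(count)):
        # Because the file is in preorder, the lines for each child's whole
        # subtree come immediately after the line for the node itself.
        node.children.append(read_tree(lines))
    return node


root = read_tree(["A 3", "B 0", "C 0", "D 2", "E 0", "F 0"])
print([child.value for child in root.children])              # ['B', 'C', 'D']
print([child.value for child in root.children[2].children])  # ['E', 'F']
```

            To read from an actual text file, an open file object can be
            passed in place of the list, since it also yields one line at a
            time.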

      2. However, this representation runs into a severe efficiency problem if
         the degree of the tree is large.  

         a. Thm: For a tree of degree d with n nodes, represented using the
            array of pointers to children representation, we will always have 
            n*(d-1) + 1 NULL pointers stored in the nodes.

            Pf: Each of the n nodes has room for d pointers - or n*d pointers
                in all.  Each node (except the root) is pointed to by exactly
                one of these.  So n-1 pointers are used to point to other
                nodes, leaving n*d - (n-1) = n*(d-1) + 1 NULL.

         b. For example, for a tree of degree 10 with 100 nodes, we waste 901
            pointers.
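
            The count can be checked on our original tree (degree 3, 8 nodes,
            so 8*(3-1) + 1 = 17 NULL pointers).  A minimal Python sketch,
            using None for NULL; the class FixedNode and the helper functions
            are illustrative assumptions:

```python
class FixedNode:
    """Hypothetical node with a fixed array of `degree` child pointers."""
    def __init__(self, value, degree):
        self.value = value
        self.children = [None] * degree   # d pointer slots, all NULL to start


def build(spec, degree):
    """Build a tree from a nested (value, [child specs]) description."""
    value, kids = spec
    node = FixedNode(value, degree)
    for i, kid in enumerate(kids):
        node.children[i] = build(kid, degree)
    return node


def count_nodes_and_nulls(node):
    """Return (number of nodes, number of NULL pointers) in the tree."""
    nodes, nulls = 1, 0
    for child in node.children:
        if child is None:
            nulls += 1
        else:
            n, z = count_nodes_and_nulls(child)
            nodes += n
            nulls += z
    return nodes, nulls


spec = ("A", [("B", []),
              ("C", [("D", [])]),
              ("E", [("F", []), ("G", [("H", [])])])])
tree = build(spec, degree=3)

n, nulls = count_nodes_and_nulls(tree)
print(n, nulls)                # 8 17
print(n * (3 - 1) + 1)         # 17, as the theorem predicts
print(100 * (10 - 1) + 1)      # 901, the degree-10, 100-node case
```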

   C. An alternate representation can be arrived at by using a linked list
      representation for the children of a node.  

      1. Each node holds two pointers.  One points to its oldest child.
         The other points to its next sibling (next younger node with the
         same parent.)

      2. Such a tree is actually a binary tree.  A binary tree is either 
         empty, or it consists of a root and exactly two disjoint sets of 
         nodes - designated the left subtree and the right subtree - each of
         which is a binary tree.  We will say more about binary trees in the
         next lecture - for now note that a binary tree is a different thing
         from a tree!

      3. The transformation from a general tree into an equivalent binary
         tree (oldest child/next sibling representation) can be done
         recursively, as follows:

         a. To transform a general tree rooted at a node A to its equivalent
            binary tree:

           - create a binary tree whose root is A.
           - transform the leftmost subtree of A in the general tree, and make
             this the left subtree of A in the binary tree.
           - transform the next sibling of A in the general tree, and make this
             the right subtree of A in the binary tree.

         b. ex: our original tree:

                A
              /
             B
              \
               C
              /  \
             D    E
                 /
                F
                 \
                  G
                 /
                H

         c. Note that you can visualize the shape of the original tree by 
            mentally rotating the binary equivalent 45 degrees counterclockwise.

         d. The same method can be applied to a forest - the right subtree of 
            the binary equivalent of the root of one of the trees is the 
            transformed version of the next tree in the forest.  We can see 
            what this would look like for our example forest by just deleting 
            the A node from the above tree.

         PROJECT: Code for Oldest child/next sibling representation - NODE class
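
         The recursive transformation described in 3.a can be sketched in
         Python as follows.  The classes Node and BinNode, the field names
         oldest_child and next_sibling, and the function to_binary are
         illustrative assumptions, not the projected course code:

```python
class Node:
    """Hypothetical general-tree node: a value and an ordered list of children."""
    def __init__(self, value, children=None):
        self.value = value
        self.children = list(children) if children else []


class BinNode:
    """Hypothetical oldest child/next sibling node: exactly two pointers."""
    def __init__(self, value):
        self.value = value
        self.oldest_child = None   # the "left" pointer
        self.next_sibling = None   # the "right" pointer


def to_binary(node, younger_siblings=()):
    """Transform `node`, followed by its younger siblings, into binary form."""
    b = BinNode(node.value)
    if node.children:
        # Leftmost subtree (with its siblings trailing) becomes the left subtree.
        b.oldest_child = to_binary(node.children[0], node.children[1:])
    if younger_siblings:
        # The next sibling (with the rest trailing) becomes the right subtree.
        b.next_sibling = to_binary(younger_siblings[0], younger_siblings[1:])
    return b


tree = Node("A", [Node("B"),
                  Node("C", [Node("D")]),
                  Node("E", [Node("F"), Node("G", [Node("H")])])])

a = to_binary(tree)
c = a.oldest_child.next_sibling
print(a.oldest_child.value, c.value, c.oldest_child.value, c.next_sibling.value)  # B C D E
```

         A forest held as a list of trees could be handled the same way, by
         calling to_binary(forest[0], forest[1:]).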

      4. Note that this representation dramatically decreases the number
         of NULL pointers.  By the same reasoning we used previously: the n
         nodes hold 2n pointers in all, and n-1 of them point to other nodes,
         so an n-node tree contains just 2n - (n-1) = n + 1 NULL pointers.
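
         This count can be checked on the binary equivalent of our original
         tree (8 nodes, so 9 NULL pointers).  A minimal Python sketch, using
         None for NULL; the class BinNode and the function
         count_nodes_and_nulls are illustrative assumptions:

```python
class BinNode:
    """Hypothetical oldest child/next sibling node: exactly two pointers."""
    def __init__(self, value, oldest_child=None, next_sibling=None):
        self.value = value
        self.oldest_child = oldest_child
        self.next_sibling = next_sibling


def count_nodes_and_nulls(node):
    """Return (number of nodes, number of NULL pointers) reachable from `node`."""
    if node is None:
        return 0, 1                    # the pointer that led here was NULL
    n1, z1 = count_nodes_and_nulls(node.oldest_child)
    n2, z2 = count_nodes_and_nulls(node.next_sibling)
    return 1 + n1 + n2, z1 + z2


# Binary equivalent of A(B, C(D), E(F, G(H))), built from the bottom up.
h = BinNode("H")
g = BinNode("G", oldest_child=h)
f = BinNode("F", next_sibling=g)
e = BinNode("E", oldest_child=f)
d = BinNode("D")
c = BinNode("C", oldest_child=d, next_sibling=e)
b = BinNode("B", next_sibling=c)
a = BinNode("A", oldest_child=b)

print(count_nodes_and_nulls(a))   # (8, 9)
```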

      5. Performing traversals on a general tree represented by an equivalent
         binary tree.

         a. Preorder traversal of the general tree is accomplished by preorder
            traversal of the transformed tree.

            ex: preorder traversal of the above binary tree: A B C D E F G H
        
            PROJECT: Code for preorder

         b. What about postorder traversal?  How would this be done?
         
            ASK
         
            i. Postorder traversal of the general tree is accomplished by 
               INORDER traversal of the transformed tree.

               Inorder traversal:  traverse the left subtree in inorder
                                   visit the root
                                   traverse the right subtree in inorder

           ii. ex: the above:      B D C F H G E A

         iii. This works because:

               - The left subtree of any node in the transformed tree contains all
                 the nodes that were descendants of that node in the original
                 tree.  These should be visited first.
               - The right subtree of any node in the transformed tree contains
                 all the nodes that were right siblings (or descendants thereof)
                 of the node in the original tree. These should be visited after
                 the node.

          iv. What would need to be done to change the example code for preorder
              just projected to do this?

              ASK

              - Change the name
              - Do the visit between subtrees
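
              Both traversals of the transformed tree can be sketched in
              Python as follows.  The class BinNode and the function names
              are illustrative assumptions, not the projected course code:

```python
class BinNode:
    """Hypothetical oldest child/next sibling node: exactly two pointers."""
    def __init__(self, value, oldest_child=None, next_sibling=None):
        self.value = value
        self.oldest_child = oldest_child
        self.next_sibling = next_sibling


def preorder(node, out):
    """Preorder on the binary tree = preorder on the general tree."""
    if node is not None:
        out.append(node.value)              # visit BEFORE both subtrees
        preorder(node.oldest_child, out)
        preorder(node.next_sibling, out)
    return out


def inorder(node, out):
    """Inorder on the binary tree = postorder on the general tree."""
    if node is not None:
        inorder(node.oldest_child, out)     # descendants first
        out.append(node.value)              # visit BETWEEN the subtrees
        inorder(node.next_sibling, out)     # younger siblings afterwards
    return out


# Binary equivalent of A(B, C(D), E(F, G(H))), built from the bottom up.
h = BinNode("H")
g = BinNode("G", oldest_child=h)
f = BinNode("F", next_sibling=g)
e = BinNode("E", oldest_child=f)
d = BinNode("D")
c = BinNode("C", oldest_child=d, next_sibling=e)
b = BinNode("B", next_sibling=c)
a = BinNode("A", oldest_child=b)

print(" ".join(preorder(a, [])))   # A B C D E F G H
print(" ".join(inorder(a, [])))    # B D C F H G E A
```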

         c. Postorder traversal of the transformed tree has no relationship to
            any meaningful operation on the original tree.

         d. An equivalent to our readTree procedure defined above can also be
            written.

         PROJECT: Code for readTree
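
         The projected code is not reproduced in these notes.  As a stand-in,
         here is a minimal Python sketch that reads the same file format
         (preorder, one node per line, contents followed by number of
         children) into the oldest child/next sibling representation.  The
         class BinNode and the name read_tree are illustrative assumptions,
         and the contents of a node are assumed to contain no spaces:

```python
class BinNode:
    """Hypothetical oldest child/next sibling node: exactly two pointers."""
    def __init__(self, value):
        self.value = value
        self.oldest_child = None
        self.next_sibling = None


def read_tree(lines):
    """Build a tree from preorder lines of the form '<contents> <child count>'."""
    lines = iter(lines)               # a shared iterator: each call consumes its own lines
    value, count = next(lines).split()
    node = BinNode(value)
    previous = None                   # the most recently read child of this node
    for _ in range(int(count)):
        child = read_tree(lines)
        if previous is None:
            node.oldest_child = child       # the first child read is the oldest
        else:
            previous.next_sibling = child   # later children chain off the previous one
        previous = child
    return node


root = read_tree(["A 3", "B 0", "C 0", "D 2", "E 0", "F 0"])
b = root.oldest_child
d = b.next_sibling.next_sibling
print(b.value, b.next_sibling.value, d.value)                # B C D
print(d.oldest_child.value, d.oldest_child.next_sibling.value)  # E F
```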