How to efficiently check whether it's height balanced for a massively skewed binary search tree?

Question

I was reading this answer on how to check if a BST is height balanced, and really hooked by the bonus question:

Suppose the tree is massively unbalanced. Like, a million nodes deep on one side and three deep on the other. Is there a scenario in which this algorithm blows the stack? Can you fix the implementation so that it never blows the stack, even when given a massively unbalanced tree?

What would be a good strategy here?

I am thinking to do a level order traversal and track the depth, if a leaf is found and current node depth is bigger than the leaf node depth + 2, then it's not balanced. But how to combine this with height checking?

Edit: below is the implementation in the linked answer

IsHeightBalanced(tree)
    return (tree is empty) or 
           (IsHeightBalanced(tree.left) and
           IsHeightBalanced(tree.right) and
           abs(Height(tree.left) - Height(tree.right)) <= 1)

"Fix the implementation" - which one? Please post the code you're referring to. — Bergi, Apr 18 '14 at 19:26
The largest height-balanced tree that fits in a 64-bit address space has height much smaller than one million. — David Eisenstat, Apr 18 '14 at 19:29
You probably should first finish "Bonus exercise" before you try "Super bonus exercise". — Bernhard Barker, Apr 18 '14 at 19:58
@Bergi updated, this implementation depends on `Height()` which implies additional traversal. I am aware of how to do it in one traversal without calling `Height()`, the real problem is how to stop early. — Skiptomylu, Apr 18 '14 at 20:00
@Dukeling I can modify the function to return a tuple which contains a boolean and a height, hence reduce to one traversal. — Skiptomylu, Apr 18 '14 at 20:05
Define "blow the stack". You can always use a custom sized stack instead of the standard stack provided by the OS. Another interpretation would be that you want to do it in logarithmic space, but then how do you do level-order traversal? It's not really straightforward — Niklas B., Apr 18 '14 at 20:52
If you're willing to change pointers temporarily, you can get the space usage of a traversal down to O(1), but take it from someone who's debugged a couple binary tree implementations: you *really* want to use an external set to detect all of the other topological abnormalities. — David Eisenstat, Apr 19 '14 at 00:06
@DavidEisenstat Can you elaborate a bit on that approach (possibly in an answer, as I guess it would answer the question)? — Bernhard Barker, Apr 19 '14 at 12:00
@Dukeling It only works on one of the two definitions of balance, and it involves an algorithm whose details I had forgotten. See my answer. — David Eisenstat, Apr 19 '14 at 16:51

score 1 · Answer 1 · edited May 23 '17 at 10:32

To review briefly: a tree is defined as being either null or a root node with pointers .left to a left child and .right to a right child, where each child is in turn a tree, the root node appears in neither child, and no node appears in both children. The depth of a node is the number of pointers that must be followed to reach it from the root node. The height of a tree is -1 if it's null or else the maximum depth of a node that appears in it. A leaf is a node whose children are null.

First let me note the two distinct definitions of "balanced" proposed by answerers of the linked question.

EL-balanced A tree is EL-balanced if and only if, for every node v, |height(v.left) - height(v.right)| <= 1.

This is the balance condition for AVL trees.

DF-balanced A tree is DF-balanced if and only if, for every pair of leaves v, w, we have |depth(v) - depth(w)| <= 1. As DF points out, DF-balance for a node implies DF-balance for all of its descendants.

DF-balance is used for no algorithm known to me, though the balance condition for binary heaps is very similar, requiring additionally that the deeper leaves be as far left as possible.

I'm going to outline three approaches to testing balance.

Size bounds for balanced trees

Expand the recursive function to have an extra parameter, maxDepth. For each recursive call, pass maxDepth - 1, so that maxDepth roughly tracks how much stack space is left. If maxDepth reaches 0, report the tree as unbalanced (e.g., by returning "infinity" for the height), since no balanced tree that fits in main memory could possibly be that tall.

This approach relies on an a priori size bound on main memory, which is available in practice if not in all theoretical models, and the fact that no subtrees are shared. (PROTIP: unless you're very careful, your subtrees will be shared at some point during development.) We also need height bounds on balanced trees of at most a given size.

EL-balanced Via mutual induction, we prove a lower bound, L(h), on the number of nodes belonging to an EL-balanced tree of a given height h.

The base cases are

L(-1) = 0
L(0) = 1,

more or less by definition. The inductive case is trickier. An EL-balanced tree of height h > 0 is a node with an EL-balanced child of height h - 1 and another EL-balanced child of height either h - 1 or h - 2. This means that

L(h) = 1 + L(h - 1) + min(L(h - 2), L(h - 1)).

Add 1 to both sides and rearrange.

L(h) + 1 = L(h - 1) + 1 + min(L(h - 2) + 1, L(h - 1) + 1).

A little while later (spoiler), we find that

L(h) <= phi^(h + 2)/sqrt(5),
where phi = (1 + sqrt(5))/2 ~ 1.618.

maxDepth then should be set to the floor of the base-phi logarithm of the maximum number of nodes, plus a small constant that depends on fenceposty things.

DF-balanced Rather than write out an induction proof, I'm going to appeal to your intuition that the worst case is a complete binary tree with one extra leaf on the bottom. Then the proper setting for maxDepth is the base-2 logarithm of the maximum number of nodes, plus a small constant.

Iterative deepening depth-first search

This is the theoretician's version of the answer above. Because, for some reason, we don't know how much RAM our computer has (and with logarithmic space usage, it's not as though we need a tight bound), we again include the maxDepth parameter, but this time, we use it to truncate the tree implicitly below the specified depth. If the height of the tree comes back below the bound, then we know that the algorithm ran successfully. Alternatively, if the truncated tree is unbalanced, then so is the whole tree. The problem case is when the truncated tree is balanced but with height equal to maxDepth. Then we increase maxDepth and retry.

The simplest retry strategy is to increase maxDepth by 1 every time. Since balanced trees with n nodes have height O(log n), the running time is O(n log n). In fact, for DF-balanced trees, the running time is also O(n), since, except for the last couple traversals, the size of the truncated tree increases by a factor of 2 each time, leading to a geometric series.

Another strategy, doubling maxDepth each time, gives an O(n) running time for EL-balanced trees, since the largest tree of height h, with 2^(h + 1) - 1 nodes, is much smaller than the smallest tree of height 2h, with approximately (phi^2)^h nodes. The downside of doubling is that we may use twice as much stack space. With increase-by-1, however, in the family of minimum-size EL-balanced trees we constructed implicitly in defining L(h), the number of nodes at depth h - k in the tree of height h is polynomial of degree k. Accordingly, the last few scans will incur some superlinear term.

Temporarily mutating pointers

If there are parent pointers, then it's easy to traverse depth-first in place, because the parent pointers can be used to derive the relevant information on the stack in an efficient manner. If we don't have parent pointers but can mutate the tree temporarily, then, for descent into a child, we can cannibalize the pointer to that child to store temporarily the node's parent. The problem is determining on the way up whether we came from a left or a right child. If we can sneak a bit (say because pointers are 2-byte aligned, or because there's a spare bit in the balance factor, or because we're copying the tree for stop-and-copy garbage collection and can determine which arena we're in), then that's one way. Another test assumes that the tree is a binary search tree. It turns out that we don't need additional assumptions, however: Explain Morris inorder tree traversal without using stacks or recursion .

The one fly in the ointment is that this approach only works, as far as I know, on DF-balance, since there's no space on the stack to put the partial results for EL-balance.

How to efficiently check whether it's height balanced for a massively skewed binary search tree?

1 Answers1

Size bounds for balanced trees

Iterative deepening depth-first search

Temporarily mutating pointers