Why is the number of sub-trees gained from a range tree query is O(log(n))?

Question

I'm trying to figure out this data structure, but I don't understand how can we tell there are O(log(n)) subtrees that represents the answer to a query?

Here is a picture for illustration:

Thanks!

Because if we split, it is possible that we have to "cut off" subparts of the trees. For a complete tree, there are *O(log n)* levels, and for each of this level we either return the full node, or an edited one where we recurse one at most one subtree. — Willem Van Onsem, Nov 17 '18 at 12:58

score 0 · Answer 1 · answered Nov 17 '18 at 15:43

If we make the assumption that the above is a purely functional binary tree [wiki], so where the nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x₁ and lower than x₂ are in the tree.

Let us start with a very simple case to illustrate the point. Imagine that we simply do not have any bounds, than we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. So we can, without any bounds return a tree in O(1), given that tree is not edited (at least not as long as we use the subtree).

The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x₁. Basically we can define a recursive algorithm for that:

the cutted version of None (or whatever represents a null reference, or a reference to an empty tree) is None;
if the node has a value is smaller than the threshold, we return a "cutted" version of the right subtree; and
if the node has a value greater than the threshold, we return an inode that has the same right subtree, and as left subchild the cutted version of the left subchild.

So in pseudo-code it looks like:

def treelarger(some_node, min):
    if some_tree is None:
        return None
    if some_node.value > min:
        return Node(treelarger(some_node.left, min), some_node.value, some_node.right)
    else:
        return treelarger(some_node.right, min)

This algorithm thus runs in O(h) with h the height of the tree, since for each case (except the first one), we recurse to one (not both) of the children, and it ends in case we have a node without children (or at least does not has a subtree in the direction we need to cut the subtree).

We thus do not make a complete copy of the tree. We reuse a lot of nodes in the old tree. We only construct a new "surface" but most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct, at most, O(h) new nodes. We can optimize the above such that, given the cutted version of one of the subtrees is the same, we do not create a new node. But that does not even matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes is either less than the original number, or the same.

In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).

Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller that generates a subtree that contains all elements that are smaller:

def treesmaller(some_node, max):
    if some_tree is None:
        return None
    if some_node.value < min:
        return Node(some_node.left, some_node.value, treesmaller(some_node.right, max))
    else:
        return treesmaller(some_node.left, max)

so roughly speaking there are two differences:

we change the condition from some_node.value > min to some_node.value < max; and
we recurse on the right subchild in case the condition holds, and on the left if it does not hold.

Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.

Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger function, and then that result through a treesmaller (or vice versa).

Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.

Given the original tree was a complete tree, then it thus holds that we create O(log n) new nodes.

Apologies if I've misinterpreted your answer, but I don't see how the purely functional approach you're mentioning connects to the way of counting how many subtrees get picked up in a search. Can you elaborate on that a bit? — templatetypedef, Jun 12 '20 at 21:05
@templateltypedef: it means that we do not need to copy the nodes in "the middle", since we simply refer to these, like a *finger tree* for example. We thus only need to make new nodes for for the two edges in the "triangle", and can reuse all other nodes in the tree. — Willem Van Onsem, Jun 13 '20 at 18:17

score 0 · Answer 2 · answered Jun 12 '20 at 21:04

Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.

In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:

If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range is purely in range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.

In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.

Hope this helps!

Why is the number of sub-trees gained from a range tree query is O(log(n))?

2 Answers2