I'm trying to figure out this data structure, but I don't understand how can we tell there are O(log(n)) subtrees that represents the answer to a query?
Here is a picture for illustration:
Thanks!
I'm trying to figure out this data structure, but I don't understand how can we tell there are O(log(n)) subtrees that represents the answer to a query?
Here is a picture for illustration:
Thanks!
If we make the assumption that the above is a purely functional binary tree [wiki], so where the nodes are immutable, then we can make a "copy" of this tree such that only elements with a value larger than x1 and lower than x2 are in the tree.
Let us start with a very simple case to illustrate the point. Imagine that we simply do not have any bounds, than we can simply return the entire tree. So instead of constructing a new tree, we return a reference to the root of the tree. So we can, without any bounds return a tree in O(1), given that tree is not edited (at least not as long as we use the subtree).
The above case is of course quite simple. We simply make a "copy" (not really a copy since the data is immutable, we can just return the tree) of the entire tree. So let us aim to solve a more complex problem: we want to construct a tree that contains all elements larger than a threshold x1. Basically we can define a recursive algorithm for that:
None
(or whatever represents a null reference, or a reference to an empty tree) is None
;So in pseudo-code it looks like:
def treelarger(some_node, min):
if some_tree is None:
return None
if some_node.value > min:
return Node(treelarger(some_node.left, min), some_node.value, some_node.right)
else:
return treelarger(some_node.right, min)
This algorithm thus runs in O(h) with h the height of the tree, since for each case (except the first one), we recurse to one (not both) of the children, and it ends in case we have a node without children (or at least does not has a subtree in the direction we need to cut the subtree).
We thus do not make a complete copy of the tree. We reuse a lot of nodes in the old tree. We only construct a new "surface" but most of the "volume" is part of the old binary tree. Although the tree itself contains O(n) nodes, we construct, at most, O(h) new nodes. We can optimize the above such that, given the cutted version of one of the subtrees is the same, we do not create a new node. But that does not even matter much in terms of time complexity: we generate at most O(h) new nodes, and the total number of nodes is either less than the original number, or the same.
In case of a complete tree, the height of the tree h scales with O(log n), and thus this algorithm will run in O(log n).
Then how can we generate a tree with elements between two thresholds? We can easily rewrite the above into an algorithm treesmaller
that generates a subtree that contains all elements that are smaller:
def treesmaller(some_node, max):
if some_tree is None:
return None
if some_node.value < min:
return Node(some_node.left, some_node.value, treesmaller(some_node.right, max))
else:
return treesmaller(some_node.left, max)
so roughly speaking there are two differences:
some_node.value > min
to some_node.value < max
; andright
subchild in case the condition holds, and on the left if it does not hold.Now the conclusions we draw from the previous algorithm are also conclusions that can be applied to this algorithm, since again it only introduces O(h) new nodes, and the total number of nodes can only decrease.
Although we can construct an algorithm that takes the two thresholds concurrently into account, we can simply reuse the above algorithms to construct a subtree containing only elements within range: we first pass the tree to the treelarger
function, and then that result through a treesmaller
(or vice versa).
Since in both algorithms, we introduce O(h) new nodes, and the height of the tree can not increase, we thus construct at most O(2 h) and thus O(h) new nodes.
Given the original tree was a complete tree, then it thus holds that we create O(log n) new nodes.
Consider the search for the two endpoints of the range. This search will continue until finding the lowest common ancestor of the two leaf nodes that span your interval. At that point, the search branches with one part zigging left and one part zagging right. For now, let's just focus on the part of the query that branches to the left, since the logic is the same but reversed for the right branch.
In this search, it helps to think of each node as not representing a single point, but rather a range of points. The general procedure, then, is the following:
If the query range fully subsumes the range represented by this node, stop searching in x and begin searching the y-subtree of this node.
If the query range is purely in range represented by the right subtree of this node, continue the x search to the right and don't investigate the y-subtree.
If the query range overlaps the left subtree's range, then it must fully subsume the right subtree's range. So process the right subtree's y-subtree, then recursively explore the x-subtree to the left.
In all cases, we add at most one y-subtree in for consideration and then recursively continue exploring the x-subtree in only one direction. This means that we essentially trace out a path down the x-tree, adding in at most one y-subtree per step. Since the tree has height O(log n), the overall number of y-subtrees visited this way is O(log n). And then, including the number of y-subtrees visited in the case where we branched right at the top, we get another O(log n) subtrees for a total of O(log n) total subtrees to search.
Hope this helps!