How to return a node, uniformly at random, from a binary search tree?

Question

Given a BST (may or may not be balanced) how can one return "any" node uniformly at random? A constraint is you cannot use an external indexing data structure. You must traverse the tree in such a manner that every node has an equal chance of being visited.

This question has me perplexed for quite a while. If we can indeed use an external hashtable/pointers we could just randomize on those and return the corresponding node. However, my colleague has put forth a rather complex variant of the question, where no additional data structures can be used.

A simple random walk with a 50-50 chance of either going L/R doesn't work since the probability of returning a node near the bottom of the tree is much smaller(probabilities compound)
Even randomly generating a depth d and traversing at most d nodes to return the node (or stopping if it's a leaf) doesn't generate a uniform distribution.

Update: You cannot have an inorder walk and store the result in an array either.

How can one achieve such a traversal?

Can you just generate a random number between 0 and the number of nodes in the tree, then traverse the tree in order such that as soon as you have moved from one node to another "random number" amount of times, you stop at that node and return? That seems too simple so maybe thats not what you are looking for :/ — Josh Engelsma, Sep 19 '14 at 20:16
If an inorder work isn't allowed then the question is ludicrous — Tim B, Sep 19 '14 at 20:40
Yep, sampling uniformly in a plain old binary search tree without visiting every node is flat out impossible. — David Eisenstat, Sep 19 '14 at 21:06
@DavidEisenstat: That's probably not what was meant by "and" in that prohibition. In any event, obviously you have to visit every node (unless you know how many nodes there are in advance) but you don't have to visit them in order. — rici, Sep 19 '14 at 21:09

rici · Accepted Answer · 2014-09-19T22:26:40.317

Walk the tree in any order, keeping the following values:

N: the number of nodes seen
selected: the currently selected node.

Initially, N is 0 and selected is None. Visiting a node consists of the following:

Increment N
Generate a random integer in the range [0, N).
If the random integer selected is 0, set selected to the current Node.

Note that the values N and selected need to be modified during the walk. That means that they are both input and output values to the visitor function.

At the end of the walk, N will be the number of nodes in the tree, and selected will be a random node selected with uniform probability (assuming you have a good random number generator).

This algorithm is not restricted to BSTs. It will work on any tree of any shape. In particular, it will work on a simple linear sequence of objects of unknown length, corresponding to the well-known random selection algorithm which is to iterate over the objects, replacing the selected random object with the newly visited one with probability 1/N where N is the number of objects seen to date.

If you keep track of visited nodes, it will also work on any connected graph.

If you have a very large tree (or graph), perhaps spread over a number of servers and/or storage devices, you can use a different presentation of this algorithm, which provides for a certain level of parallelism (and also prevents the need to keep a global walk structure or pass values into the walk).

We assume that each node-server has direct access to k objects and indirect access to some known number of children servers. The algorithm allows for redundant children, but assumes that network communication is (almost) perfect; dealing with network splits is outside of the scope of this answer. We also assume that every query has an associated unique query number, which allows us to deal with some networking artifacts. The query has no other information (other than the server to respond to), and is expected to return a tuple consisting of a count and a randomly-selected node.

When a node-server receives a query with id q, it does the following:

If it has previously responded to query q, immediately return <0, null>
Set count to k and selected to a randomly-selected object from the k objects it has direct access to.
For each child server, send the query (with the same query-id)
For each response returned (it doesn't matter what order the responses come in):

a. Add response.count to count

b. With probability response.count / count, replace selected with response.selected
When all children servers have responded, return <count, selected>

Nice algorithm, it is worthwhile to mention that you're using Reservoir Sampling. :) — Tuxdude, Jun 04 '16 at 20:53
@Tuxdude: Or a variation on the theme, in the case of the parallel multi-server algorithm. — rici, Jun 04 '16 at 20:58
@rici Isn't there a chance that the step "Generate a random integer in the range [0, N)" never generates 0? In that case, selected will always be null/none. — codewarrior, Mar 04 '18 at 08:47
@codewarrior: nope. the first time step 2 is executed, `N` is `1`, and "a random integer in the range `[0, N)`" can only be `0`. — rici, Mar 04 '18 at 12:11

tmyklebu · Answer 2 · 2014-09-19T20:22:47.773

If you know the number of nodes n in the tree, find the kth node in an inorder walk of the tree for some randomly-chosen k between 0 and n-1, inclusive. If you don't, you can walk the tree and figure out its size and then do the above.

If the nodes of the tree are stored contiguously in some array, randomly sample an array element and return it.

If each tree node can tell you the size of the subtree rooted there, figure out how many nodes are in the tree, generate a random k up to the number of nodes in the tree, and select the kth element of the tree.

In all cases above, the "tree" part is a red herring.

There is a random walk on any nontrivial connected graph whose stationary distribution is uniform. Select a neighbour of the current node. If it has lower or equal degree, go there. If it has higher degree, go there with probability cur_deg / that_deg and stay put otherwise. Sampling from this random walk is called in various contexts "Gibbs sampling" and "Metropolis-Hastings."

http://en.wikipedia.org/wiki/Reservoir_sampling . So there's no need to know the size in advance. — rici, Sep 19 '14 at 21:07

score 0 · Answer 3 · answered Sep 19 '14 at 21:11

If you have a full (bottom level is completely full) binary search tree then it is possible to quickly (sub-linear time) do as you ask, because you know the structure of the tree. If you want me to post an answer that gives the solution for this case, let me know and I'll update my answer. However, if you just have a generic binary search tree of arbitrary shape, then it is impossible to uniformly sample a node without visiting the entire tree so that you know the shape.

How to return a node, uniformly at random, from a binary search tree?

3 Answers3

Linked