Walk the tree in any order, keeping the following values:
Initially, N
is 0 and selected
is None
. Visiting a node consists of the following:
Increment N
Generate a random integer in the range [0, N)
.
If the random integer selected is 0, set selected
to the current Node.
Note that the values N
and selected
need to be modified during the walk. That means that they are both input and output values to the visitor function.
At the end of the walk, N
will be the number of nodes in the tree, and selected
will be a random node selected with uniform probability (assuming you have a good random number generator).
This algorithm is not restricted to BSTs. It will work on any tree of any shape. In particular, it will work on a simple linear sequence of objects of unknown length, corresponding to the well-known random selection algorithm which is to iterate over the objects, replacing the selected random object with the newly visited one with probability 1/N
where N
is the number of objects seen to date.
If you keep track of visited nodes, it will also work on any connected graph.
If you have a very large tree (or graph), perhaps spread over a number of servers and/or storage devices, you can use a different presentation of this algorithm, which provides for a certain level of parallelism (and also prevents the need to keep a global walk structure or pass values into the walk).
We assume that each node-server has direct access to k
objects and indirect access to some known number of children servers. The algorithm allows for redundant children, but assumes that network communication is (almost) perfect; dealing with network splits is outside of the scope of this answer. We also assume that every query has an associated unique query number, which allows us to deal with some networking artifacts. The query has no other information (other than the server to respond to), and is expected to return a tuple consisting of a count and a randomly-selected node.
When a node-server receives a query with id q
, it does the following:
If it has previously responded to query q
, immediately return <0, null>
Set count
to k
and selected
to a randomly-selected object from the k
objects it has direct access to.
For each child server, send the query (with the same query-id)
For each response returned (it doesn't matter what order the responses come in):
a. Add response.count
to count
b. With probability response.count / count
, replace selected
with response.selected
When all children servers have responded, return <count, selected>