
I would like to discuss how to optimally implement the following:

Given a set of elements, all of them distinct. Call this input set N; it has size n. Each of these elements has a weight. Now, you need to create a subset R of N of size r, where r is smaller than or equal to n. This subset, again, cannot contain repetitions. Every element of R needs to be randomly chosen from N, and if an element has a higher weight, the chance of it being chosen needs to be higher: if the total weight of all elements in N is W, then the chance that element i is chosen needs to be w_i/W.

One last detail is that the weights are prone to change, but only to increase; they cannot decrease.

I want a neat implementation for this. I am working in Java, but I am hoping to find some language-agnostic properties or details (or even pseudocode) that are interesting.

Now, for my own solution: I create an ArrayList of the original set of elements. I make sure that the weights are natural numbers, and I add each element to the list w times if its weight is w. Then I shuffle the list (Collections.shuffle) and keep taking elements from the shuffled list, adding them to a Java Set, until the size of the set is r. If the weight of an element increases, it gets added some more times to the original list. Shuffle again, make a new subset.
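
Roughly, as a sketch (simplified to integer weights; the names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the expand-and-shuffle approach; T is the element type.
static <T> Set<T> sampleByShuffle(List<T> elements, List<Integer> weights, int r) {
    List<T> expanded = new ArrayList<>();
    for (int i = 0; i < elements.size(); i++) {
        for (int j = 0; j < weights.get(i); j++) {   // element appears weight-many times
            expanded.add(elements.get(i));
        }
    }
    Collections.shuffle(expanded);
    Set<T> result = new HashSet<>();
    for (T t : expanded) {
        if (result.size() == r) {
            break;
        }
        result.add(t);                               // the Set swallows duplicates
    }
    return result;
}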

My implementation needs loads of memory, and the shuffle step also gets slower as the list grows. Is there a better idea out there?


1 Answer


First, let's simplify it to drawing only one element. You calculate

sum[-1] = 0
sum[i] = sum[i-1] + weight[i]

Then, you simply draw a number x uniformly in the range [0, sum), and do a binary search for x over the prefix sums. The range it falls in is the element you draw.
Building the sums takes O(n), so a single draw is an O(n)-time solution (the binary search itself is only O(log n)).
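
A minimal Java sketch of the single-draw case (the method name is mine):

import java.util.Random;

// Draw one index i with probability weight[i] / W via prefix sums + binary search.
static int drawOne(double[] weight, Random rnd) {
    double[] prefix = new double[weight.length + 1];     // prefix[i + 1] = weight[0] + ... + weight[i]
    for (int i = 0; i < weight.length; i++) {
        prefix[i + 1] = prefix[i] + weight[i];
    }
    double x = rnd.nextDouble() * prefix[weight.length]; // uniform in [0, W)
    // Find the smallest i with x < prefix[i + 1]; that slot is the drawn element.
    int lo = 0, hi = weight.length - 1;
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (prefix[mid + 1] <= x) {
            lo = mid + 1;                                // x lies to the right of slot mid
        } else {
            hi = mid;
        }
    }
    return lo;
}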

You can obviously do it for more elements, but then you have to either remove the elements you have already chosen from the set, or repeat the draw until you hit a new item. Both solutions, however, decay to quadratic complexity for large subset sizes :(

But can we improve it and do better?
Yes. Use a binary search tree instead of an array. The tree is a variation of an order statistics tree where, instead of storing #children(v) in each node, you store the sum of the weights in each subtree. Other than that, it basically remains the same as an order statistics tree.

More information about the tree solution can be found as an answer to a similar question: Algorithm to shuffle an Array randomly based on different weights

Complexity of building the tree is O(n log n), and each query + removal is O(log n), which gives you O(n log n + k log n) = O(n log n) in total, since k <= n.
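
A sketch of one way to realize this in Java. The concrete structure here is my choice: a flat segment tree over the fixed element order rather than a pointer-based BST; any balanced tree with subtree weight sums behaves the same way.

import java.util.Random;

// Each node stores the total weight of its subtree; leaves are the elements.
class WeightTree {
    private final double[] sum;  // implicit binary tree, 1-based; leaves at n..2n-1
    private final int n;

    WeightTree(double[] weights) {
        n = weights.length;
        sum = new double[2 * n];
        System.arraycopy(weights, 0, sum, n, n);   // leaves
        for (int i = n - 1; i >= 1; i--) {
            sum[i] = sum[2 * i] + sum[2 * i + 1];  // internal node = sum of children
        }
    }

    // Point update in O(log n): set element i's weight
    // (0 to "remove" it, or a larger value when a weight increases).
    void setWeight(int i, double w) {
        int v = i + n;
        sum[v] = w;
        for (v /= 2; v >= 1; v /= 2) {
            sum[v] = sum[2 * v] + sum[2 * v + 1];
        }
    }

    // Draw an index with probability weight[i] / totalWeight in O(log n);
    // assumes the total weight is positive.
    int draw(Random rnd) {
        double x = rnd.nextDouble() * sum[1];      // uniform in [0, W)
        int v = 1;
        while (v < n) {
            if (x < sum[2 * v]) {
                v = 2 * v;                         // descend left
            } else {
                x -= sum[2 * v];
                v = 2 * v + 1;                     // descend right
            }
        }
        return v - n;
    }
}

Sampling k distinct elements is then k rounds of draw(rnd) followed by setWeight(i, 0); the growing weights from the question map to the same setWeight call with the larger value, also in O(log n).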

So, we have two options:

If k is in o(log n) (little o here), prefer recreating the array, with O(n) time per query and O(nk) in total. Otherwise, you should prefer (in terms of time complexity) the O(n log n) tree-based solution.
In terms of space, both solutions need a linear amount of extra space.


This can be done even better, with a single pass. This is called weighted reservoir sampling. Its main drawback is numeric instability for large weights, caused by the exponent part (at least in my experience).

This solution runs in linear time, with O(1) extra space (not counting the output array of size k).
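
A sketch of one common variant, the A-Res scheme of Efraimidis and Spirakis (the answer does not name a specific variant, so this choice is mine): each element gets the key u^(1/w) for a uniform u in [0,1), and the k largest keys win. The 1/w exponent is exactly the numerically fragile part mentioned above.

import java.util.PriorityQueue;
import java.util.Random;

// Weighted reservoir sampling, A-Res flavor: key each element with
// u^(1/w) and keep the k largest keys.
static int[] weightedReservoir(double[] weight, int k, Random rnd) {
    // Min-heap of {key, index}, smallest key on top, holding the best k so far.
    PriorityQueue<double[]> heap =
            new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
    for (int i = 0; i < weight.length; i++) {
        double key = Math.pow(rnd.nextDouble(), 1.0 / weight[i]); // the fragile exponent
        if (heap.size() < k) {
            heap.add(new double[] { key, i });
        } else if (key > heap.peek()[0]) {
            heap.poll();                           // evict the current smallest key
            heap.add(new double[] { key, i });
        }
    }
    int[] chosen = new int[heap.size()];
    for (int j = 0; j < chosen.length; j++) {
        chosen[j] = (int) heap.poll()[1];
    }
    return chosen;
}

With the min-heap, this particular sketch costs O(n log k) rather than strictly linear time, but it keeps the single-pass character and O(k) state described above.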
