17

Is there an algorithm for how to perform reservoir sampling when the points in the data stream have associated weights?

Peter O.
  • 32,158
  • 14
  • 82
  • 96
Budhapest
  • 601
  • 1
  • 5
  • 12

2 Answers2

22

The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.

The algorithm works as follows. First observe that another way to solve the unweighted reservoir sampling is to assign to each element a random id R between 0 and 1 and incrementally (say with a heap) keep track of the top k ids. Now let's look at weighted version, and let's say the i-th element has weight w_i. Then, we modify the algorithm by choosing the id of the i-th element to be R^(1/w_i) where R is again uniformly distributed in (0,1).

Another article talking about this algorithm is this one by the Cloudera folks.

had00b
  • 379
  • 2
  • 4
  • 5
    And a one line python implementation: `heapq.nlargest(k, items, key=lambda item: math.pow(random.random(), 1/weight(item)))` – Juan A. Navarro Mar 15 '15 at 11:18
  • Is it possible to do this with replacement? – Simd Feb 12 '17 at 21:40
  • @eleanora There is no point to doing this with replacement since the alias method exist, you need to first create a table, which takes O(n) time, and then each selection is O(1). Alias however doesn't keep its runtime complexity on selection unless you use it with replacement. – Krupip May 13 '17 at 20:37
  • Cloudera seems to have taken the original article down. What appears to be a copy can be found [here](https://www.geek-share.com/detail/2682415762.html). – Scott Kaiser Dec 13 '19 at 22:45
4

You can try the A-ES algorithm from this paper of S. Efraimidis. It's quite simple to code and very efficient.

Hope this helps,

Benoit

bmat06
  • 84
  • 2