
Suppose I have a list of pairs of items and counts, such as

`{("a", 500), ("b", 200), ("c", 2), ("d", 300)}`

I'm interested in cases where the list is relatively short but the sum of counts is large; the specific case I'm looking at now has fewer than 10,000 distinct items but a sum of counts over 45 million.

I'd like a rather large random sample of items, say 5 million, drawn without replacement. What's an algorithm for doing that in both space and time sublinear in the sample size?

By comparison, the naive approach of applying a weighted-choice function once for each item would obviously take time at least linear in the sample size, since you'd need 5 million weighted choices for 5 million items.

Extra credit for a Python solution.

Edit, to emphasize how this is different from a typical weighted-choice problem: in the typical weighted-choice problem, the weights don't change. When sampling without replacement from a multiset, they do, because the count of each removed item decreases. That's why you'd need to do 5 million weighted choices if all you had was a weighted-choice algorithm.
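For concreteness, here is a minimal sketch of the naive approach described above: one weighted choice per sampled item, decrementing the chosen item's count each time. (The function name `naive_sample` is my own; this assumes Python 3.6+ for `random.choices`.) It takes time at least linear in the sample size, which is exactly what I'm trying to beat.

```python
import random

def naive_sample(counts, k):
    """Draw k items without replacement from a multiset given as
    {item: count}. Naive: one weighted choice per drawn item."""
    counts = dict(counts)  # copy so the caller's counts are untouched
    sample = []
    for _ in range(k):
        # Weighted choice over the items that remain.
        items = list(counts)
        weights = [counts[i] for i in items]
        chosen = random.choices(items, weights=weights)[0]
        sample.append(chosen)
        # Decrement the chosen item's count; drop it when exhausted.
        counts[chosen] -= 1
        if counts[chosen] == 0:
            del counts[chosen]
    return sample
```

With the example list above, `naive_sample({"a": 500, "b": 200, "c": 2, "d": 300}, 200)` makes 200 weighted choices, rebuilding the weight list after each draw.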

Kodiologist
  • Wouldn't the 'naive approach' you describe take time linear in the number of counts (not their sum)? – user2390182 May 01 '16 at 17:30
  • I marked one potential duplicate. If you search for "Python weighted choice" you can find many other approaches to this task. – BrenBarn May 01 '16 at 17:30
  • @schwobaseggl No. If you wanted 5 million items, you'd need to do a weighted choice 5 million times, since the weights change each time, since you're sampling without replacement. – Kodiologist May 01 '16 at 17:38
  • Still, you are updating the weights (takes same time for any count, no?) to refresh after each item. Doesn't that mean you are doing sth. `O(1)` for every count, not sth. `O(count)`? – user2390182 May 01 '16 at 17:42
  • @schwobaseggl I'm not sure what you mean, so I'll give an example to show what I mean, using the `a`, `b`, `c`, `d` list above. Suppose you want 200 items. You make a weighted choice and choose `b`, so you add 1 copy of `b` to your output list and decrement `b`'s count in the input list by 1. Then you make a new weighted choice, with the new list of counts, to get your second item. And so on 198 more times. – Kodiologist May 01 '16 at 17:47
  • 1
    @Kodiologist Perfect explanation of my point ;-) The making of the new list of counts (or probabilitites in your example with 4 counts) is `O(4)`, not `O(1002)`. So you are left with a linear dependency of the number of items pulled (200) and the number of counts (4), not the sum of counts (1002). – user2390182 May 01 '16 at 17:52
  • Ahh, I see. By the "sum of counts", I really meant the number of items you want to pull, not the sum of counts in the original list. I'll edit. – Kodiologist May 01 '16 at 17:54
  • I don't think you're going to be able to sample without replacement in less than linear time. – BrenBarn May 01 '16 at 18:45

0 Answers