Suppose I have a list of pairs of items and counts, such as
{("a", 500), ("b", 200), ("c", 2), ("d", 300)}
I'm interested in cases where the list is relatively short but the sum of counts is large; the specific case I'm looking at now has fewer than 10,000 distinct items but a sum of counts over 45 million.
I'd like a rather large random sample of items, say 5 million, drawn without replacement. What's an algorithm for doing that which requires space and time less than linear in the sample size? (The sample itself can be represented as item/count pairs, so the output stays small.)
By comparison, the naive approach of making one weighted choice per sampled item obviously takes time at least linear in the sample size, since you'd need 5 million weighted choices to draw 5 million items.
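To be concrete about the baseline I mean, here's a minimal sketch of that naive approach (the function name is my own; it's even worse than linear as written, since it rebuilds the weight list on every draw):

```python
import random

def naive_sample_without_replacement(counts, k):
    """Naive baseline: one weighted choice per sampled item.

    counts: dict mapping item -> count; k: sample size.
    Each draw decrements the chosen item's count, so the weights
    shift as the sample proceeds. Time is O(n) per draw here
    (the lists are rebuilt each iteration), O(n * k) overall.
    """
    counts = dict(counts)  # work on a copy
    sample = []
    for _ in range(k):
        items = list(counts)
        weights = list(counts.values())
        (item,) = random.choices(items, weights=weights)
        sample.append(item)
        counts[item] -= 1
        if counts[item] == 0:
            del counts[item]
    return sample
```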
Extra credit for a Python solution.
Edit to emphasize how this differs from a typical weighted-choice problem: in the typical problem, the weights don't change. When sampling without replacement from a multiset, they do, because the count of a removed item decreases. That's why you'd need 5 million separate weighted choices if all you had was a weighted-choice algorithm.
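For reference, the per-item totals of such a sample follow the multivariate hypergeometric distribution, which NumPy (1.18+) can draw from directly; that states compactly what I'm asking for, though I'm after the algorithm and its space/time bounds, not just a library call. A sketch with the example counts above and a hypothetical sample size of 500:

```python
import numpy as np

# Per-item counts of a without-replacement sample from a multiset
# follow the multivariate hypergeometric distribution.
items = ["a", "b", "c", "d"]
counts = [500, 200, 2, 300]  # sums to 1002

rng = np.random.default_rng()
sample_counts = rng.multivariate_hypergeometric(counts, 500)
# sample_counts[i] copies of items[i] were drawn; the entries
# sum to 500 and no entry exceeds the corresponding count.
```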