
Selecting without any weights (equal probabilities) is beautifully described here.

I was wondering if there is a way to convert this approach to a weighted one.

I am interested in other approaches as well.

Update: Sampling without replacement

nimcap
  • Is the sampling with or without replacement? – Amro Jan 28 '10 at 02:13
  • Either way, it's a duplicate of http://stackoverflow.com/questions/352670/weighted-random-selection-with-and-without-replacement – Jason Orendorff Jan 28 '10 at 09:29
  • @Jason I was asking a way to convert that elegant approach to a weighted one, it is not quite duplicate – nimcap Jan 28 '10 at 15:39
  • nimcap: The question I linked to is about weighted random selection. – Jason Orendorff Jan 28 '10 at 16:38
  • Did you analyze applying a Fenwick tree to your problem? – sw. Mar 22 '12 at 14:25
  • How many times is this question going to be asked? – Jérémie May 20 '12 at 17:02
  • Sampling without replacement with weights *in such a way that the weights are proportional to inclusion probabilities of each element* is far from a trivial task, and there is good recent research about it. See for instance: http://books.google.com.br/books/about/Sampling_Algorithms.html?id=2auW1rVAwGMC&redir_esc=y – Ferdinand.kraft Aug 15 '13 at 10:41

14 Answers


If the sampling is with replacement, you can use this algorithm (implemented here in Python):

import random

items = [(10, "low"),
         (100, "mid"),
         (890, "large")]

def weighted_sample(items, n):
    total = float(sum(w for w, v in items))
    i = 0
    w, v = items[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w, v = items[i]
        w -= x
        yield v
        n -= 1

This is O(n + m) where m is the number of items.

Why does this work? It is based on the following algorithm:

def n_random_numbers_decreasing(v, n):
    """Like reversed(sorted(v * random() for i in range(n))),
    but faster because we avoid sorting."""
    while n:
        v *= random.random() ** (1.0 / n)
        yield v
        n -= 1

The function weighted_sample is just this algorithm fused with a walk of the items list to pick out the items selected by those random numbers.

This in turn works because the probability that n random numbers 0..v will all happen to be less than z is P = (z/v)^n. Solve for z, and you get z = vP^(1/n). Substituting a random number for P picks the largest number with the correct distribution; and we can just repeat the process to select all the other numbers.

If the sampling is without replacement, you can put all the items into a binary heap, where each node caches the total of the weights of all items in that subheap. Building the heap is O(m). Selecting a random item from the heap, respecting the weights, is O(log m). Removing that item and updating the cached totals is also O(log m). So you can pick n items in O(m + n log m) time.

(Note: "weight" here means that every time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. It does not mean that elements appear in the output with a likelihood proportional to their weights.)

Here's an implementation of that, plentifully commented:

import random

class Node:
    # Each node in the heap has a weight, value, and total weight.
    # The total weight, self.tw, is self.w plus the weight of any children.
    __slots__ = ['w', 'v', 'tw']
    def __init__(self, w, v, tw):
        self.w, self.v, self.tw = w, v, tw

def rws_heap(items):
    # h is the heap. It's like a binary tree that lives in an array.
    # It has a Node for each pair in `items`. h[1] is the root. Each
    # other Node h[i] has a parent at h[i>>1]. Each node has up to 2
    # children, h[i<<1] and h[(i<<1)+1].  To get this nice simple
    # arithmetic, we have to leave h[0] vacant.
    h = [None]                          # leave h[0] vacant
    for w, v in items:
        h.append(Node(w, v, w))
    for i in range(len(h) - 1, 1, -1):  # total up the tws
        h[i>>1].tw += h[i].tw           # add h[i]'s total to its parent
    return h

def rws_heap_pop(h):
    gas = h[1].tw * random.random()     # start with a random amount of gas

    i = 1                     # start driving at the root
    while gas >= h[i].w:      # while we have enough gas to get past node i:
        gas -= h[i].w         #   drive past node i
        i <<= 1               #   move to first child
        if gas >= h[i].tw:    #   if we have enough gas:
            gas -= h[i].tw    #     drive past first child and descendants
            i += 1            #     move to second child
    w = h[i].w                # out of gas! h[i] is the selected node.
    v = h[i].v

    h[i].w = 0                # make sure this node isn't chosen again
    while i:                  # fix up total weights
        h[i].tw -= w
        i >>= 1
    return v

def random_weighted_sample_no_replacement(items, n):
    heap = rws_heap(items)              # just make a heap...
    for i in range(n):
        yield rws_heap_pop(heap)        # and pop n items off it.
Jason Orendorff
  • +1 for awesome use of Python control structure variations that I haven't seen before – Sparr Jan 27 '10 at 23:57
  • See my answer to another question for a Python implementation of the binary-tree method: http://stackoverflow.com/questions/526255/probability-distribution-in-python/526843#526843 – Darius Bacon Jan 28 '10 at 02:56
  • Darius Bacon: Upvoted! While I was fiddling with this I found that you don't need a tree. It can be done with a heap. So I added an implementation of my own to this answer. – Jason Orendorff Jan 28 '10 at 06:56
  • Nice idea with the heap! I found the *name* 'heap' misleading for what you did, though, since that's usually taken to mean a structure that obeys the heap invariant on the weights. Here the heap invariant does apply, but to the .tw field and it's incidental to how your code works, if I understand it right. – Darius Bacon Jan 30 '10 at 01:41
  • BTW it looks like you could avoid leaving a 0 in the array by moving the array's last entry into its place, updating the .tw's back up to the root from the old and new positions. This doesn't appear worth it for this static problem, but it would for a dynamic one like my post was about. I'll update that post to point here. – Darius Bacon Jan 30 '10 at 01:45
  • Yes, I agree with both those comments. (I couldn't think of a better name for it than "heap", though. I guess it might be better call it a binary tree, and treat the array as an implementation detail. But that's confusing too!) – Jason Orendorff Jan 30 '10 at 13:43
  • There's a standard structure something like your heap known as the Fenwick tree: http://en.wikipedia.org/wiki/Fenwick_tree -- from a quick skim it doesn't sound the same, but I didn't really look into it. – Darius Bacon Oct 24 '10 at 02:24
  • I am confused, if each node in the tree has one child at h[i>>1] and one at h[(i<<1)+1], why does rws_heap() only calculate the total weights for h[i>>1]? Isn't it only getting the total weights for the right-hand children? – Tim Gilbert Mar 13 '12 at 23:17
  • Oh, per my earlier comment, I now see that h[i<<1] is a node's right-hand child and h[i>>1] is its parent, now I see - I mistook the direction of the bit-shift. I'll add a note to the comment about that. (Edit: and now I see the information is already in there, I just missed it. Sorry!) – Tim Gilbert Mar 13 '12 at 23:23
  • Is there any way to implement the Sampling without Replacement to not go through the O(m) path of constructing the heap when using weighted random sampling multiple times? (So I select 5 items from the heap, and then I want to select another 5 items from the same heap.) Right now, I have code that has the original values of the heap in a dictionary by index, "repairing" all indexes that have been modified. However, I want to avoid this step entirely. I tried copy.deepcopy for the heap but it yields much poorer performance than my current "repair" code. – chrtan May 03 '12 at 13:08
  • The "repair" strategy is probably the fastest. You can speed that up by storing the original values in each `Node`, rather than a separate dictionary. – Jason Orendorff May 03 '12 at 20:08
  • @JasonOrendorff Ahh. Got so excited about this working that I completely forgot to thank you! So thank you Jason for the speed up and the sample code. – chrtan May 10 '12 at 20:18
  • I tried to port this solution to C# but didn't get the expected results. Even when picking a single card the distribution is not respected: http://stackoverflow.com/questions/11775946/select-x-random-elements-from-a-weighted-list-in-c-sharp-without-replacement – Gabriel Aug 03 '12 at 02:25
  • I implemented this for items being a set of objects. One attribute of the object is a weight. I modified "for w, v in items: h.append(Node(w, v, w))" into "for object in items: h.append(Node(object.weight, object, object.weight". Is it the correct way to do it? – Antonin Oct 15 '12 at 09:23
  • I have an error: "if gas > h[i].tw:" / "IndexError: list index out of range". This error only happens when I'm using n=20, and doesn't appear for n=2... Any idea why? – Antonin Oct 15 '12 at 09:43
  • The "without replacement" strategy seems to be the same as suggested in the 1980 paper by Wong and Easton: [An Efficient Method for Weighted Sampling without Replacement](http://epubs.siam.org/action/showAbstract?page=111&volume=9&issue=1&journalCode=smjcat&). Or did I miss something? – krlmlr Feb 28 '13 at 22:56
  • JavaScript port (in case people want it): https://gist.github.com/seejohnrun/5291246 – John Apr 02 '13 at 10:17
  • Awesome answer btw - NOTE: you need to check the number of elements before selecting > items.length elements – John Apr 02 '13 at 10:17
  • If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights, this random_weighted_sample_no_replacement() is not correct. Consider an input of {'a': 1, 'b': 2, 'c': 3}, pick 2. In this case, 'c' should be as likely as 'a' and 'b' combined, which means 'c' must be chosen on every invocation of the function. But random_weighted_sample_no_replacement() won't return a 'c' every time. – ech Dec 11 '14 at 23:36
  • @ErikHaugen The way I interpreted the problem was that each time an element is selected, the remaining possibilities are chosen with probability proportional to their weights. Otherwise many *inputs* would be invalid: for example {'a': 1, 'b': 1, 'c': 4}, pick 2. It is impossible to produce 'c' twice as often as 'a' and 'b' combined. – Jason Orendorff Dec 12 '14 at 18:33
  • @JasonOrendorff—You're right that some inputs don't work if we want weights to be proportional to likelihood of being chosen. But it seems fairly likely that someone wanting a "choose a subset of k weighted elements" algorithm means this by "weighted". – ech Dec 17 '14 at 08:02
  • @Jason Orendorff — What exactly does "weight" mean in random_weighted_sample_no_replacement()? In other words, if a client of this function cared about the exact behavior of it, what requirements might they have that would cause them to want this behavior? What properties does it have? I'm asking because I can't think of a meaningful answer, and if there isn't one then I think it might be a good idea to not leave this implementation here for people to find. – ech Mar 06 '15 at 17:22
  • @ech I would understand your concern better if you explained why the no-replacement algorithm bothers you and the with-replacement algorithm does not. Both use the same interpretation of "weight" (i.e. it's a measure of the probability of being picked in any given selection, not the probability of being picked overall after *n* selections) – Jason Orendorff Mar 07 '15 at 01:25
  • Your with-replacement algorithm seems fine because "weight" means "relative probability of being chosen", which I think is what people mean by "weight" in this context. Your no-replacement algorithm seems problematic to me because "weight" doesn't mean that. I'm not sure if "weight" has a useful meaning in your no-replacement algorithm, which is why I'm asking if there is one. – ech Mar 08 '15 at 02:27
  • OK but we've been over this. The meaning of "weight" is that an item's probability of being picked, on a given selection, is proportional to its weight relative to the weight of other items still remaining to be selected. – Jason Orendorff Mar 09 '15 at 23:06
  • This has nice per-selection properties (e.g. if you draw a sample of N items, then draw a sample of M items from the leftovers, you get exactly the same distribution as if you pulled N+M items in the first place). Yours has end-to-end properties instead. Both seem useful to me. – Jason Orendorff Mar 09 '15 at 23:10
  • My point is that the probability of being picked being proportional to the weight relative to the remaining items seems like an implementation detail. I can't figure out why someone might want that behavior. The "per-selection" property you outline seems interesting, but it's hard to think in those terms when "the same distribution" isn't something that anyone wants in the first place. (Unless it is...) Perhaps an example of when a client might want this would help me? – ech Mar 11 '15 at 00:53
  • I think you ought to at least include a comment there clarifying what "weight" means in that example, since it isn't what people normally mean by that term. – ech Mar 11 '15 at 00:55
  • Building the heap should be O(m log m), no? You need to sum up O(log m) items for each element. So overall O((m+n) log m)? – Joe K May 13 '15 at 23:27
  • @JoeK No, it's O(m). The algorithm in `rws_heap()` looks linear, right? Normally building a heap would be O(m log m), but this is an unusual use of a heap. Here the nodes do not need to be sorted at all. So we don't do any comparisons or swaps when building the heap. Rather, each node is assigned a number, the total weight `tw`, that's naturally greater than its child nodes' total weight. So the heap property is true "by construction", if that makes sense. – Jason Orendorff May 14 '15 at 12:43
  • @JoeK A previous commenter said "I found the name 'heap' misleading for what you did", and I had to agree. I don't know a better word for it, though! – Jason Orendorff May 14 '15 at 12:44
  • @JasonOrendorff ah, you're right. I read the description and only skimmed the code (as I'm not a python guy) and assumed that you would be updating the log(m) parent values as you insert new values. But doing it from the bottom up at the end in one pass is of course O(m) (and very clever!) – Joe K May 14 '15 at 17:06
  • @Widdershins I backed out your edit changing `>` to `>=` in two places. Note that `gas` is a random floating-point number, so it's very unlikely to matter. But if we do run out of gas exactly on the boundary between two regions, I think we should probably round down. `random.random()` returns a number 0 <= n < 0.5 with probability 1/2 and a number 0.5 <= n < 1 with probability 1/2; note that the exact value 0.5 is in the first set, not the second. – Jason Orendorff Feb 23 '16 at 16:01
  • @Jasen Orendorff You should absolutely round up. 100% certainty. Re-read your statement about sets from 0.0->0.5->1.0 again very carefully, it's patently incorrect. – Mumbleskates Feb 23 '16 at 22:44
  • @JasonOrendorff Just because you don't notice fencepost errors when using floating point numbers doesn't mean they are not there. – Mumbleskates Feb 23 '16 at 22:54
  • Heh! You are totally right. 0.5 is in the second set. OK, I'm cool with `>=`. – Jason Orendorff Feb 25 '16 at 11:11
  • @JasonOrendorff Yeah, it took me like an hour trying to figure out why some elements were never returning. My data set is integer-weighted, and the algorithm works the same with integer weights but produces more obvious bugs. I wrote up tests for JS, Python, and pure-SQL and eventually wrapped my head around the whole algorithm. For things like this, I feel it's best to start out making it for integers, which will always generalize readily into floating point since they are largely isometric. – Mumbleskates Feb 25 '16 at 18:16
  • To be clear: the algorithm for sampling without replacement uses a mathematical trick to get the selections *in a sorted order* efficiently? And this is more efficient because we can *decode the random variables into selections, in a single pass through the weights?* But the algorithm for sampling *with* replacement still picks the elements one at a time, so the probabilities of the outcomes are order-dependent? – Karl Knechtel Jul 05 '22 at 21:18
  • @KarlKnechtel Questions 1 and 2: The answer is yes if you meant the first algorithm (_with_ replacement). Question 3: The second algorithm (_without_ replacement) picks elements one at a time. The output is in a random order but not uniformly random: heavy items are more likely to be picked early. – Jason Orendorff Jul 26 '22 at 17:38

If the sampling is with replacement, use the roulette-wheel selection technique (often used in genetic algorithms):

  1. sort the weights
  2. compute the cumulative weights
  3. pick a random number in [0,1]*totalWeight
  4. find the interval in which this number falls into
  5. select the elements with the corresponding interval
  6. repeat k times

(figure: roulette-wheel selection diagram, showing items laid out as intervals proportional to their weights)

If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that they sum to 1 (a valid probability distribution function).
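The numbered steps above can be sketched in Python (a sketch, not Amro's code; the function name is mine, the optional sorting in step 1 is skipped, and step 4 uses a binary search):

```python
import bisect
import random

def roulette_wheel_select(items, weights, k):
    """Pick k items with replacement, probability proportional to weight."""
    # Step 2: cumulative weights (sorting the weights is unnecessary).
    cumulative = []
    total = 0.0
    for w in weights:
        total += w
        cumulative.append(total)
    picks = []
    for _ in range(k):                          # step 6: repeat k times
        x = random.random() * total             # step 3: point in [0, totalWeight)
        i = bisect.bisect_right(cumulative, x)  # step 4: find its interval
        picks.append(items[i])                  # step 5: select that element
    return picks
```

For example, `roulette_wheel_select(["low", "mid", "large"], [10, 100, 890], 5)` draws 5 items, most of them "large".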

Amro
  • +1, this wins big on clarity. But note that the roulette-wheel algorithm takes O(*n* log *m* + *m*) time, where *n* is the number of samples and *m* is the number of items. That's if you omit the sorting, which is unnecessary, and do a binary search in step 4. Also, it requires O(*m*) space for the cumulative weights. In my answer there's a 14-line function that does the same thing in O(*n* + *m*) time and O(1) space. – Jason Orendorff Jan 28 '10 at 09:41
  • If I have to remove selected element I need to copy the whole list, I am assuming we are not allowed to do any modification on the input list, which is expensive. – nimcap Jan 28 '10 at 10:03
  • do you need to sort the weights? is it necessary? –  Jan 04 '11 at 06:20
  • @Zenikoder: I believe you're right that sorting is not needed; it's just that it might make the search in step 4 easier... – Amro Jan 07 '11 at 12:30
  • Do you think that using a Fenwick tree can help here? – sw. Mar 22 '12 at 14:22
  • Without replacement method proposed is not good. Google for "systematic sampling with unequal probabilities". There's a O(n) algorithm and no need to recompute weights. – Pawel Apr 14 '12 at 20:17
  • *"If the sampling is without replacement, you can adapt the above technique by removing the selected element from the list after each iteration, then re-normalizing the weights so that their sum is 1 (valid probability distribution function)"* This is a tempting algorithm, but if you try to prove that it's correct, you'll find out it isn't: it actually skews the distribution. However, there is a very clever way to adapt the roulette-wheel selection to sampling without replacement. – Stef Sep 20 '21 at 13:56
  • Sampling k items without replacement: (1) Normalize the weights w_i so that sum w_i = k. (2) Compute the cumulative weights (sorting isn't necessary). (3) Pick a random number x in [0,1]. (4) Select the k elements corresponding to the k numbers: x, x+1, x+2, ..., x+k-1. – Stef Sep 20 '21 at 13:57
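Stef's systematic-sampling recipe from the comment above can be sketched like this (a sketch with a hypothetical function name; it assumes k·w_i ≤ sum(w) for every weight, since otherwise some element would have to be selected more than once):

```python
import random

def systematic_sample(items, weights, k):
    """Systematic sampling without replacement: element i is included
    with probability exactly k * w_i / sum(w), which must not exceed 1."""
    total = float(sum(weights))
    # (1) Normalize the weights so they sum to k, and
    # (2) take cumulative sums; sorting isn't necessary.
    cumulative = []
    acc = 0.0
    for w in weights:
        acc += w * k / total
        cumulative.append(acc)
    # (3) One random start x in [0, 1).
    x = random.random()
    # (4) Select the elements whose segments contain x, x+1, ..., x+k-1.
    picks = []
    i = 0
    for j in range(k):
        point = x + j
        while i < len(cumulative) - 1 and cumulative[i] <= point:
            i += 1
        picks.append(items[i])
    return picks
```

A single random number drives all k selections, which is what makes the inclusion probabilities exact.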

I know this is a very old question, but I think there's a neat trick to do this in O(n) time if you apply a little math!

The exponential distribution has two very useful properties.

  1. Given n samples from different exponential distributions with different rate parameters, the probability that a given sample is the minimum is equal to its rate parameter divided by the sum of all rate parameters.

  2. It is "memoryless". So if you already know the minimum, then the probability that any of the remaining elements is the 2nd-to-min is the same as the probability that if the true min were removed (and never generated), that element would have been the new min. This seems obvious, but I think because of some conditional probability issues, it might not be true of other distributions.

Using fact 1, we know that choosing a single element can be done by generating these exponential distribution samples with rate parameter equal to the weight, and then choosing the one with minimum value.

Using fact 2, we know that we don't have to re-generate the exponential samples. Instead, just generate one for each element, and take the k elements with lowest samples.

Finding the lowest k can be done in O(n). Use the Quickselect algorithm to find the k-th element, then simply take another pass through all elements and output all lower than the k-th.

A useful note: if you don't have immediate access to a library to generate exponential distribution samples, it can be easily done by: -ln(rand())/weight
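Putting the pieces together, the method can be sketched as follows (the function name is mine; for brevity this sorts all n keys, whereas Quickselect would give the O(n) bound described above):

```python
import math
import random

def weighted_sample_no_replacement(items, weights, k):
    # One exponential sample per item, with rate equal to its weight.
    # Using 1 - random.random() keeps the argument of log in (0, 1].
    keyed = [(-math.log(1.0 - random.random()) / w, item)
             for item, w in zip(items, weights)]
    # The k smallest exponential samples form the weighted sample.
    keyed.sort()
    return [item for _, item in keyed[:k]]
```

By fact 1, the item with the smallest key is chosen with probability proportional to its weight; by fact 2, taking the next-smallest keys continues the process correctly.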

Joe K
  • I think this is the only correct answer here after so many years. I once found this correct way but never remembered to enter here. I will read your answer thoroughly and accept it. – nimcap May 15 '15 at 10:21
  • Is there a reference to this method? I found [Weighted Random Sampling](https://docdro.id/XTglD09) which has similar idea. While I see the logic in that what I miss is a reference tho show this is the correct distribution. – Royi Dec 22 '18 at 17:52
  • No, sorry, I don't have one; this was something I came up with on my own. So you're right to be skeptical. But I'd be surprised if someone smarter than I hadn't discovered it before and given it a more rigorous treatment than a stack overflow post. – Joe K Dec 23 '18 at 21:40
  • i implemented this in JavaScript but it does not give the expected results: https://jsfiddle.net/gasparl/kamub5rq/18/ - any ideas why? – gaspar Aug 31 '19 at 09:54

I've done this in Ruby:

https://github.com/fl00r/pickup

require 'pickup'
pond = {
  "selmon"  => 1,
  "carp" => 4,
  "crucian"  => 3,
  "herring" => 6,
  "sturgeon" => 8,
  "gudgeon" => 10,
  "minnow" => 20
}
pickup = Pickup.new(pond, uniq: true)
pickup.pick(3)
#=> [ "gudgeon", "herring", "minnow" ]
pickup.pick
#=> "herring"
pickup.pick
#=> "gudgeon"
pickup.pick
#=> "sturgeon"
fl00r
  • This version returns wrong answers compared to Jason Orendorff's posting. Specifically, on weights like pick(4, unique) from [1,1,1,1,9996], the results of the low-weight items are NOT even. – robbat2 Jan 20 '16 at 01:10

Here's a Go implementation from geodns:

package foo

import (
    "log"
    "math/rand"
)

type server struct {
    Weight int
    data   interface{}
}

func foo(servers []server) []server {
    // servers list is already sorted by the Weight attribute

    // number of items to pick
    max := 4

    result := make([]server, max)

    sum := 0
    for _, r := range servers {
        sum += r.Weight
    }

    for si := 0; si < max; si++ {
        n := rand.Intn(sum + 1)
        s := 0

        for i := range servers {
            s += int(servers[i].Weight)
            if s >= n {
                log.Println("Picked record", i, servers[i])
                sum -= servers[i].Weight
                result[si] = servers[i]

                // remove the server from the list
                servers = append(servers[:i], servers[i+1:]...)
                break
            }
        }
    }

    return result
}
Ask Bjørn Hansen

If you want to pick x elements from a weighted set without replacement such that elements are chosen with a probability proportional to their weights:

import random

def weighted_choose_subset(weighted_set, count):
    """Return a random sample of count elements from a weighted set.

    weighted_set should be a sequence of tuples of the form 
    (item, weight), for example:  [('a', 1), ('b', 2), ('c', 3)]

    Each element from weighted_set shows up at most once in the
    result, and the relative likelihood of two particular elements
    showing up is equal to the ratio of their weights.

    This works as follows:

    1.) Line up the items along the number line from [0, the sum
    of all weights) such that each item occupies a segment of
    length equal to its weight.

    2.) Randomly pick a number "start" in the range [0, total
    weight / count).

    3.) Find all the points "start + n/count" (for all integers n
    such that the point is within our segments) and yield the set
    containing the items marked by those points.

    Note that this implementation may not return each possible
    subset.  For example, with the input ([('a', 1), ('b', 1),
    ('c', 1), ('d', 1)], 2), it may only produce the sets ['a',
    'c'] and ['b', 'd'], but it will do so such that the weights
    are respected.

    This implementation only works for nonnegative integral
    weights.  The highest weight in the input set must not exceed
    the total weight divided by the count; otherwise it would
    be impossible to respect the weights while never returning
    that element more than once per invocation.
    """
    if count == 0:
        return []

    total_weight = 0
    max_weight = 0
    borders = []
    for item, weight in weighted_set:
        if weight < 0:
            raise RuntimeError("All weights must be nonnegative integers")
        # Scale up weights so dividing total_weight / count doesn't truncate:
        weight *= count
        total_weight += weight
        borders.append(total_weight)
        max_weight = max(max_weight, weight)

    step = int(total_weight / count)

    if max_weight > step:
        raise RuntimeError(
            "Each weight must be less than total weight / count")

    next_stop = random.randint(0, step - 1)

    results = []
    current = 0
    for i in range(count):
        while borders[current] <= next_stop:
            current += 1
        results.append(weighted_set[current][0])
        next_stop += step

    return results
ech

If you want to generate large arrays of random integers with replacement, you can use piecewise linear interpolation. For example, using NumPy/SciPy:

import numpy
import scipy.interpolate

def weighted_randint(weights, size=None):
    """Given an n-element vector of weights, randomly sample
    integers up to n with probabilities proportional to weights"""
    n = weights.size
    # normalize so that the weights sum to unity
    weights = weights / numpy.linalg.norm(weights, 1)
    # cumulative sum of weights
    cumulative_weights = weights.cumsum()
    # piecewise-linear interpolating function whose domain is
    # the unit interval and whose range is the integers up to n
    f = scipy.interpolate.interp1d(
            numpy.hstack((0.0, cumulative_weights)),
            numpy.arange(n + 1), kind='linear')
    return f(numpy.random.random(size=size)).astype(int)

This approach does not work if you want to sample without replacement.

chairmanK

This one does exactly that, with a single O(n) scan per pick and no excess memory usage. I believe this is a clever and efficient solution that is easy to port to any language. The first two lines just populate sample data in Drupal.

function getNrandomGuysWithWeight($numitems){
  $q = db_query('SELECT id, weight FROM theTableWithTheData');
  $q = $q->fetchAll();

  $accum = 0;
  foreach($q as $r){
    $accum += $r->weight;
    $r->weight = $accum;
  }

  $out = array();

  while(count($out) < $numitems && count($q)){
    $n = rand(0,$accum);
    $lessaccum = NULL;
    $prevaccum = 0;
    $idxrm = 0;
    foreach($q as $i=>$r){
      if(($lessaccum == NULL) && ($n <= $r->weight)){
        $out[] = $r->id;
        $lessaccum = $r->weight - $prevaccum;
        $accum -= $lessaccum;
        $idxrm = $i;
      }else if($lessaccum){
        $r->weight -= $lessaccum;
      }
      $prevaccum = $r->weight;
    }
    unset($q[$idxrm]);
  }
  return $out;
}
jacmkno

In the question you linked to, Kyle's solution would work with a trivial generalization. Scan the list and sum the total weights. Then the probability of choosing an element should be:

1 - (1 - (#needed/(weight left)))/(weight at n). After visiting a node, subtract its weight from the total. Also, if you need n and have n left, you have to stop explicitly.

You can check that with everything having weight 1, this simplifies to Kyle's solution.

Edited: (had to rethink what twice as likely meant)

Kyle Butt
  • let's say I have 4 elements in the list with weights {2, 1, 1, 1}. I am going to choose 3 from this list. According to your formula, for the first element 3* 2/5 = 1.2 Which is >1 What am I missing here? – nimcap Jan 26 '10 at 17:03
  • now with {2,1,1,1} choose 3, the probability of picking the first element is 1 - (1 - (3/5))/2 = 1 - (2/5)/2 = 1 - 1/5 = 4/5, as expected. – Kyle Butt Jan 26 '10 at 17:58
  • I believe there is something wrong with your formula, I don't have time to write it now, I will when I have time. If you try to apply formula for the first two elements in different orders, you will see it won't produce the same results. It should provide the same results regardless of the order. – nimcap Jan 26 '10 at 22:32
  • say there are 4 elements with weights {7, 1, 1, 1} and we are going to pick 2. Lets calculate the chance of picking first 2 elements: P(1st)*P(2nd) = (1-(1-2/10)/7)*(1-(1-1/3)) = 31/105. Let's change the list to {1, 7, 1, 1}, probability of picking the first 2 elements should remain same P(1st)*P(2nd) = (1-(1-2/10))*(1-(1-1/9)/7) = 11/63. They are not same. – nimcap Jan 27 '10 at 08:42
  • Even simpler, say there are 2 elements with weights {1/2, 1/2} and we are going to pick 2. The probability should be 1; we have to take both. But the formula gives 1 - (1 - (2/1)))/(1/2) = 3. – Jason Orendorff Jan 28 '10 at 13:21
  • Unfortunately I don't see an easy way to simplify calculating the real probability, which is P(a, n) = if n == 0 then 0 else 1 - sum i=1..|a| of (weight(a[i]) / total_weight(a) * (1 - P(a omitting a[i], n - 1))). – Jason Orendorff Jan 28 '10 at 13:30

I'm putting here a simple solution for picking 1 item; you can easily expand it for k items (Java style). It assumes the weights sum to 1 (otherwise, scale Math.random() by the total weight):

double random = Math.random();
double sum = 0;
for (int i = 0; i < items.length; i++) {
    val = items[i];
    sum += val.getValue();
    if (sum > random) {
        selected = val;
        break;
    }
}
shem

I have implemented an algorithm similar to Jason Orendorff's idea in Rust here. My version additionally supports bulk operations: insert and remove (when you want to remove a bunch of items given by their ids, not through the weighted selection path) from the data structure in O(m + log n) time, where m is the number of items to remove and n the number of items stored.

kirillkh

Sampling without replacement with recursion: an elegant and very short solution in C#

//how many ways we can choose 4 out of 60 students, so that every time we choose different 4

class Program
{
    static void Main(string[] args)
    {
        int group = 60;
        int studentsToChoose = 4;

        Console.WriteLine(FindNumberOfStudents(studentsToChoose, group));
    }

    private static int FindNumberOfStudents(int studentsToChoose, int group)
    {
        if (studentsToChoose == group || studentsToChoose == 0)
            return 1;

        return FindNumberOfStudents(studentsToChoose, group - 1) + FindNumberOfStudents(studentsToChoose - 1, group - 1);

    }
}
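Note that this recursion is Pascal's rule, C(n, k) = C(n-1, k) + C(n-1, k-1), so it counts the number of distinct 4-element subsets rather than drawing a weighted sample. A quick Python transcription (mine, for illustration) to check that against math.comb:

```python
from math import comb

def find_number_of_students(students_to_choose, group):
    # direct transcription of the C# recursion above (Pascal's rule)
    if students_to_choose == group or students_to_choose == 0:
        return 1
    return (find_number_of_students(students_to_choose, group - 1)
            + find_number_of_students(students_to_choose - 1, group - 1))
```

For the original numbers, find_number_of_students(4, 60) equals comb(60, 4) = 487635; the plain recursion is exponential, so prefer math.comb (or memoization) for large inputs.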
Angel
  • 90
  • 5
0

I just spent a few hours trying to understand the algorithms underlying sampling without replacement, and this topic is more complex than I initially thought. That's exciting! For the benefit of future readers (have a good day!) I document my insights here, including a ready-to-use function further below which respects the given inclusion probabilities. A nice and quick mathematical overview of the various methods can be found here: Tillé: Algorithms of sampling with equal or unequal probabilities. For example, Jason's method can be found on page 46. The caveat with his method is that the weights are not proportional to the inclusion probabilities, as also noted in the document. Actually, the i-th inclusion probability can be recursively computed as follows:

def inclusion_probability(i, weights, k):
    """
        Computes the inclusion probability of the i-th element
        in a randomly sampled k-tuple using Jason's algorithm
        (see https://stackoverflow.com/a/2149533/7729124)
    """
    if k <= 0: return 0
    cum_p = 0
    for j, weight in enumerate(weights):
        # compute the probability of j being selected considering the weights
        p = weight / sum(weights)

        if i == j:
            # if this is the target element, we don't have to go deeper,
            # since we know that i is included
            cum_p += p
        else:
            # if this is not the target element, than we compute the conditional
            # inclusion probability of i under the constraint that j is included
            cond_i = i if i < j else i-1
            cond_weights = weights[:j] + weights[j+1:]
            cond_p = inclusion_probability(cond_i, cond_weights, k-1)
            cum_p += p * cond_p
    return cum_p

And we can check the validity of the function above by comparing

In : for i in range(3): print(i, inclusion_probability(i, [1,2,3], 2))
0 0.41666666666666663
1 0.7333333333333333
2 0.85

to

In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: random_weighted_sample_no_replacement([(1,'a'),(2,'b'),(3,'c')],2))
Out: Counter({'a': 4198, 'b': 7268, 'c': 8534})

One way - also suggested in the document above - to specify the inclusion probabilities is to compute the weights from them. The whole complexity of the question at hand stems from the fact that one cannot do that directly, since one basically has to invert the recursion formula; symbolically, I claim this is impossible. Numerically it can be done using all kinds of methods, e.g. Newton's method. However, the complexity of inverting the Jacobian in plain Python quickly becomes unbearable; I really recommend looking into numpy.random.choice in this case.
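For reference, a minimal numpy call for weighted sampling without replacement (my example; note that p must be normalized to sum to 1, and whether the resulting inclusion probabilities are exactly proportional to p is the same subtle question discussed throughout this thread):

```python
import numpy as np

rng = np.random.default_rng(0)
elements = ['a', 'b', 'c', 'd']
weights = np.array([1.0, 2.0, 3.0, 4.0])

# p must sum to 1; replace=False requests sampling without replacement
sample = rng.choice(elements, size=2, replace=False, p=weights / weights.sum())
```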

Luckily there is a method using plain Python which might or might not be sufficiently performant for your purposes; it works great if there aren't that many different weights. You can find the algorithm on pages 75 & 76. It works by splitting up the sampling process into parts with the same inclusion probabilities, i.e. we can use random.sample again! I am not going to explain the principle here since the basics are nicely presented on page 69. Here is the code with hopefully a sufficient amount of comments:

def sample_no_replacement_exact(items, k, best_effort=False, random_=None, ε=1e-9):
    """
        Returns a random sample of k elements from items, where items is a list of
        tuples (weight, element). The inclusion probability of an element in the
        final sample is given by
           k * weight / sum(weights).

        Note that the function raises if an inclusion probability cannot be
        satisfied, e.g. the following call is obviously illegal:
           sample_no_replacement_exact([(1,'a'),(2,'b')],2)
        Since selecting two elements means selecting both all the time,
        'b' cannot be selected twice as often as 'a'. In general it can be hard to
        spot if the weights are illegal and the function does *not* always raise
        an exception in that case. To remedy the situation you can pass
        best_effort=True which redistributes the inclusion probability mass
        if necessary. Note that the inclusion probabilities will change
        if deemed necessary.

        The algorithm is based on the splitting procedure on page 75/76 in:
        http://www.eustat.eus/productosServicios/52.1_Unequal_prob_sampling.pdf
        Additional information can be found here:
        https://stackoverflow.com/questions/2140787/

        :param items: list of tuples of type weight,element
        :param k: length of resulting sample
        :param best_effort: fix inclusion probabilities if necessary,
                            (optional, defaults to False)
        :param random_: random module to use (optional, defaults to the
                        standard random module)
        :param ε: fuzziness parameter when testing for zero in the context
                  of floating point arithmetic (optional, defaults to 1e-9)
        :return: random sample set of size k
        :exception: throws ValueError in case of bad parameters,
                    throws AssertionError in case of algorithmic impossibilities
    """
    # random_ defaults to the random submodule
    if not random_:
        random_ = random

    # special case empty return set
    if k <= 0:
        return set()

    if k > len(items):
        raise ValueError("resulting tuple length exceeds number of elements (k > n)")

    # sort items by weight
    items = sorted(items, key=lambda item: item[0])

    # extract the weights and elements
    weights, elements = list(zip(*items))

    # compute the inclusion probabilities (short: π) of the elements
    scaling_factor = k / sum(weights)
    π = [scaling_factor * weight for weight in weights]

    # in case of best_effort: if an inclusion probability exceeds 1,
    # try to rebalance the probabilities such that:
    # a) no probability exceeds 1,
    # b) the probabilities still sum to k, and
    # c) the probability masses flow from top to bottom:
    #    [0.2, 0.3, 1.5] -> [0.2, 0.8, 1]
    # (remember that π is sorted)
    if best_effort and π[-1] > 1 + ε:
        # probability mass we still have to distribute
        debt = 0.
        for i in reversed(range(len(π))):
            if π[i] > 1.:
                # an 'offender', take away excess
                debt += π[i] - 1.
                π[i] = 1.
            else:
                # case π[i] < 1, i.e. a 'safe' element
                # maximum we can transfer from debt to π[i] and still not
                # exceed 1 is computed by the minimum of:
                # a) 1 - π[i], and
                # b) debt
                max_transfer = min(debt, 1. - π[i])
                debt -= max_transfer
                π[i] += max_transfer
        assert debt < ε, "best effort rebalancing failed (impossible)"

    # make sure we are talking about probabilities
    if any(not (0 - ε <= π_i <= 1 + ε) for π_i in π):
        raise ValueError("inclusion probabilities not satisfiable: {}" \
                         .format(list(zip(π, elements))))

    # special case equal probabilities
    # (up to fuzziness parameter, remember that π is sorted)
    if π[-1] < π[0] + ε:
        return set(random_.sample(elements, k))

    # compute the two possible lambda values, see formula 7 on page 75
    # (remember that π is sorted)
    λ1 = π[0] * len(π) / k
    λ2 = (1 - π[-1]) * len(π) / (len(π) - k)
    λ = min(λ1, λ2)

    # there are two cases now, see also page 69
    # CASE 1
    # with probability λ we are in the equal probability case
    # where all elements have the same inclusion probability
    if random_.random() < λ:
        return set(random_.sample(elements, k))

    # CASE 2:
    # with probability 1-λ we are in the case of a new sample without
    # replacement problem which is strictly simpler,
    # it has the following new probabilities (see page 75, π^{(2)}):
    new_π = [
        (π_i - λ * k / len(π))
        /
        (1 - λ)
        for π_i in π
    ]
    new_items = list(zip(new_π, elements))

    # the first few probabilities might be 0, remove them
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    while new_items and new_items[0][0] < ε:
        new_items = new_items[1:]

    # the last few probabilities might be 1, remove them and mark them as selected
    # NOTE: we make sure that floating point issues do not arise
    #       by using the fuzziness parameter
    selected_elements = set()
    while new_items and new_items[-1][0] > 1 - ε:
        selected_elements.add(new_items[-1][1])
        new_items = new_items[:-1]

    # the algorithm reduces the length of the sample problem,
    # it is guaranteed that:
    # if λ = λ1: the first item has probability 0
    # if λ = λ2: the last item has probability 1
    assert len(new_items) < len(items), "problem was not simplified (impossible)"

    # recursive call with the simpler sample problem
    # NOTE: we have to make sure that the selected elements are included
    return sample_no_replacement_exact(
        new_items,
        k - len(selected_elements),
        best_effort=best_effort,
        random_=random_,
        ε=ε
    ) | selected_elements

Example:

In : sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c')],2)
Out: {'b', 'c'}

In : import collections, itertools
In : sample_tester = lambda f: collections.Counter(itertools.chain(*(f() for _ in range(10000))))
In : sample_tester(lambda: sample_no_replacement_exact([(1,'a'),(2,'b'),(3,'c'),(4,'d')],2))
Out: Counter({'a': 2048, 'b': 4051, 'c': 5979, 'd': 7922})

The weights sum up to 10, hence the inclusion probabilities compute to: a → 20%, b → 40%, c → 60%, d → 80%. (Sum: 200% = k.) It works!

Just one word of caution for production use of this function: it can be very hard to spot illegal inputs for the weights. An obvious illegal example is

In: sample_no_replacement_exact([(1,'a'),(2,'b')],2)
ValueError: inclusion probabilities not satisfiable: [(0.6666666666666666, 'a'), (1.3333333333333333, 'b')]

b cannot appear twice as often as a, since both always have to be selected. There are more subtle examples. To avoid an exception in production, just use best_effort=True, which rebalances the inclusion probability mass such that we always have a valid distribution. Obviously this might change the inclusion probabilities.

-2

I used an associative map (weight, object), for example:

{
(10,"low"),
(100,"mid"),
(10000,"large")
}

total=10110

Pick a random number between 0 and 'total' and iterate over the keys until this number falls in the corresponding range.

Pierre
  • 34,472
  • 31
  • 113
  • 192
  • 1
    The way I see it, the question is about selecting multiple items in a single pass (see the linked question). Your approach requires a pass for each selection. – Oak Jan 27 '10 at 17:22