
Suppose I want to select two records from a set of three, where the probabilities of the three are 0.1, 0.5, and 0.4, respectively. Per this SO answer, numpy.random.choice will work:

import pandas as pd
from numpy import random

df = pd.DataFrame({'prob': [0.1, 0.5, 0.4]})

random.seed(0)
random.choice(df.index, p=df.prob, size=2, replace=False)
# array([1, 2])

Now suppose each item also has a weight, and rather than selecting two items, I want to select items up to a maximum total weight. So if these items have weights of 4, 5, and 6, and I have a budget of 10, I could select {0, 1} or {0, 2}. The relative probabilities of each item being included would still be governed by the probabilities (though in practice I think an algorithm would return the low-weight item 0 more often, because it can serve as a filler).
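For concreteness, a quick brute-force check (illustrative only, with the weights and budget hard-coded from this example) confirms that {0, 1} and {0, 2} are the only maximal sets that fit:

from itertools import combinations

weights = {0: 4, 1: 5, 2: 6}
budget = 10

for r in range(1, len(weights) + 1):
    for subset in combinations(weights, r):
        total = sum(weights[i] for i in subset)
        # A set is maximal if it fits the budget and no unused item does.
        extendable = any(total + weights[i] <= budget
                         for i in weights if i not in subset)
        if total <= budget and not extendable:
            print(subset, total)
# (0, 1) 9
# (0, 2) 10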

Is there a way to adapt random.choice for this, or another approach to yield this result?

Max Ghenis

2 Answers


Here's a one-at-a-time approach:

  1. Get the set of items with weight below the budget.
  2. Select a random item from this set, according to each item's (renormalized) probability.
  3. Add this to a running list and remove it from the set of available items.
  4. Repeat 1-3 until no remaining items can fill the gap between the accrued weight and the budget.

Here's a function to do it, which as expected only produces the sets {0, 1} and {0, 2} in the example:

def weighted_budgeted_random_sample(df, budget):
    """Produce a weighted budgeted random sample.

    Args:
        df: DataFrame with columns for `prob` and `weight`.
        budget: Total weight budget.

    Returns:
        List of index values of df that constitute the sample.
    """
    ids = []
    total = 0
    while total < budget:
        remaining = budget - total
        df = df[df.weight <= remaining]
        # Stop if there are no records with small enough weight.
        if df.shape[0] == 0:
            break
        # Select one record, renormalizing probabilities over the
        # items that still fit.
        selection = random.choice(df.index, p=(df.prob / df.prob.sum()))
        total += df.loc[selection].weight
        # Rebind rather than dropping in place, which would mutate a slice.
        df = df.drop(selection)
        ids.append(selection)
    return ids

Example:

df = pd.DataFrame({
    'weight': [4, 5, 6],
    'prob': [0.1, 0.5, 0.4]
})

weighted_budgeted_random_sample(df, 10)
# [2, 0]

This could probably be optimized by starting with random.choice for a number of items that wouldn't be constrained by the budget.
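For example, here's a sketch of that optimization (the function name and structure are my own, not from the original answer, and it assumes numpy's weighted sampling without replacement is an acceptable stand-in for the sequential draws): any budget // max_weight items are guaranteed to fit, so that many can be drawn in a single random.choice call before falling back to the loop.

def weighted_budgeted_random_sample_fast(df, budget):
    # Any budget // max_weight items fit no matter which are drawn,
    # so draw that many in one call (hypothetical optimization).
    safe_n = min(int(budget // df.weight.max()), len(df))
    ids = list(random.choice(df.index, size=safe_n, replace=False,
                             p=(df.prob / df.prob.sum()))) if safe_n else []
    total = df.loc[ids].weight.sum()
    df = df.drop(ids)
    # Finish with the one-at-a-time loop for the remaining budget.
    while total < budget:
        df = df[df.weight <= budget - total]
        if df.shape[0] == 0:
            break
        selection = random.choice(df.index, p=(df.prob / df.prob.sum()))
        total += df.loc[selection].weight
        df = df.drop(selection)
        ids.append(selection)
    return ids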

Max Ghenis

What you can do is use np.random.choice with the probabilities as you do, but for the full size of your data. Then reindex the df with the new order you get from np.random.choice, take the cumsum of the weight column, and finally return only the index values up to the point where the cumulative weight reaches the budget.

import numpy as np

def weighted_budgeted_random_sample_all(df, budget):
    # Draw a full random ordering of the index, weighted by probability.
    random_index_order = np.random.choice(df.index, size=len(df),
                                          p=df.prob, replace=False)
    # Accumulate weights in that order and keep rows within the budget.
    s = df.reindex(random_index_order).weight.cumsum()
    return s[s <= budget].index.values

Now the problem with this method is that, with the df from the question and a budget of 10, some samples contain only index 1 or only index 2: if random_index_order is [2, 1, 0] or [1, 2, 0], the cumsum already exceeds 10 at the second row.

You can see this with a Counter; the tuple and np.sort are just to make the Counter work and the results easier to read:

from collections import Counter

print(Counter([tuple(np.sort(weighted_budgeted_random_sample_all(df, 10)))
               for i in range(1000)]))
# Counter({(0, 1): 167, (0, 2): 111, (1,): 390, (2,): 332})

As you can see, some draws put items 1 and 2 as the first two values, and the result is then only {1} or {2}, because the sum of their weights is 11, which exceeds the budget of 10.

But if you try the same thing with a budget of 11, you get the expected output:

print(Counter([tuple(np.sort(weighted_budgeted_random_sample_all(df, 11)))
               for i in range(1000)]))
# Counter({(0, 1): 169, (0, 2): 111, (1, 2): 720})

Here you find all three possible sets, and the fact that the set {1, 2} is obtained most often makes sense.

I saw that you revised your question after a comment, saying you'll work on the one-item-at-a-time approach. I believe doing this will have an effect on the overall probabilities, but I don't know enough about probability to say exactly why. If you want, you can combine your approach and mine to save some time:

def weighted_budgeted_random_sample_mixed(df, budget):
    ids = []
    total = 0
    dftemp = df.copy()
    while total < budget:
        remaining = budget - total
        dftemp = dftemp[dftemp.weight <= remaining]
        # Stop if there are no records with small enough weight.
        if dftemp.shape[0] == 0:
            break
        # Draw a new random order over the items that still fit.
        new_index = np.random.choice(dftemp.index, size=len(dftemp),
                                     p=(dftemp.prob / dftemp.prob.sum()),
                                     replace=False)
        s = dftemp.reindex(new_index).weight.cumsum()
        # Keep only the rows that fit within the remaining budget.
        s = s[s <= remaining]
        # The last value of the cumsum is the total weight added this round.
        total += s.max()
        # Rebind rather than dropping in place, which would mutate a slice.
        dftemp = dftemp.drop(s.index)
        ids += s.index.tolist()
    return ids

Now, for a comparison with your method in terms of results:

# Your approach
print(Counter([tuple(np.sort(weighted_budgeted_random_sample(df, 10)))
               for i in range(1000)]))
# Counter({(0, 1): 546, (0, 2): 454})

# Mixed approach
print(Counter([tuple(np.sort(weighted_budgeted_random_sample_mixed(df, 10)))
               for i in range(1000)]))
# Counter({(0, 1): 554, (0, 2): 446})

As you can see, the results are quite similar, and the mixed approach should be faster on larger DataFrames, as it minimizes looping in the `while`.
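A rough way to check that speed claim is an illustrative benchmark on synthetic data (the DataFrame size, weight range, and budget below are made up, and actual numbers will vary), assuming both functions defined above are in scope:

import timeit

import numpy as np
import pandas as pd

# Synthetic data: 10,000 items with random weights and probabilities.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'weight': rng.integers(1, 20, size=10_000),
    'prob': rng.random(10_000),
})
big['prob'] /= big.prob.sum()

print(timeit.timeit(lambda: weighted_budgeted_random_sample(big, 500),
                    number=10))
print(timeit.timeit(lambda: weighted_budgeted_random_sample_mixed(big, 500),
                    number=10))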

Ben.T
  • Thank you, this is great, and good to learn about `Counter`. I agree that the one-at-a-time approach affects probabilities, especially with few items (1 gets selected 55% of the time even though it has only a 10% probability). In practice I'll have hundreds of thousands of items to choose from, and hitting the target closely is more important, so this won't be such an issue. – Max Ghenis Jan 03 '19 at 16:31
  • @MaxGhenis Glad it worked for you, and thanks for confirming the effect on probabilities; I'm learning about this these days for a new job I start soon :) – Ben.T Jan 03 '19 at 17:50