What you can do is use `np.random.choice` with the probabilities as you already do, but for the full size of your data. Then reindex the `df` with the new order you get from `np.random.choice`, use `cumsum` on the `weight` column, and finally return only the indices whose cumulative weight stays within the budget.
import numpy as np

def weighted_budgeted_random_sample_all(df, budget):
    # Draw a full random ordering of the whole index, weighted by prob.
    random_index_order = np.random.choice(df.index, size=len(df),
                                          p=df.prob, replace=False)
    # Cumulative weight in that random order.
    s = df.reindex(random_index_order).weight.cumsum()
    # Keep only the indices that fit within the budget.
    return s[s <= budget].index.values
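For reference, here is a small `df` you can use to reproduce the numbers below. These are hypothetical values (the real `df` is in the question), but they are consistent with all the counts shown in this answer:

import pandas as pd

# Hypothetical data: what matters is that weight[1] + weight[2] == 11,
# both pairs containing index 0 fit in a budget of 10, and prob sums to 1.
df = pd.DataFrame({'weight': [3, 5, 6], 'prob': [0.1, 0.5, 0.4]})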
Now, the problem with this method is that, with the `df` as in the question and a budget of 10, some results contain only the index 1 or only the index 2: if `random_index_order` is equal to `[2,1,0]` or `[1,2,0]`, then the `cumsum` is already higher than 10 at the second row, so only the first index is kept.
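With the hypothetical `df` above, you can see this directly; the order `[2, 1, 0]` already blows the budget at the second row:

print(df.reindex([2, 1, 0]).weight.cumsum())
# 2     6
# 1    11   <- above the budget of 10: only index 2 survives s[s <= 10]
# 0    14
# Name: weight, dtype: int64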
You can see this with a `Counter`; the `tuple` and `np.sort` are just there to make the `Counter` work and the result easier to read:
from collections import Counter

print(Counter([tuple(np.sort(weighted_budgeted_random_sample_all(df, 10)))
               for i in range(1000)]))
# Counter({(0, 1): 167, (0, 2): 111, (1,): 390, (2,): 332})
As you can see, some draws had 1 and 2 as the first two values, and the result is then only 1 or only 2 because the sum of their weights is 11, which already exceeds the budget of 10.
But actually, if you try the same thing with a budget of 11, then you get the expected output:
print(Counter([tuple(np.sort(weighted_budgeted_random_sample_all(df, 11)))
               for i in range(1000)]))
# Counter({(0, 1): 169, (0, 2): 111, (1, 2): 720})
Here you find all three possible sets, and the fact that the set `{1, 2}` is obtained most often makes sense.
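As far as I know, `np.random.choice` with `replace=False` draws sequentially and renormalizes the remaining probabilities after each draw, so with the hypothetical probabilities above you can check the ~720 count by hand:

p = [0.1, 0.5, 0.4]  # hypothetical probabilities from the df above

# P(1 drawn first, then 2) + P(2 drawn first, then 1),
# renormalizing the probabilities after the first draw.
p_set_12 = p[1] * p[2] / (1 - p[1]) + p[2] * p[1] / (1 - p[2])
print(p_set_12)  # ~0.733, close to the observed 720/1000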
I saw that you revised your question after a comment saying you'll work on the one-item-at-a-time approach. I believe doing this will have an effect on the overall probabilities, but I don't know enough about probability theory to say why. If you really want it, then I assume you can combine your approach and mine to gain some time:
def weighted_budgeted_random_sample_mixed(df, budget):
    ids = []
    total = 0
    dftemp = df.copy()
    while total < budget:
        remaining = budget - total
        dftemp = dftemp[dftemp.weight <= remaining]
        # Stop if there are no records with a small enough weight.
        if dftemp.shape[0] == 0:
            break
        # New random order over the remaining rows, with renormalized probs.
        new_index = np.random.choice(dftemp.index, size=len(dftemp),
                                     p=dftemp.prob / dftemp.prob.sum(),
                                     replace=False)
        s = dftemp.reindex(new_index).weight.cumsum()
        # Select only the rows that still fit in the remaining budget.
        s = s[s <= remaining]
        total += s.max()  # last value in s, which is <= remaining
        dftemp.drop(s.index, inplace=True)
        ids += s.index.tolist()
    return ids
Now, for a comparison with your method in terms of results:
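For the comparison to be runnable on its own, here is a sketch of the one-item-at-a-time function from the question as I understand it (a reconstruction, not your exact code):

def weighted_budgeted_random_sample(df, budget):
    # Reconstructed sketch: draw one id at a time, renormalizing the
    # probabilities over the items that still fit in the remaining budget.
    ids = []
    total = 0
    dftemp = df.copy()
    while total < budget:
        remaining = budget - total
        dftemp = dftemp[dftemp.weight <= remaining]
        if dftemp.shape[0] == 0:
            break
        idx = np.random.choice(dftemp.index, p=dftemp.prob / dftemp.prob.sum())
        total += dftemp.loc[idx, 'weight']
        ids.append(idx)
        dftemp = dftemp.drop(idx)
    return ids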
# your approach
print(Counter([tuple(np.sort(weighted_budgeted_random_sample(df, 10)))
               for i in range(1000)]))
# Counter({(0, 1): 546, (0, 2): 454})

# mixed approach
print(Counter([tuple(np.sort(weighted_budgeted_random_sample_mixed(df, 10)))
               for i in range(1000)]))
# Counter({(0, 1): 554, (0, 2): 446})
As you can see, the results are quite similar, and the mixed approach should be faster on larger dataframes because it minimizes the number of iterations of the `while` loop.
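If you want to verify the speed claim yourself, a rough timing sketch on hypothetical random data (the numbers will vary with your data) could look like this:

import timeit

# Hypothetical larger dataframe, just for timing purposes.
big = pd.DataFrame({'weight': np.random.randint(1, 10, size=10000)})
big['prob'] = np.random.rand(len(big))
big['prob'] /= big['prob'].sum()

print(timeit.timeit(lambda: weighted_budgeted_random_sample(big, 500), number=10))
print(timeit.timeit(lambda: weighted_budgeted_random_sample_mixed(big, 500), number=10))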