
I have a dictionary of about 20,000 objects. The key is a string representation of the object, and the value is the object itself. Each object has attributes self.length and self.rate, where self.rate is calculated as 1.15E-8*self.length.

I need to select, with replacement, a pre-determined number (we'll say 500 for this example) of items from this dict based on their rate. Objects with a lower rate should be less likely to be chosen, and objects with a higher rate more likely.

The way I thought I could do this is very slow.

In a while loop, while the number of selected objects is less than the number of required selections, I generate a random index between 0 and the length of the dict and pick that element. Then I generate another random number, and if it is less than the rate of the chosen object, that object gets added to the selected objects. This seemed fine at first, but now I am realising it's much too slow. Does anyone have suggestions on how to do this faster?

Some code. The class definition for the object:

from numpy import random

class object():    # note: this shadows the built-in `object` (see the first answer)
    def __init__(self, length):
        self.length = length
        self.rate = 1.15E-8 * self.length

    def select(self):
        # Accept this object with probability self.rate.
        x = random.uniform(0, 1)
        return x < self.rate

And the function (in another module) that does the rest:

from numpy import random

def select_random(object_dict, maxselect):
    nselect = 0
    object_names = object_dict.keys()   # a list in Python 2, so it can be indexed
    selected_objects = []
    while nselect < maxselect:
        # numpy's randint excludes the upper bound, so this is a valid index
        x = random.randint(0, len(object_dict))
        if object_dict[object_names[x]].select():
            nselect += 1
            selected_objects.append(object_names[x])
    return selected_objects

I think what is making it really slow is that the probability of each object being chosen is so small that there need to be many iterations before even one object is chosen, let alone 500 or possibly more.

Distribution of lengths:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     51     822    1311    1770    2112  103000
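
A rough estimate of the cost, assuming rate = 1.15E-8 * length and the mean length of 1770 above:

# Back-of-envelope: expected draws per accepted object is 1/mean_rate.
mean_rate = 1.15e-8 * 1770            # ~2.0e-05 acceptance chance per draw
draws_per_accept = 1 / mean_rate      # ~49,000 draws to accept one object
total_draws = 500 * draws_per_accept  # ~25 million draws for 500 selections
print(total_draws)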
Davy Kavanagh
  • Can you describe the objects' length distribution? If most of the objects' lengths are below 43478260.86956522, then it will take a long time to choose a big number of such objects. Additionally, can you estimate the number of objects? – Michael Aug 01 '12 at 11:13
  • Give some sample objects. Also, please explain more about what you mean when you say it should be random, but not so random; it seems you want the selection to only pick items that randomly score higher than a random number, which is itself random, so there is no control over the randomness, just luck. – Inbar Rose Aug 01 '12 at 11:14
  • You should introduce a third attribute to `object`, calculate its random value based on `length`, then sort the objects according to the attribute value and take the first `n` (`objects[:n]`) – wroniasty Aug 01 '12 at 11:17
  • Would the answers to my question help you? http://stackoverflow.com/questions/2570690/python-algorithm-to-randomly-select-a-key-based-on-proportionality-weight – Mathieu Dhondt Aug 01 '12 at 11:20
  • @Michael The number of objects is 19070, exactly. – Davy Kavanagh Aug 01 '12 at 11:37
  • @Inbar Rose The objects are selected on the basis of their rate, which is proportional to their length. If the generated random numbers are uniform, then they will be less likely to fall below the rate of objects with a smaller rate. No? – Davy Kavanagh Aug 01 '12 at 11:37

5 Answers

2

Try this:

import numpy as np    # requires NumPy 1.7 (!)

def select_random(object_dict, n):
    keys = list(object_dict.keys())
    # Weight each key by its object's rate, normalised into probabilities.
    rates = np.array([object_dict[k].rate for k in keys])
    prob = rates / rates.sum()
    return np.random.choice(keys, size=n, replace=True, p=prob)

(See the numpy.random.choice documentation.)

P.S.: it's a bad idea to call a class `object`, since that's also the name of the built-in universal base class.
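
If NumPy 1.7 is not available (see the comments), here is a minimal sketch of the same weighted draw using only np.cumsum and np.searchsorted, which exist in older releases (`select_random_cumsum` is just an illustrative name):

import numpy as np

def select_random_cumsum(object_dict, n):
    # Map uniform draws in [0, total_rate) onto the cumulative rate
    # array with a binary search; key i is chosen with probability
    # rate[i] / total_rate, as with np.random.choice above.
    keys = list(object_dict.keys())
    cum = np.cumsum([object_dict[k].rate for k in keys])
    draws = np.random.uniform(0, cum[-1], size=n)
    return [keys[i] for i in np.searchsorted(cum, draws)]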

Fred Foo
  • I didn't actually call the class `object`; that was just for the example here. But I didn't know that. Thanks! – Davy Kavanagh Aug 01 '12 at 11:28
  • I gather 1.7 is a development version, and I have tried to install it on my Mac, but to no avail. There don't seem to be any decent instructions for stupid people (me) to help one through this. Is there a base Python function that can do this? Sorry to be a pain. – Davy Kavanagh Aug 01 '12 at 12:24
  • @DavyKavanagh: no, but there are the answers to [this question](http://stackoverflow.com/questions/352670/weighted-random-selection-with-and-without-replacement). (I understand the NumPy issues, I've contributed tidbits to that project but have to think really hard when installing it too.) – Fred Foo Aug 01 '12 at 12:26
  • Thanks for the link though. It led to a few different websites where I found [this](http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/). Hopefully it works OK. The logic seems sound but I will have to test it first. – Davy Kavanagh Aug 01 '12 at 15:17
1

I don't know if this method would be faster, but it would be more accurate:

  1. Do a cumulative sum of the lengths and save it to a list called cumsum.
  2. Assuming the lengths are integers (otherwise you would have to normalize and choose a number between 0 and 1), choose a random number between 0 and the last element of cumsum.
  3. Go over cumsum and take the index of the first element that is greater than or equal to the number you chose.
  4. Go to step 2 to choose another number.

Let's say the lengths are [1,4,2,10,5]; then cumsum would be [1,5,7,17,22]. Now you randomly choose a number between 0 and 22: you would get element i with probability lengths[i]/cumsum[-1], which sounds more accurate to me.
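
A minimal sketch of those steps with the standard library's bisect module (the names here are illustrative):

import random
from bisect import bisect_left

def weighted_indices(lengths, n):
    # Step 1: running totals of the lengths.
    cumsum = []
    total = 0
    for length in lengths:
        total += length
        cumsum.append(total)
    # Steps 2-4: draw in [0, total) and binary-search for the first
    # cumulative total that is >= the draw.
    return [bisect_left(cumsum, random.uniform(0, total)) for _ in range(n)]

print(weighted_indices([1, 4, 2, 10, 5], 10))  # e.g. [3, 1, 3, 4, 3, 3, 0, 3, 2, 3]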

zenpoy
1

By incrementally summing up the weights of the items, you can pick one at random according to the weights: choose a number uniformly in [0, T), where T is the total of all the weights, and take the first item whose running total is larger than that (for example, by binary chop). If you want a larger sample, you can either repeat this or, as this code does, sort the random numbers and do a merge-sort-like pass. The complexity is the same, but I think the code is a bit simpler, as binary chop is always error-prone.

import random

def accumulate_weights(weighted_items):
    # Yield (running total, item) pairs; the final total is T.
    T = 0.0
    for w, i in weighted_items:
        T += w
        yield (T, i)

def sample_weighted(weighted_items, n):
    cumulative = list(accumulate_weights(weighted_items))
    T = cumulative[-1][0]
    i = 0
    # Sorting the samples lets one pointer sweep the cumulative totals
    # once (the merge-sort-like pass), instead of a binary chop per sample.
    for sample in sorted(random.uniform(0, T) for _ in xrange(n)):
        while sample > cumulative[i][0]:
            i += 1
        yield cumulative[i][1]

r = list(sample_weighted([(1.0, 'a'), (2.0, 'b'), (5.0, 'c'), (1.0, 'd')], 10000))
print [(x, r.count(x)) for x in 'abcd']

In case it's not obvious, you can use your 'rates' as weights. When you have one object with a rate of 0.15 and another of 0.3, all that matters is that the second appears twice as often as the first. That's what the weights do in this code!
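
To apply this to the dict in the question, the weighted pairs might be built along these lines (assuming the object_dict from the post):

weighted_items = [(obj.rate, name) for name, obj in object_dict.items()]
selected_names = list(sample_weighted(weighted_items, 500))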

Paul Hankin
0

Your rates are between 5.865e-07 and 0.0011845, and your uniform random selection is between 0 and 1. I believe you will be lucky to select 500 objects at all, given that the median length is only 1311.

You need to change your random selection

x = random.uniform(0,1)

to be

import random
x = random.triangular(51, 103000, 1311)
Michael
-2

If you need enough objects, you could write the select function this way:

from random import randint

def select(self):
    x = randint(0, self.length)
    if x > self.length - c:
        return False
    return True

This way the probability will depend on the constant c and the length (which reflects the rate).
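
As a quick empirical check of that dependence, with a hypothetical c = 10:

from random import randint

c = 10  # hypothetical value for the constant

def acceptance_rate(length, trials=100000):
    # Fraction of draws for which select() above would return True.
    return sum(randint(0, length) <= length - c for _ in range(trials)) / float(trials)

print(acceptance_rate(100), acceptance_rate(1000))  # ~0.90 vs ~0.99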

elyashiv