
I have to create a class that stores events and their probability of happening. I'm using a dictionary with the event as the key and the number of times that event occurs as the value. From that, I can easily find the probability of an event.

from random import randint

class Distribution:

    def __init__(self):
        self._d = {}        # event -> number of occurrences
        self._events = 0    # total number of occurrences

    def add(self, e, multiplicity=1):
        self._d[e] = self._d.get(e, 0) + multiplicity
        self._events += multiplicity

    def count(self, e):
        return self._d[e]

    def prob(self, e):
        return self._d.get(e, 0) / self._events

    def sample(self):
        # draw a number in [1, total] and walk the keys,
        # subtracting counts until the draw is used up
        r = randint(1, self._events)
        for key in self._d:
            r -= self._d[key]
            if r <= 0:
                return key

    def __len__(self):
        return len(self._d)

d = Distribution()
d.add('a')
d.add('a')
d.add('a')
d.add('b')
d.add('b')
d.add('c')

d.prob('a') #returns 1/2
d.prob('b') #returns 1/3

d.sample() #returns a random event, based on the probability associated with that event

Now, I have to optimize the sample function so it runs in O(logn) time. It can run in O(nlogn) the first time it runs after an event is added to the distribution. I can't think of any way to get it down to O(logn). Normally I associate logn with binary search, but I can't see how that applies here.

  • It would greatly help your question if a) your code were correctly indented so it can be clearly understood as Python and copied/pasted to test, and b) you include code to instantiate and test your class (which has to be instantiated to have an O(anything) performance) to demonstrate the results you currently get, and c) you indicate what you have actually tried - the research you have done - why your current code is O(n), etc. – DisappointedByUnaccountableMod Nov 06 '18 at 22:03
  • If you're allowed O(nlogn) after an insert, you should consider sorting or binary tree operations. You could build a tree, where each leaf is an event, and each node sums the probability of the leaves in its left and right subtrees. Or you could calculate an array with cumulative probabilities, (eg [0.2,0.3,0.4,0.1] has the cdf of [0.2,0.5,0.9,1.0]) and binary search in that array for a `random()` float, using the index as the key to an event in a list. –  Nov 06 '18 at 22:28
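
For reference, a minimal sketch of the cumulative-sum idea described in that second comment, using Python's bisect module (the name cdf_sample and its argument are illustrative; building the cumulative array is O(n), the lookup itself is O(logn)):

from bisect import bisect_right
from random import randrange

def cdf_sample(counts):
    # counts: a dict mapping event -> number of occurrences, e.g. Distribution._d
    events = list(counts)
    cumulative = []                  # running totals, e.g. counts 3, 2, 1 -> [3, 5, 6]
    total = 0
    for e in events:
        total += counts[e]
        cumulative.append(total)
    r = randrange(total)             # integer in [0, total)
    return events[bisect_right(cumulative, r)]   # O(logn) binary search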

1 Answer


Analyzing the current method

Let's consider the worst-case scenario for your situation, which is:

"Distribution of N events with the same probability p = 1/N like in situation of extraction a specific number from a basket of N numbers".

In that case self._d is populated with N keys, each assigned the value 1, and self._events is N as well.
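
For instance, a few lines like these build exactly that worst case (the size N and the variable name are just illustrative choices):

N = 1000                 # illustrative size
worst = Distribution()
for i in range(N):
    worst.add(i)         # every key ends up with value 1, and worst._events == N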

With this in mind, and calling n the size of our dictionary, let's look at your sample() method.
Its cost is that of "generating a random integer indicating an occurrence of an event" plus "looping over every key to find one matching that value".
Assuming the looping costs far more than generating the random number, let's ignore the cost of the random-number generation for now.
In the worst case your loop has to look at every key before returning one, because r was assigned the value N, so the cost is O(n*O(self._d[key])), where the cost of retrieving a value from a plain dictionary is, based on this source, O(n) in the worst case.

In the end, your function is O(n*O(n)) = O(n^2) in that worst case, while when the retrieval goes well the final cost is O(n*O(1)) = O(n). On a dict implementation where retrieving costs O(logn), the final cost would be O(nlogn), as you said.

Possible solution

Following the previous reasoning, if we find a dictionary implementation in Python with a constant O(1) cost for retrieving a value by key, the method's cost drops to O(n), which is more efficient than the O(n^2) you have. That is the only way I can think of to speed up your function as it stands, but it will never be O(logn), because of r: in the worst case we loop over every key before returning one.

As an example, let's say that after some insertions we have the dictionary d1 = {"a":1, "b":1, "c":1} and randint() assigns r = 3. What happens is that we take a key, say b, and subtract its value, leaving r = 2, which doesn't pass the if condition; neither does the next key, but the final one does. Thus, with a big dictionary shaped like d1, you'll loop over all n elements.

However, if you want sample() to return an event whose value matches the first random r that you generate, then I have a solution that uses binary search. For this, let's use two supporting Python lists: one to hold the keys you insert, which I'll call labels, and another for the values, which I'll call data. Order the data list (and labels along with it), so that the components of each (key, value) pair of the dictionary sit at the same position in the two lists, then use a binary search to find r in O(logn) and, with the position found, return the corresponding key from the labels list. Below is my code, which needs the operator module imported to work (a tip on how to order a dictionary's output by value is here).

    def fastSample(self):
        # needs: import operator, from random import choice
        # supporting data structures
        labels = []   # will contain the keys
        data = []     # will contain the values

        # retrieve the (key, value) pairs of the dictionary, ordered by value
        ordered_pairs = sorted(self._d.items(), key=operator.itemgetter(1))

        # having the list of pairs ordered by value, split it into two lists
        for pair in ordered_pairs:
            labels.append(pair[0])
            data.append(pair[1])

        r = choice(data)                # a random number among the stored values
        index = binarySearch(data, r)   # binary search for r in the sorted values
        print(index)

        return labels[index]

This solution finds a key matching the random r that we generate, but unlike before we now need to be sure that the generated number actually is one of the values in our dictionary. That is why it uses random.choice(), which picks a number at random from the data list, i.e. from the values of our dictionary. The last thing to note is that sorted() has a cost of its own: Python's sort is O(nlogn) in general (O(n) at best on already-sorted input; see the costs of sorting algorithms here), and since that is more expensive than the binary search, the cost of fastSample() is dominated by the sorting.
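
The binarySearch() helper called above is not shown here; a minimal sketch of what it is assumed to be is a classic midpoint search over the sorted data list, which is also why, when every value is equal, it returns the element in the middle of the list (as in the outputs further below).

def binarySearch(data, r):
    # classic midpoint binary search over a sorted list;
    # returns an index i such that data[i] == r (not necessarily the first one)
    low, high = 0, len(data) - 1
    while low <= high:
        mid = (low + high) // 2
        if data[mid] == r:
            return mid
        elif data[mid] < r:
            low = mid + 1
        else:
            high = mid - 1
    return -1   # not found (cannot happen when r is drawn from data itself)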

It is easy to improve from here, however: just move the two lists out of the method and make them instance variables initialized in __init__ of the class. Then, whenever an event is added, the lists must be updated so that they always stay ordered.
This last version needs more memory, but we no longer have to sort before calling the binary search, and fastSample() costs O(logn) as you wanted. The only issue, depending on the case, is that when every key has the same value the binary search will always return the element in the center of the list.
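
A rough sketch of that improvement, assuming a hypothetical subclass that rebuilds the two ordered lists inside add() (paying the O(nlogn) there) and reuses the binarySearch() helper sketched above:

import operator
from random import choice

class FastDistribution(Distribution):
    # hypothetical variant that keeps the two parallel lists as instance state

    def __init__(self):
        super().__init__()
        self._labels = []   # keys, aligned position by position with self._data
        self._data = []     # counts, kept in ascending order

    def add(self, e, multiplicity=1):
        super().add(e, multiplicity)
        # keep the lists ordered on every insertion, so sampling never has to sort
        pairs = sorted(self._d.items(), key=operator.itemgetter(1))
        self._labels = [k for k, _ in pairs]
        self._data = [v for _, v in pairs]

    def fastSample(self):
        r = choice(self._data)               # one of the stored counts
        index = binarySearch(self._data, r)  # the midpoint search above, O(logn)
        return self._labels[index]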

Some outputs

Below are some outputs produced while using fastSample(), to show its results.

labels: ['e', 'f', 'h', 'c', 'a', 'b']
data: [1, 1, 1, 2, 6, 7]
r = 7
Lucky one is: b

labels: ['e', 'f', 'h', 'c', 'a', 'b']
data: [1, 1, 1, 2, 6, 7]
r = 1
Lucky one is: h

labels: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q']
data: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
r = 1
Lucky one is: h
Iulian