Analyzing the current method
Let's consider the worst-case scenario for your situation, which is:
"distributing N events with the same probability p = 1/N, as when drawing a specific number from a basket of N numbers".
So self._d is populated with N keys, each assigned a value of 1, and self.events is N as well.
With this in mind, and calling n the size of our dictionary, let's look at your sample() method.
Its cost is that of "generating a random integer indicating an occurrence of an event" plus "looping over every key to find the one with a specific value".
Assuming the cost of looping is much bigger than that of generating a random number, let's overlook the latter for now.
Your loop needs, in the worst case, to look at every key before returning one, because r was assigned N as its value, so the loop costs O(n*O(self._d[key])), where the cost of retrieving a value from a plain dictionary is, based on this source, O(n) in the worst case.
In the end your function will be O(n*O(n)) = O(n^2), while when the retrieval goes well you will have O(n*O(1)) = O(n) as the final cost. On a dict implementation whose retrieval costs O(logn), the final cost will, as you said, be O(nlogn).
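For reference, here is a minimal sketch of the kind of sample() method this analysis assumes; the real one is in your question, so the exact names and loop structure here are my assumptions based on the description above.

from random import randint

def sample(self):
    # r points at one of the self.events recorded occurrences
    r = randint(1, self.events)
    # walk the keys, consuming r until we reach the key
    # whose occurrences cover it
    for key in self._d:
        r -= self._d[key]
        if r <= 0:
            return key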
Possible solution
Following the previous reasoning, if we find a dictionary implementation in Python with a constant O(1) cost for retrieving a value by key, that would reduce the method's cost to O(n), which is more efficient than the O(n^2) you have.
This is the only way I can think of to speed up your function as it stands, but it will never be O(logn), because depending on r we may loop over every key before returning one.
As an example, let's say we have this dictionary after some insertions: d1 = {"a":1, "b":1, "c":1}, and randint() assigns r=3. What happens is that we take a key, maybe b, and subtract its value, leaving r=2, which doesn't pass the if condition; the same goes for the next key, but the final one passes. Thus with a big dictionary like d1 you'll loop over n elements.
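To make that walkthrough concrete, here is a minimal self-contained trace with r pinned to 3 instead of coming from randint(), so the run is reproducible (note that in Python 3.7+ dicts iterate in insertion order, so the keys are visited as a, b, c):

d1 = {"a": 1, "b": 1, "c": 1}
r = 3
for key in d1:
    r -= d1[key]                 # consume this key's occurrences
    if r <= 0:                   # the if condition from sample()
        print("returned:", key)  # prints: returned: c
        break
# all n keys were visited before one was returned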
However, if you want sample() to return an event whose value matches the first random r you generate, then I have a solution that uses binary search.
For this, let's use as supporting structures two Python lists: one to hold the keys you insert, which I'll call labels for now, and another for the values, which I'll call data. By sorting the data list, and the labels list along with it so that each dictionary (key, value) pair sits at the same position in the two lists, we can use binary search to find r in O(logn) and, with the position found, return the corresponding key from the labels list.
Following is my code, which needs the operator module (and random.choice) imported to work; a tip on how to order the output of a dictionary by value is here.
import operator
from random import choice

def fastSample(self):
    # supporting data structures
    labels = []  # to contain the keys
    data = []    # to contain the values
    # retrieve the (key, value) pairs of the dictionary,
    # ordered by value
    ordered_pairs = sorted(self._d.items(), key=operator.itemgetter(1))
    # having our list of pairs ordered by value,
    # split it into the two parallel lists
    for pair in ordered_pairs:
        labels.append(pair[0])
        data.append(pair[1])
    r = choice(data)  # take a random number among the values actually present
    index = binarySearch(data, r)  # binarySearch is sketched below
    return labels[index]
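binarySearch() is not a Python built-in, so here is a minimal sketch of the classic midpoint binary search the code above assumes; it reproduces the outputs shown further below. (The standard library's bisect.bisect_left would also work, but it returns the leftmost match instead of the middle one.)

def binarySearch(data, target):
    # classic iterative binary search over a sorted list: returns
    # the index of an occurrence of target (the first midpoint
    # that hits it), or -1 if target is absent
    low, high = 0, len(data) - 1
    while low <= high:
        mid = (low + high) // 2
        if data[mid] == target:
            return mid
        elif data[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1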
This solution is capable of finding a key matching the random r we generate, but unlike before we now need to be sure that the number drawn is actually one of the values in our dictionary. That is why random.choice() is used: it randomly picks a number from our data list, which contains exactly the values of our dictionary.
The last thing is the cost of the sorted() function: Python's sorted() uses Timsort, which is O(nlogn) in the worst case and O(n) at best on already-ordered input (see the costs of sorting algorithms here), and since that is more expensive than the search we are using, the cost of fastSample() will be dominated by the sorting.
It is easy, however, to improve from here: we only need to move the two lists out of the method and make them instance variables in the __init__ of the class. Then, whenever we add an event, we must update the lists so they always stay ordered.
This last solution needs more memory, but this way we don't have to sort before every binary search, and fastSample() will cost O(logn) like you wanted. The only problem, depending on the case, is that with the same value for every key binary search will return the element in the middle of the list. A sketch of this idea follows.
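Here is a minimal sketch of that idea; the class name, the add_event() method and its update logic are my assumptions, and bisect_left from the standard library stands in for the handwritten binarySearch (so it returns the leftmost match rather than the middle one):

from bisect import bisect_left
from random import choice

class FastSampler:
    def __init__(self):
        self._d = {}
        self.events = 0
        self._labels = []  # keys, aligned position by position with _data
        self._data = []    # values, kept sorted across insertions

    def add_event(self, key):  # hypothetical insertion method
        old = self._d.get(key, 0)
        self._d[key] = old + 1
        self.events += 1
        if old:
            # the key is already in the lists: drop its old slot first
            i = self._labels.index(key)
            del self._data[i]
            del self._labels[i]
        # insert the new value (and its key) at its sorted position;
        # list insertion is O(n), but it is paid once per event
        # instead of sorting on every sample
        pos = bisect_left(self._data, old + 1)
        self._data.insert(pos, old + 1)
        self._labels.insert(pos, key)

    def fastSample(self):
        # choice() is O(1) and bisect_left is O(logn), so sampling
        # no longer pays the sorting cost
        r = choice(self._data)
        return self._labels[bisect_left(self._data, r)]

sampler = FastSampler()
for k in "aabbbc":
    sampler.add_event(k)
print(sampler.fastSample())  # prints one of a, b, c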
Some outputs
Below are some outputs produced while using fastSample(), to show its results.
labels: ['e', 'f', 'h', 'c', 'a', 'b']
data: [1, 1, 1, 2, 6, 7]
r = 7
Lucky one is: b
labels: ['e', 'f', 'h', 'c', 'a', 'b']
data: [1, 1, 1, 2, 6, 7]
r = 1
Lucky one is: h
labels: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q']
data: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
r = 1
Lucky one is: h