3

Foreword

It looks like it is a duplicate of few stackoverflow question but my situation is (probably) slightly unique.

My situation

I have a dictionary. The key is a string and the value is an integer.

I want the python script to randomly choose N number of keys.

The value is how likely it is to be chosen. The higher the key's value is, the higher the chance the key will be randomly chosen.

My solution

So using some other StackOverflow post and the power of the internet I managed to solve it using Weighted Random.

DICT_VAR= {'best':308281009, 'good':7066325, 'meh':26884, 'bad':71, 'terrible':16, 'never':0}

list_var = []
for i in DICT_VAR.keys():
    list_var.extend([i]*DICT_VAR[i])

print random.sample(list_var, 2) # get 2 random choice I suppose

The issue (the catch)

As you may notice, the value in the dictionary can be incredibly big (It can be unlimitedly big) and it can also be as small as 0 (zero is smallest, there is no negative number).

Running this code (with a little bigger numbers) resulted in my computer to freeze and be unresponsive until I hard reset it.

My question

How should I deal with the situation? Is there any other way of randomly choosing that is suitable for my situation since Weighted Random is the worst possible solution to this current case.

Programer Beginner
  • 1,377
  • 6
  • 21
  • 47

2 Answers2

5

I will assume here that a value of 0 means the key should never be chosen, the keys may be repeated in the sample (in the dictionary is irrelevant), and we may use a third-party module--numpy in this case. Here is code tested in Python 3.6.4 but I modified it so it should run in Python 2.7, but I can't test it that way.

DICT_VAR= {'best':308281009, 'good':7066325, 'meh':26884, 'bad':71,
           'terrible':16, 'never':0}

import numpy as np

keys, weights = zip(*DICT_VAR.items())
probs = np.array(weights, dtype=float) / float(sum(weights))
sample_np = np.random.choice(keys, 2, p=probs)
sample = [str(val) for val in sample_np]

Then sample holds your sample as a list of key strings. Note that your weight for key 'best' is so much larger than the other weights that your sample will almost always be ['best', 'best'].

To explain my code: first split the dictionary's keys (strings) and values (weights) into separate lists. Then change the weights to probabilities--larger weights give larger probabilities, a zero weight gives a zero probability. Then use numpy's choice function to choose a sample of keys using the probabilities as weights. The result is a numpy array, but you seem to want a standard Python list, so the final line converts the sample of keys into a standard list.

There is, of course, a fairly short routine that could be written in standard Python, so we could avoid the use of numpy. But it would most probably be slower.

The reason your routine was slow is that it builds a large list, with each key repeated the number of times given by its value, then a sample is chosen with uniform probability. With your sample data, that means building a huge list, much larger than your available RAM, and that takes much time. Numpy's choice routine can handle a non-uniform random distribution directly, without building another list.

Rory Daulton
  • 21,934
  • 6
  • 42
  • 50
  • Yes, your assumptions are mostly correct. If the value is 0, the key is never chosen. The keys are unique but I assume that does not influence your code right? And yes, as long as the module is supported on a Windows machine and works for Python 2.7 (and preferably is able to be installed using pip) than it's fine. – Programer Beginner Aug 12 '18 at 22:56
  • I am running Python 3.6.4 and do not have access to a Python 2.7 installation. Numpy can run in 2.7 but obviously something is different, either in numpy or in my code. (I think I figured it out--the division--try my new code.) By "the keys may be repeated" I mean that a given key may appear more than once in the samples. Whether or not is is unique in the input dictionary is irrelevant. – Rory Daulton Aug 12 '18 at 23:19
1

In Py 3.6 this is part of the standard library, with random.choices():

In []:
import random
random.choices(list(DICT_VAR.keys()), DICT_VAR.values(), k=2)

Out[]:
['best', 'best']

Or slightly more arcanely:

random.choices(*zip(*DICT_VAR.items()), k=2)
AChampion
  • 29,683
  • 4
  • 59
  • 75