Random Python dictionary key, weighted by values

Question

I have a dictionary where each key has a list of variable length, eg:

d = {
 'a': [1, 3, 2],
 'b': [6],
 'c': [0, 0]
}

Is there a clean way to get a random dictionary key, weighted by the length of its value? random.choice(d.keys()) will weight the keys equally, but in the case above I want 'a' to be returned roughly half the time.

Possible duplicate of [Weighted choice short and simple](http://stackoverflow.com/questions/10803135/weighted-choice-short-and-simple) — Jim G., Apr 12 '16 at 11:07

sth · Accepted Answer · 2009-06-29T01:01:28.620

35

This would work:

random.choice([k for k in d for x in d[k]])

edited Jun 29 '09 at 01:01

answered Jun 29 '09 at 00:48

sth

222,467
53
283
367

16

Python is the bomb diggity. – FogleBird Jun 29 '09 at 01:12
8

This has the same problem as David Seiler's answer. It'll use a lot of memory constructing that temporary list. – John Kugelman Jun 29 '09 at 02:08
I like this solution a lot, it's very concise. – James Thompson Jun 29 '09 at 07:25
You can avoid indexing d with k by using iteritems(): `random.choice([key for key, values in d.iteritems() for x in values])`. – hughdbrown Apr 09 '11 at 21:37

James Thompson · Answer 2 · 2009-06-29T07:10:27.210

Do you always know the total number of values in the dictionary? If so, this might be easy to do with the following algorithm, which can be used whenever you want to make a probabilistic selection of some items from an ordered list:

Iterate over your list of keys.
Generate a uniformly distributed random value between 0 and 1 (aka "roll the dice").
Assuming that this key has N_VALS values associated with it and there are TOTAL_VALS total values in the entire dictionary, accept this key with a probability N_VALS / N_REMAINING, where N_REMAINING is the number of items left in the list.

This algorithm has the advantage of not having to generate any new lists, which is important if your dictionary is large. Your program is only paying for the loop over K keys to calculate the total, a another loop over the keys which will on average end halfway through, and whatever it costs to generate a random number between 0 and 1. Generating such a random number is a very common application in programming, so most languages have a fast implementation of such a function. In Python the random number generator a C implementation of the Mersenne Twister algorithm, which should be very fast. Additionally, the documentation claims that this implementation is thread-safe.

Here's the code. I'm sure that you can clean it up if you'd like to use more Pythonic features:

#!/usr/bin/python

import random

def select_weighted( d ):
   # calculate total
   total = 0
   for key in d:
      total = total + len(d[key])
   accept_prob = float( 1.0 / total )

   # pick a weighted value from d
   n_seen = 0
   for key in d:
      current_key = key
      for val in d[key]:
         dice_roll = random.random()
         accept_prob = float( 1.0 / ( total - n_seen ) )
         n_seen = n_seen + 1
         if dice_roll <= accept_prob:
            return current_key

dict = {
   'a': [1, 3, 2],
   'b': [6],
   'c': [0, 0]
}

counts = {}
for key in dict:
   counts[key] = 0

for s in range(1,100000):
   k = select_weighted(dict)
   counts[k] = counts[k] + 1

print counts

After running this 100 times, I get select keys this number of times:

{'a': 49801, 'c': 33548, 'b': 16650}

Those are fairly close to your expected values of:

{'a': 0.5, 'c': 0.33333333333333331, 'b': 0.16666666666666666}

Edit: Miles pointed out a serious error in my original implementation, which has since been corrected. Sorry about that!

There are some pythonisms you could insert in there, but on the whole I like this approach. Good Work. — sykora, Jun 29 '09 at 02:34
You actually don't need to know the total number of values in the dictionary if using a "reservoir sampling" approach. See http://stackoverflow.com/questions/321637/rosetta-stone-reservoir-random-sampling-algorithm or http://www.cs.umd.edu/~samir/498/vitter.pdf — , Jun 29 '09 at 11:59
This line doesn't do anything: accept_prob = float( 1.0 / total ) — Draemon, Feb 18 '10 at 18:12
the only place it's used is here: 'if dice_roll <= accept_prob:' and two lines above is 'accept_prob = float( 1.0 / ( total - n_seen ) )' so the value of the first assignment is always overwritten. — Draemon, Feb 23 '10 at 10:59

score 9 · Answer 3 · answered Feb 23 '10 at 20:28

9

Without constructing a new, possibly big list with repeated values:

def select_weighted(d):
   offset = random.randint(0, sum(d.itervalues())-1)
   for k, v in d.iteritems():
      if offset < v:
         return k
      offset -= v

answered Feb 23 '10 at 20:28

sth

222,467
53
283
367

I'm using a similar situation for an app I'm writing where performance of this piece matters. This seems to be the most efficient solution. – Gattster Jul 08 '11 at 03:37
For anyone else using this: for Python 3, change `itervalues()` to `values()`, and `iteritems()` to `items()`. – Julia Ebert Jul 03 '20 at 15:32

score 6 · Answer 4 · answered Jun 29 '09 at 01:17

6

Given that your dict fits in memory, the random.choice method should be reasonable. But assuming otherwise, the next technique is to use a list of increasing weights, and use bisect to find a randomly chosen weight.

>>> import random, bisect
>>> items, total = [], 0
>>> for key, value in d.items():
        total += len(value)
        items.append((total, key))


>>> items[bisect.bisect_left(items, (random.randint(1, total),))][1]
'a'
>>> items[bisect.bisect_left(items, (random.randint(1, total),))][1]
'c'

answered Jun 29 '09 at 01:17

A. Coady

54,452
8
34
40

Is it possible to have a dictionary in Python that doesn't fit into memory, like a tied Perl hash? That's interesting, but I don't know exactly what you mean. – James Thompson Jun 29 '09 at 02:21
the dictionary will fit in memory, but this script will be running on a web server so I want to minimize the memory use – hoju Jun 29 '09 at 04:05
+1: This is the fastest and most efficient solution; if you pre-compute the "items" array, it can make a weighted random choice in O(log |d|) time – Miles Jun 29 '09 at 05:53
To JT and R: The dict isn't composed of counts, it already has enumerated values. Therefore the memory usage of an enumerated list of keys is bound by (and likely much smaller than) the dict itself. So I was merely trying to address the generally valid memory concern, while pointing out it's likely not an issue in this specific case. – A. Coady Jul 17 '09 at 19:00

David Seiler · Answer 5 · 2009-06-29T01:10:22.470

3

Make a list in which each key is repeated a number of times equal to the length of its value. In your example: ['a', 'a', 'a', 'b', 'c', 'c']. Then use random.choice().

Edit: or, less elegantly but more efficiently, try this: take the sum of the lengths of all values in the dictionary, S (you can cache and invalidate this value, or keep it up to date as you edit the dictionary, depending on the exact usage pattern you anticipate). Generate a random number from 0 to S, and do a linear search through the dictionary keys to find the range into which your random number falls.

I think that's the best you can do without changing or adding to your data representation.

edited Jun 29 '09 at 01:10

answered Jun 29 '09 at 00:48

David Seiler

9,557
2
31
39

My dictionaries are potentially huge so creating a new list would be expensive. Is there a cleaner way? – hoju Jun 29 '09 at 00:57
1

this doesn't seem like a good idea because it could potentially make a huge set of data – Nope Jun 29 '09 at 01:16

score 1 · Answer 6 · edited May 23 '17 at 12:16

Here is some code that is based on a previous answer I gave for probability distribution in python but is using the length to set the weight. It uses an iterative markov chain so that it does not need to know what the total of all of the weights are. Currently it calculates the max length but if that is too slow just change

  self._maxw = 1

to

  self._maxw = max lenght

and remove

for k in self._odata:
     if len(self._odata[k])> self._maxw:
          self._maxw=len(self._odata[k])

Here is the code.

import random


class RandomDict:
    """
    The weight is the length of each object in the dict.
    """

    def __init__(self,odict,n=0):
        self._odata = odict
        self._keys = list(odict.keys())
        self._maxw = 1  # to increase speed set me to max length
        self._len=len(odict)
        if n==0:
            self._n=self._len
        else:
            self._n=n
        # to increase speed set above max value and comment out next 3 lines
        for k in self._odata:
            if len(self._odata[k])> self._maxw:
                self._maxw=len(self._odata[k])


    def __iter__(self):
        return self.next()

    def next(self):
        while (self._len > 0) and (self._n>0):
            self._n -= 1
            for i in range(100):
                k=random.choice(self._keys)
                rx=random.uniform(0,self._maxw)
                if rx <= len(self._odata[k]): # test to see if that is the value we want
                    break
            # if you do not find one after 100 tries then just get a random one
            yield k

    def GetRdnKey(self):
        for i in range(100):
            k=random.choice(self._keys)
            rx=random.uniform(0,self._maxw)

            if rx <= len(self._odata[k]): # test to see if that is the value we want
                break
        # if you do not find one after 100 tries then just get a random one
        return k



#test code

d = {
 'a': [1, 3, 2],
 'b': [6],
 'c': [0, 0]
}


rd=RandomDict(d)

dc = {
 'a': 0,
 'b': 0,
 'c': 0
}
for i in range(100000):
    k=rd.GetRdnKey()
    dc[k]+=1

print("Key count=",dc)



#iterate over the objects

dc = {
 'a': 0,
 'b': 0,
 'c': 0
}

for k in RandomDict(d,100000):
    dc[k]+=1

print("Key count=",dc)

Test results

Key count= {'a': 50181, 'c': 33363, 'b': 16456}
Key count= {'a': 50080, 'c': 33411, 'b': 16509}

hughdbrown · Answer 7 · 2011-04-09T21:40:59.657

1

I'd say this:

random.choice("".join([k * len(d[k]) for k in d]))

This makes it clear that each k in d gets as many chances as the length of its value. Of course, it is relying on dictionary keys of length 1 that are characters....

Much later:

table = "".join([key * len(value) for key, value in d.iteritems()])
random.choice(table)

edited Apr 09 '11 at 21:40

answered Jul 17 '09 at 18:31

hughdbrown

47,733
20
85
108

score 0 · Answer 8 · answered Dec 03 '19 at 19:31

import numpy as np

my_dict = {
  "one": 5,
  "two": 1,
  "three": 25,
  "four": 14
}

probs = []

elements = [my_dict[x] for x in my_dict.keys()]
total = sum(elements)
probs[:] = [x / total for x in elements]
r = np.random.choice(len(my_dict), p=probs)

print(list(my_dict.values())[r])
# 25

Gattster · Answer 9 · 2011-07-08T04:50:55.243

I modified some of the other answers to come up with this. It's a bit more configurable. It takes 2 arguments, a list and a lambda function to tell it how to generate a key.

def select_weighted(lst, weight):
   """ Usage: select_weighted([0,1,10], weight=lambda x: x) """
   thesum = sum([weight(x) for x in lst])
   if thesum == 0:
      return random.choice(lst)
   offset = random.randint(0, thesum - 1)

   for k in lst:
      v = weight(k)
      if offset < v:
         return k
      offset -= v

Thanks to sth for the base code for this.

score 0 · Answer 10 · answered Mar 09 '21 at 19:51

Need to mention random.choices for Python 3.6+:

import random
raffle_dict = {"Person 1": [1,2], "Person 2": [1]}
random.choices(list(raffle_dict.keys()), [len(w[1]) for w in raffle_dict.items()], k=1)[0]

random.choices returns a list of samples, so k=1 if you only need one and we'll take the first item in the list. If your dictionary already has the weights, just get rid of the len or better yet:

raffle_dict = {"Person 1": 1, "Person 2": 10}
random.choices(list(raffle_dict.keys()), raffle_dict.values(), k=1)[0]

See also this question and this tutorial,

Note that this relies on the assumption that `keys()` and `values()` follow the same ordering of all items. Although this probably also works for most older versions of Python, [this is only officially required since Python 3.7](https://docs.python.org/3/whatsnew/3.7.html). — PattuX, Feb 06 '23 at 09:36

Random Python dictionary key, weighted by values

10 Answers10

Linked

Related