6

I'm creating a python script that randomly picks 1000 names from the list of male first names located here: http://www.census.gov/genealogy/www/data/1990surnames/names_files.html

That works all fine and dandy, but I would like it so that the names were selected based on the probability column provided by the census text files (second column).

I've been trying to wrap my head around this for the past few hours, but I haven't made any real progress, even looking for other answers.

Can anybody help me out or point me in the right direction? Thanks in advance :)

IrateIrish
  • 219
  • 2
  • 6
  • 12
  • 1
    This might be helpful - http://stackoverflow.com/questions/352670/weighted-random-selection-with-and-without-replacement – Sukrit Kalra Mar 28 '14 at 20:02
  • 1
    Eli Bendersky's page on [weighted random selection](http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/) in Python is very informative. – DSM Mar 28 '14 at 20:06
  • @DSM that page was _extremely_ helpful. Thank you! – IrateIrish Mar 28 '14 at 20:12

3 Answers3

6

An easy algorithm for weighted selection is:

  1. Assign each name its relative probability, such that the sum of all probabilities is 1. This relative value is called "weight".

  2. Select a random number between 0 and 1

  3. Walk the list, substracting the weight of each item from your number as you go

  4. When you go to 0 or below, pick the current item.

salezica
  • 74,081
  • 25
  • 105
  • 166
  • This could work, but the problem is(maybe) that I'm selecting from ~1200 names, 1000 times. Would this approach take a long time, then? – IrateIrish Mar 28 '14 at 20:20
  • 1
    You can't go much faster than this: it runs in linear time, with an almost minimal constant factor. The weights are, obviously, computed only once, before random picks are made – salezica Mar 28 '14 at 21:56
2

The third column of the data file is the cumulative probability, the running sum of the second column.

To select a random name with respect to the cumulative probability distribution:

  1. Generate a random number between 0 and 1,
  2. Find the first row whose cumulative probability is bigger than that random number.
  3. Select the name in that row.

import urllib2
import random
import bisect

url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'
response = urllib2.urlopen(url)
names, cumprobs = [], []
for line in response:
    name, prob, cumprob, rank = line.split()
    cumprob = float(cumprob)
    names.append(name)
    cumprobs.append(cumprob)

# normalize the cumulative probabilities to the range [0, 1]
cumprobs = [p/cumprobs[-1] for p in cumprobs]
# print(cumprobs)

# Generate 1000 names at random, using the cumulative probability distribution
N = 1000
selected = [names[bisect.bisect(cumprobs, random.random())] for i in xrange(N)]
print('\n'.join(selected))

Note, the alias method has better computational complexity, but for selection of a mere 1000 items, that may not be very important for your use case.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
0

A quick and VERY dirty hack that will work for smaller datasets is just to add the name in question a number of times equal to the weighted distribution. Note that this will consume a whole ton of memory, especially in larger datasets, so consider this as a quick implementation for small weighted distributions ONLY.

import random

filename = r"location/of/file"
data = list() # accumulator

with open(filename) as in_:
    for line in in_:
        name, prob, *_ = line.split()
        for _ in range(int(float(prob)*1000)):
            data.append(name)

print(random.choice(data))
Adam Smith
  • 52,157
  • 12
  • 73
  • 112