
I have a series of text files of various (and often changing) lengths. The data stored in the file is in a particular order from most common (top) to least (bottom). I want to randomly select a line from the file weighted towards the top entries - for example, if there are 322 entries in a file, line 1 would be 322 times more likely to be chosen than line 322.

I've been appending the contents of the file to a list to get the length via the len function and then approaching it as a math problem, but I'm wondering (hoping) Python has a cleverer way to accomplish this?

Groundhog
  • `The data stored in the file is in a particular order from most common (top) to least (bottom)` The most common *what*? Can you give an example of file content? – Anto Aug 05 '14 at 11:49
  • 1
    Each line of the file is a city, ranked in order of population (highest to lowest). – Groundhog Aug 05 '14 at 11:51
  • Perhaps this one can help: [Random Python dictionary key, weighted by values](http://stackoverflow.com/questions/1056151/random-python-dictionary-key-weighted-by-values) – Anto Aug 05 '14 at 11:58

3 Answers


Assuming that you store the values in a dictionary like this:

cities = {
    'City1': 3298181,
    'City2': 3013491,
    'City3': 900129,
    ...
}

You can use the random library to do this:

from random import choice

choice([k for k in cities for x in xrange(cities[k])])

Explanation:

The list comprehension inside choice builds a list with every city name repeated as many times as the number of people living there.

Example:

 >>> cities = {'1': 3, '2': 1}
 >>> [k for k in cities for x in xrange(cities[k])]
 ['1', '1', '1', '2']

Be careful with this approach because if there are a lot of cities, each one with a lot of people, the array will become really huge.

Also, on Python 2, DO NOT use range() instead of xrange(): range() builds the entire list in memory up front, which compounds the memory problem. (On Python 3, range() is lazy like xrange() was.)
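To avoid materializing that huge list at all, you can sample against a cumulative distribution of the populations instead. This is a minimal sketch for Python 3, using a small hypothetical cities dict as stand-in data:

```python
import bisect
import random
from itertools import accumulate  # Python 3.2+

# Hypothetical example populations; substitute your own data.
cities = {'City1': 3298181, 'City2': 3013491, 'City3': 900129}

names = list(cities)
cumulative = list(accumulate(cities[name] for name in names))

def pick_city():
    # Choose a point in [0, total population) and find the city
    # whose cumulative slice contains it.
    x = random.random() * cumulative[-1]
    return names[bisect.bisect(cumulative, x)]
```

This uses O(number of cities) memory rather than O(total population), while producing the same weighted distribution.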

Marco Bonelli

The accepted answer does not appear to match the OP's requirements as written (although it may in practice), so here is another answer that tackles the general problem of randomly selecting a line from a file with weighted probabilities. It is based on the random module examples in the Python 3 documentation.

In this case, line 1 of a file is to be selected with greater probability than the last line, and with reducing probability for intervening lines, so our weights would be range(n, 0, -1) where n is the number of lines in the file, e.g. if there were 5 lines in the file, then the weights would be [5, 4, 3, 2, 1] and this would correspond to probabilities of:

>>> weights = range(5, 0, -1)
>>> total_weights = float(sum(weights))
>>> probabilities = [w/total_weights for w in weights]
>>> [round(p, 5) for p in probabilities]    # rounded for readability
[0.33333, 0.26667, 0.2, 0.13333, 0.06667]

So the first line has probability 5 times greater than the last line, with reducing probability for each line.

Next we need to construct a cumulative distribution from the weights, draw a random value within its range, locate that value in the distribution with bisection, and use the resulting index to retrieve a line from the file. Here is some code that does that.

import bisect
import random
try:
    from itertools import accumulate     # Python >= 3.2
except ImportError:
    def accumulate(weights):
        accumulator = 0
        for w in weights:
            accumulator += w
            yield accumulator

def count(iterable):
    return sum(1 for elem in iterable)

def get_nth(iterable, n):
    assert isinstance(n, int), "n must be an integer, got %r" % type(n)
    assert n > 0, "n must be greater than 0, got %r" % n
    for i, elem in enumerate(iterable, 1):
        if i == n:
            return elem

def weighted_select(filename):
    with open(filename) as f:
        n = count(f)
        if n == 0:
            return None

        # set up cumulative distribution
        weights = range(n, 0, -1)
        cumulative_dist = list(accumulate(weights))

        # select line number
        x = random.random() * cumulative_dist[-1]
        selected_line = bisect.bisect(cumulative_dist, x)

        # retrieve line from file
        f.seek(0)
        return get_nth(f, selected_line + 1)    # N.B. +1 for nth line

This uses weights according to my interpretation of the question. It's easy enough to adapt this to other weights, e.g. if you wanted a weighted select with city population as the weights, you'd just change weights = range(n, 0, -1) to a list of populations corresponding to each line in the file.
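On Python 3.6 and later, the cumulative-distribution machinery above is built into random.choices, which accepts per-item weights directly. A sketch of the same line selection (it reads the whole file into memory, unlike the streaming version above):

```python
import random

def weighted_select_choices(filename):
    """Pick one line; line 1 is n times likelier than line n (Python 3.6+)."""
    with open(filename) as f:
        lines = f.read().splitlines()
    if not lines:
        return None
    n = len(lines)
    # weights n, n-1, ..., 1 match the range(n, 0, -1) scheme above
    return random.choices(lines, weights=range(n, 0, -1))[0]
```

As before, swapping range(n, 0, -1) for a list of per-line populations gives population-weighted selection instead.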

mhawke
  • I got the first solution to work using in a fashion with a bit of tweaking (and about 15 more lines of code...), but this a very elegant solution that worked a treat. Thank you! – Groundhog Aug 08 '14 at 09:37

This is a very famous scenario, and I think it goes by the name of reservoir sampling. The following code will work in Python:

    from random import random

    f = open(filename, 'r')
    out = None
    n = 0
    for line in f:
        n = n + 1
        if random() < 1.0 / n:
            out = line
    print(out)
  • This is not the right solution, because if you have a list of cities, for example, like: `1: 100, 2: 90, 3: 90, 4: 10` then the third city will have less chance of being chosen than the second one. – Marco Bonelli Aug 05 '14 at 12:12
  • I don't get why it will be wrong. As per the question the probability of selection lines in order is 1, 1/2, 1/3.... If n is the length of the file I want the probability of selection each line to be 1/n. Lets say n = 4. Now the probability of selecting 4th line is 1/4. Probability of selecting 3rd line is 1/3*(probability 4th line is not selected) = 1/3*(1-1/4) = 1/4. Thus the probability of selection each line is 1/4 which is the desired output. – Neerav Kumar Aug 05 '14 at 12:39
  • Really, reading the question again I got confused, because the OP asked for 1, 1/2, 1/3, 1/n. But then chosen my answer. IDK. – Marco Bonelli Aug 05 '14 at 12:43
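The code in that last answer selects uniformly, not weighted. A single-pass *weighted* variant does exist: Efraimidis–Spirakis weighted reservoir sampling. The sketch below (not from any answer here) uses the 1, 1/2, 1/3, ..., 1/n weighting the comments discuss, which conveniently needs no advance knowledge of the line count:

```python
import random

def weighted_reservoir_line(filename):
    """Single pass, O(1) memory: line i is kept with weight 1/i.

    Efraimidis-Spirakis weighted reservoir sampling for a sample of
    size 1: each line gets the key random() ** (1 / weight), and the
    line with the largest key wins. With weight 1/i the key
    simplifies to random() ** i.
    """
    best_key, best_line = -1.0, None
    with open(filename) as f:
        for i, line in enumerate(f, 1):
            key = random.random() ** i
            if key > best_key:
                best_key, best_line = key, line
    return best_line
```

Each line's probability of being returned is proportional to its weight, so line 1 is still favored over later lines, in one pass over the file.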