Randomly selecting letters by frequency of use

Question

After feeding few Shakespeare books to my Perl script I have a hash with 26 english letters as keys and the number of their occurences in texts - as value:

%freq = (
    a => 24645246,
    b => 1409459,
    ....
    z => 807451,
);

and of course the total number of all letters - let's say in the $total variable.

Is there please a nice trick to generate a string holding 16 random letters (a letter can occur several times there) - weighted by their frequency of use?

To be used in a word game similar to Ruzzle:

enter image description here

Something elegant - like picking a random line from a file, as suggested by a Perl Cookbook receipt:

rand($.) < 1 && ($line = $_) while <>;

Julien · Answer 1 · 2013-03-07T09:47:46.070

I have no clue about Perl syntax so I'll just write pseudo-code. You can do something like that

sum <= 0
foreach (letter in {a, z})
  sum <= sum + freq[letter]
pick r, a random integer in [0, sum[ 
letter <= 'a' - 1
do
  letter <= letter + 1
  r <= r - freq(letter)
while r > 0

letter is the resulting value

The idea behind this code is to make a stack of boxes for each letter. The size of each box is the frequency of the letter. Then we choose a random location on this stack and see which letter's box we landed.

Example :

freq(a) = 5
freq(b) = 3
freq(c) = 3
sum = 11

|    a    |  b  |  c  | 
 - - - - - - - - - - -

When we choose a 0 <= r < 11, we have the following probabilities

Pick a 'a' = 5 / 11
Pick a 'b' = 3 / 11
Pick a 'c' = 3 / 11

Which is exactly what we want.

Thanks, this makes sense. But I suspect there might be some kind of `rand(1/$total)` way too... — Alexander Farber, Mar 07 '13 at 11:53

Ilmari Karonen · Accepted Answer · 2013-03-07T12:10:52.050

The Perl Cookbook trick for picking a random line (which can also be found in perlfaq5) can be adapted for weighted sampling too:

my $chosen;
my $sum = 0;
foreach my $item (keys %freq) {
    $sum += $freq{$item};
    $chosen = $item if rand($sum) < $freq{$item};
}

Here, $sum corresponds to the line counter $. and $freq{$item} to the constant 1 in the Cookbook version.

If you're going to be picking a lot of weighted random samples, you can speed this up a bit with some preparation (note that this destroys %freq, so make a copy first if you want to keep it):

# first, scale all frequencies so that the average frequency is 1:
my $avg = 0;
$avg += $_ for values %freq;
$avg /= keys %freq;
$_ /= $avg for values %freq;

# now, prepare the array we'll need for fast weighted sampling:
my @lookup;
while (keys %freq) {
    my ($lo, $hi) = (sort {$freq{$a} <=> $freq{$b}} keys %freq)[0, -1];
    push @lookup, [$lo, $hi, $freq{$lo} + @lookup];
    $freq{$hi} -= (1 - $freq{$lo});
    delete $freq{$lo};
}

Now, to draw a random weighted sample from the prepared distribution, you just do this:

my $r = rand @lookup;
my ($lo, $hi, $threshold) = @{$lookup[$r]};
my $chosen = ($r < $threshold ? $lo : $hi);

(This is basically the Square Histogram method described in Marsaglia, Tsang & Wang (2004), "Fast Generation of Discrete Random Variables", J. Stat. Soft. 11(3) and originally due to A.J. Walker (1974).)

Should I always run through the whole loop OR call `last` if the `if` is true? — Alexander Farber, Mar 07 '13 at 12:02
@AlexanderFarber: You should always run through the whole loop, since the chosen item may (and usually will) change several times during the loop. — Ilmari Karonen, Mar 07 '13 at 12:11

score 2 · Answer 3 · edited May 23 '17 at 12:11

You can first built a table of the running sum of the frequency. So if you have the following data:

%freq = (
    a => 15,
    b => 25,
    c => 30,
    d => 20
);

the running sum would be;

%running_sums = (
    a => 0,  
    b => 15, 
    c => 40, # 15 + 25
    d => 70, # 15 + 25 + 30
);
$max_sum = 90; # 15 + 25 + 30 + 20

To pick a single letter with the weighted frequency, you need to select a number between [0,90), then you can do a linear search on the running_sum table for the range that includes the letter. For example, if your random number is 20 then the appropriate range is 15-40, which is for the letter 'b'. Using linear search gives a total running time of O(m*n) where m is the number of letters we need and n is the size of the alphabet (therefore m=16, n=26). This is essentially what @default locale do.

Instead of linear search, you can also do a binary search on the running_sum table to get the closest number rounded down. This gives a total running time of O(m*log(n)).

For picking m letters though, there is a faster way than O(m*log(n)), perticularly if n < m. First you generate m random numbers in sorted order (which can be done without sorting in O(n)) then you do a linear matching for the ranges between the list of sorted random numbers and the list of running sums. This gives a total runtime of O(m+n). The code in its entirety running in Ideone.

use List::Util qw(shuffle);

my %freq = (...);

# list of letters in sorted order, i.e. "a", "b", "c", ..., "x", "y", "z"
# sorting is O(n*log(n)) but it can be avoided if you already have 
# a list of letters you're interested in using
my @letters = sort keys %freq;

# compute the running_sums table in O(n)
my $sum = 0;
my %running_sum;
for(@letters) {
    $running_sum{$_} = $sum;
    $sum += $freq{$_};
}

# generate a string with letters in $freq frequency in O(m)
my $curmax = 1;
my $curletter = $#letters;
my $i = 16; # the number of letters we want to generate
my @result;
while ($i > 0) {
    # $curmax generates a uniformly distributed decreasing random number in [0,1)
    # see http://repository.cmu.edu/cgi/viewcontent.cgi?article=3483&context=compsci
    $curmax = $curmax * (1-rand())**(1. / $i);

    # scale the random number $curmax to [0,$sum)
    my $num = int ($curmax * $sum);

    # find the range that includes $num
    while ($num < $running_sum{$letters[$curletter]}) {
        $curletter--;
    }

    push(@result, $letters[$curletter]);

    $i--;
}

# since $result is sorted, you may want to use shuffle it first
# Fisher-Yates shuffle is O(m)
print "", join('', shuffle(@result));

Randomly selecting letters by frequency of use

3 Answers3