18

I have an array of N elements (representing the N letters of a given alphabet), and each cell of the array holds an integer value, that integer value meaning the number of occurrences in a given text of that letter. Now I want to randomly choose a letter from all of the letters in the alphabet, based on his number of appearances with the given constraints:

  • If the letter has a positive (nonzero) value, then it can be always chosen by the algorithm (with a bigger or smaller probability, of course).

  • If a letter A has a higher value than a letter B, then it has to be more likely to be chosen by the algorithm.

Now, taking that into account, I've come up with a simple algorithm that might do the job, but I was just wondering if there was a better thing to do. This seems to be quite fundamental, and I think there might be more clever things to do in order to accomplish this more efficiently. This is the algorithm i thought:

  • Add up all the frequencies in the array. Store it in SUM
  • Choosing up a random value from 0 to SUM. Store it in RAN
  • [While] RAN > 0, Starting from the first, visit each cell in the array (in order), and subtract the value of that cell from RAN
  • The last visited cell is the chosen one

So, is there a better thing to do than this? Am I missing something?

I'm aware most modern computers can compute this so fast I won't even notice if my algorithm is inefficient, so this is more of a theoretical question rather than a practical one.

I prefer an explained algorithm rather than just code for an answer, but If you're more comfortable providing your answer in code, I have no problem with that.

Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
Setzer22
  • 1,589
  • 2
  • 13
  • 29

2 Answers2

19

The idea:

  • Iterate through all the elements and set the value of each element as the cumulative frequency thus far.
  • Generate a random number between 1 and the sum of all frequencies
  • Do a binary search on the values for this number (finding the first value greater than or equal to the number).

Example:

Element    A B C D
Frequency  1 4 3 2
Cumulative 1 5 8 10

Generate a random number in the range 1-10 (1+4+3+2 = 10, the same as the last value in the cumulative list), do a binary search, which will return values as follows:

Number   Element returned
1        A
2        B
3        B
4        B
5        B
6        C
7        C
8        C
9        D
10       D
Bernhard Barker
  • 54,589
  • 14
  • 104
  • 138
10

The Alias Method has amortized O(1) time per value generated, but requires two uniforms per lookup. Basically, you create a table where each column contains one of the values to be generated, a second value called an alias, and a conditional probability of choosing between the value and its alias. Use your first uniform to pick any of the columns with equal likelihood. Then choose between the primary value and the alias based on your second uniform. It takes a O(n log n) work to initially set up a valid table for n values, but after the table's built generating values is constant time. You can download this Ruby gem to see an actual implementation.

Two other very fast methods by Marsaglia et al. are described here. They have provided C implementations.

pjs
  • 18,696
  • 4
  • 27
  • 56
  • 1
    +1 an illustration of the Vose Alias here by Kent Beck https://www.facebook.com/note.php?note_id=323786247654246 – aka.nice Jun 28 '13 at 20:45
  • @aka.nice And in Smalltalk, too. Lovely! – pjs Jun 28 '13 at 21:17
  • +1. Thanks for sharing this - I have a project where I was using the binary search method described above, and your methods look to be a substantial improvement. Reading that Marsaglia paper, am I reading it right that Marsaglia method II is essentially just the Alias method? I was surprised that method I was faster; I don't suppose you can share any intuition as to why that was so? – Julian Oct 23 '14 at 00:47
  • 1
    To add to aka.nice's comment, Keith Schwarz has a visual explanation (linked to in the facebook note) at http://www.keithschwarz.com/darts-dice-coins/ It's really cool! – Mark Peschel May 07 '17 at 03:26
  • @pjs I see you changed your time claim from O(n) to O(n log n). However, Darts, Dice, and Coins claims the Vose Alias runs in O(n), and assuming O(1) amortized for adding and removing from a set, I agree. Do I misunderstand? – Mark Peschel May 07 '17 at 03:30
  • @pjs Never mind; I misunderstood. The time claim O(n) for building the table assumes that the elements have been sorted already; O(n log n) takes sorting time into account. – Mark Peschel Aug 30 '17 at 12:55