
If I evaluate something like:

numpy.random.choice(2, size=100000, p=[0.01, 0.99])

using one uniformly-distributed random float, say r, and deciding whether r < 0.01, will presumably waste many of the random bits (entropy) generated. I've heard (second-hand) that generating pseudo-random numbers is computationally expensive, so I assumed that numpy would not be doing that, and would instead use a scheme like arithmetic coding in this case.
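For concreteness, a minimal sketch of the one-float-per-sample scheme I have in mind (just an illustration of the idea, not a claim about numpy's actual implementation):

import numpy

# one uniform float per sample; each float carries ~52 random bits,
# but only a single comparison against 0.01 is used
r = numpy.random.random(size=100000)
samples = numpy.where(r < 0.01, 0, 1)  # 0 with probability 0.01, 1 otherwise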

However, at first glance it appears that choice does indeed generate a float for every sample it is asked for. Further, a quick timeit experiment shows that generating n uniform floats is actually quicker than n samples from p=[0.01, 0.99].

>>> timeit.timeit(lambda : numpy.random.choice(2, size=100000, p=[0.01, 0.99]), number=1000)
1.74494537999999
>>> timeit.timeit(lambda : numpy.random.random(size=100000), number=1000)
0.8165735180009506

Does choice really generate a float for each sample, as it would appear? Would it not significantly improve performance to use some compression algorithm in some cases (particularly if size is large and p is distributed unevenly)? If not, why not?

L.Grozinger
  • One of the great things about Python is that most packages are open source, so you can investigate these questions easily enough. It seems as though you may have already answered your own question, but you can see exactly how numpy.random.choice is implemented at https://github.com/numpy/numpy/blob/master/numpy/random/mtrand.pyx#L805 – Joran Beasley Jul 30 '20 at 19:43
  • Perhaps I should edit the title. The question is really why does it make sense to generate more random bits than you need. – L.Grozinger Jul 30 '20 at 19:45
  • That seems like a great issue you should file on the GitHub repo listed above and submit a MR to numpy for :)... then you can even say you contributed to a package used by millions (probably millions... maybe only 100's of thousands) – Joran Beasley Jul 30 '20 at 19:47
  • @JoranBeasley :) ok will do. I was only hoping some wiser soul would point out the error in my reasoning above – L.Grozinger Jul 30 '20 at 19:50
  • I honestly don't know ... random is close to a magic black box as far as I'm concerned ... and seems "fast enough" for me – Joran Beasley Jul 30 '20 at 19:53
  • Generating pseudo-random numbers isn’t particularly expensive with many decent RNGs, no. (I don’t know which one NumPy uses by default, though – is it PCG64 (fast)? Did that change recently?) But how would arithmetic encoding save on calls to the RNG? – Ry- Jul 30 '20 at 19:56
  • Most of us can't answer "why" questions because we aren't developers and haven't followed all the development issues (going back 10-20 years). But my guess is that no one has felt the need or interest in optimizing the performance of `choice` (either in the old or new versions). I see a lot more examples (on SO) of uniform random floats and integers than uses of `choice`. "why" questions are borderline opinion questions. – hpaulj Jul 31 '20 at 05:35

1 Answer


Since NumPy 1.17, the reason is largely backward compatibility. See also this question and this question.

As of NumPy 1.17, numpy.random.* functions, including numpy.random.choice, are legacy functions and "SHALL remain the same as they currently are", according to NumPy's new RNG policy, which also introduced a new random generation system for NumPy. The reasons for making them legacy functions include the recommendation to avoid global state. Even so, NumPy did not deprecate any numpy.random.* functions in version 1.17, although a future version of NumPy might.
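For new code, the non-legacy counterpart of numpy.random.choice lives on the new Generator objects; a minimal sketch of that newer interface:

import numpy

rng = numpy.random.default_rng()  # new-style Generator (PCG64 bit generator by default)
samples = rng.choice(2, size=100000, p=[0.01, 0.99])  # Generator.choice, the non-legacy counterpart

Unlike the legacy functions, the implementation behind Generator methods is allowed to change between NumPy versions.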

Recall that in your examples, numpy.random.choice takes an array of floats as weights. An array of integer weights would lead to more exact random number generation. And although any float could be converted to a rational number (leading to rational-valued weights and thus integer weights), the legacy NumPy version appears not to do this. These and other implementation decisions in numpy.random.choice can't be changed without breaking backward compatibility.
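As an illustration of that float-to-integer-weight conversion (a sketch only; the legacy numpy.random.choice does not do this):

from fractions import Fraction
import math

p = [0.01, 0.99]
fracs = [Fraction(x) for x in p]                    # exact rational value of each float
denom = math.lcm(*(f.denominator for f in fracs))   # common denominator (math.lcm needs Python 3.9+)
int_weights = [int(f * denom) for f in fracs]       # exact integer weights proportional to the floats

Note that the resulting weights are exact for the floats' actual values, which differ slightly from the intended 1/100 and 99/100, since 0.01 is not exactly representable in binary.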

By the way, arithmetic coding is not the only algorithm that seeks to avoid wasting bits. Perhaps the canonical algorithm for sampling from a discrete distribution is the Knuth and Yao algorithm (1976), which exactly chooses a random integer based on the binary expansions of the probabilities involved, and treats the problem as a random walk on a binary tree. (On average, this algorithm uses at most 2 bits more than the theoretical lower bound.) Any other integer-generating algorithm can ultimately be described in the same way, namely as a random walk on a binary tree. For example, the Fast Loaded Dice Roller is a recent algorithm with a guaranteed bound on the average number of bits it uses (in this case, no more than 6 bits above the theoretical lower bound). The Han and Hoshi algorithm (from 1997) is another of this kind, but uses cumulative probabilities. See also my section, "Weighted Choice With Replacement".
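To make the random-walk idea concrete, here is a minimal sketch of a Knuth-Yao-style sampler for integer weights (a simplified illustration, not production code and not how NumPy implements choice):

import random

def knuth_yao_sample(weights):
    # Return index i with probability weights[i] / sum(weights),
    # consuming random bits one at a time (Knuth-Yao random walk).
    total = sum(weights)
    u = 0  # position of the walk among the "live" nodes at the current depth
    k = 0  # current depth = bit index into the binary expansions of weights[i]/total
    while True:
        u = 2 * u + random.getrandbits(1)  # descend one level using one random bit
        k += 1
        c = 0
        for i, w in enumerate(weights):
            # outcome i owns a leaf at depth k iff the k-th binary digit of w/total is 1
            c += (w << k) // total & 1
            if u < c:
                return i
        u -= c  # keep walking among this level's internal (non-leaf) nodes

# e.g. knuth_yao_sample([1, 99]) returns 0 with probability exactly 1/100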

Peter O.
  • Great answer. Upvoted. Just curious: is it not possible to include both a new version and the old one for backward compatibility, depending on the NumPy version? – Ehsan Jul 30 '20 at 20:18
  • This is interesting. I wonder if there would be any interest in a weighted choice implementation for numpy? (or if one is hidden somewhere already?) – L.Grozinger Jul 30 '20 at 20:19
  • 1
    @Ehsan: `numpy.random.Generator.choice` (the new version of `numpy.random.choice`) and other methods in `numpy.random.Generator` may have a different implementation in different NumPy versions. – Peter O. Jul 30 '20 at 20:21
  • 1
    Hmm, but why didn't they use (something like) it before that legacy decision? Are there reasons against using it? Like, is it not worth it because the overhead work takes more time than generating extra random bits would? – superb rain Jul 30 '20 at 20:30
  • @superbrain: You will have to ask the creator of NumPy. I don't know the answer, and whatever answer I can give will be speculation. – Peter O. Jul 30 '20 at 20:33
  • @superbrain there will be overhead of course, but my guess is that there is a bound on the entropy of your RV, after which you start winning on performance. The question is where that bound would have to be, and, given that bound, whether it's then worth taking the time to implement. – L.Grozinger Jul 30 '20 at 20:45