9

EDIT: My question is: rand()%N is considered very bad, whereas the use of integer arithmetic is considered superior, but I cannot see the difference between the two.

People always mention:

  • low bits are not random in rand()%N,

  • rand()%N is very predictable,

  • you can use it for games but not for cryptography

Can someone explain whether any of these points actually apply here, and how one would see that?

If the lower bits were non-random, that should make the permutation entropy (PE) of the two cases I show differ, but it doesn't.

I guess many, like me, would always avoid using rand(), or rand()%N, because we've always been taught that it is pretty bad. I was curious to see how "wrong" random integers generated with C's rand()%N actually are. This is also a follow-up to Ryan Reich's answer in How to generate a random integer number from within a range.

The explanation there sounds very convincing, to be honest; nevertheless, I thought I'd give it a try. So, I compare the distributions in a VERY naive way: I run both random generators for different numbers of samples and domains. I didn't see the point of computing a density instead of a histogram, so I just computed histograms and, just by looking, I would say they both look equally uniform. Regarding the other point that was raised, about the actual randomness (despite being uniformly distributed), I (again naively) compute the permutation entropy for these runs. It comes out the same for both sample sets, which tells us that there is no difference between the two regarding the ordering of the occurrences.

So, for many purposes, it seems to me that rand()%N would be just fine. How can we see its flaws?

Here I show you a very simple, inefficient, and not very elegant (but, I think, correct) way of computing these samples and getting the histograms together with the permutation entropies. I show plots for domains (0,i) with i in {5,10,25,50,100} for different numbers of samples:

5 values, 5k samples

10 values, 10k samples

25 values, 250k samples

100 values, 1M samples

There's not much to see in the code, I guess, so I will leave both the C and the MATLAB code here for replication purposes.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char *argv[]){
        /* usage: ./randomTest.exe <method: 1 = modulo, 2 = integer arithmetic> <max> <samples> */
        if(argc < 4){
                fprintf(stderr, "usage: %s <1|2> <max> <samples>\n", argv[0]);
                return 1;
        }
        unsigned long max = strtoul(argv[2], NULL, 10);
        int samples = atoi(argv[3]);
        srand(time(NULL));
        if(atoi(argv[1])==1){
                /* Method 1: plain modulo. */
                for(int i=0;i<samples;++i)
                        printf("%lu\n", rand()%(max+1));

        }else{
                /* Method 2: rejection sampling plus integer division.
                   None of these values depend on i, so compute them once. */
                unsigned long
                num_bins = max + 1,
                num_rand = (unsigned long) RAND_MAX + 1,
                bin_size = num_rand / num_bins,
                defect   = num_rand % num_bins;

                for(int i=0;i<samples;++i){
                        long x;
                        do {
                                x = rand();
                        }
                        while (num_rand - defect <= (unsigned long)x); /* reject the biased tail */
                        printf("%lu\n", (unsigned long)x/bin_size);
                }
        }
        return 0;
}

And here is the MATLAB code to plot this and compute the PEs (I took the recursion for the permutations from: https://www.mathworks.com/matlabcentral/answers/308255-how-to-generate-all-possible-permutations-without-using-the-function-perms-randperm):

system('gcc randomTest.c -o randomTest.exe;');
max = 100;
samples = max*10000;
trials = 200;
% Generate one file of samples per method.
system(['./randomTest.exe 1 ' num2str(max) ' ' num2str(samples) ' > file1'])
system(['./randomTest.exe 2 ' num2str(max) ' ' num2str(samples) ' > file2'])
a1=load('file1');
a2=load('file2');
uni = figure(1);
title(['Samples: ' num2str(samples)])
subplot(1,3,1)
h1 = histogram(a1,max+1);
title('rand%(max+1)')
subplot(1,3,2)
h2 = histogram(a2,max+1);
title('Integer arithmetic')
% Permutation entropy for sequence lengths n = 3..8.
as=[a1,a2];
ns=3:8;
H = nan(ns(end),size(as,2)); % rows are indexed directly by n below
for op=1:size(as,2)
    x = as(:,op);
    for n=ns
        sequenceOcurrence = zeros(1,factorial(n));
        sequences = myperms(1:n);
        % Encode each permutation of 1..n as a base-10 integer for fast lookup.
        sequencesArrayIdx = sum(sequences.*10.^(size(sequences,2)-1:-1:0),2);
        for i=1:numel(x)-n+1 % one window per chunk (numel(x)-n dropped the last window)
            [~,sequenceOrder] = sort(x(i:i+n-1));
            out = sequenceOrder'*10.^(numel(sequenceOrder)-1:-1:0).';
            sequenceOcurrence(sequencesArrayIdx == out) = sequenceOcurrence(sequencesArrayIdx == out) + 1;
        end
        chunks = length(x) - n + 1;
        ps = sequenceOcurrence/chunks;
        % Normalized PE; this sign convention approaches -1 when all orderings
        % are equally likely.
        hh = sum(ps(logical(ps)).*log2(ps(logical(ps))));
        H(n,op) = hh/log2(factorial(n));
    end
end
subplot(1,3,3)
plot(ns,H(ns,:),'--*','linewidth',2)
ylabel('PE')
xlabel('Sequence length')
filename = ['all_' num2str(max) '_' num2str(samples) ];
export_fig(filename)
myradio
  • What is "bad"? It's bad for cryptography. But it is OK for games (where not much money is involved)... – Eugene Sh. Apr 17 '18 at 14:09
  • @MitchWheat I don't really get it; I am actually checking for that, and I don't see any bias compared to the other method that is considered much better in that sense. – myradio Apr 17 '18 at 14:11
  • @STLDeveloper My question is exactly that: rand()%N is considered very bad, whereas the use of integer arithmetic is considered way superior, but I cannot see where the difference is (which I honestly believe probably exists). – myradio Apr 17 '18 at 14:13
  • Then you should state that in the body of your question. – STLDev Apr 17 '18 at 14:14
  • Fair enough, I attempted that in the title but I guess I can add what I just told you to the body of the question. – myradio Apr 17 '18 at 14:16
  • I can't remember which one it was, but sometime within the last few months I used a system where `rand()%2` alternated 0,1,0,1... So it can definitely still be a problem, and code that aspires to both portability and good randomness unfortunately still can't trust the low-order bits of `rand()`. – Steve Summit Apr 17 '18 at 14:23
  • If `N` is `10000` and `RAND_MAX` is `32767`, you can expect the `[0..2767]` range to come up much more often with `rand() % N`. – vgru Apr 17 '18 at 14:33
  • @Groo I'm not sure I understand. `N` as in `rand()%N` goes only up to 100 in what I show. The number of samples is of course much higher. – myradio Apr 17 '18 at 14:38
  • @Groo: Indeed, which proves that RAND_MAX is not that low. – Bathsheba Apr 17 '18 at 14:38
  • @myradio: presume that `RAND_MAX` is `110`, for simplicity. Then `rand() % 100` will return `0..99` when `rand() < 100`, or `0..10` when `rand() >= 100`, i.e. the chance of getting a number within range `0..10` would be larger (`(11+11)/111`, i.e. roughly `20%`). – vgru Apr 17 '18 at 14:42
  • So even if `RAND_MAX` is `32767` but `N` is smallish, it won't be that noticeable -- i.e. with `rand() % 100` you would get the "extra" `0..67` only when `rand() >= 32700`. – vgru Apr 17 '18 at 14:56
  • Does this answer your question? [Why do people say there is modulo bias when using a random number generator?](https://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator) – Peter O. Jul 15 '20 at 16:23

3 Answers

3

Due to the way modulo arithmetic works, if N is significant compared to RAND_MAX, doing %N will make you considerably more likely to get some values than others. Imagine RAND_MAX is 11, so that rand() returns one of 12 equally likely values, and N is 9. If the distribution is good, then the chance of getting one of 0, 1, or 2 is 0.5, and the chance of getting one of 3, 4, 5, 6, 7, 8 is also 0.5. The result is that you're twice as likely to get a 0 as you are a 4.

If N exactly divides RAND_MAX + 1, this distribution problem doesn't happen, and if N is very small compared to RAND_MAX, the issue becomes less noticeable. RAND_MAX may not be a particularly large value (possibly 2^15 - 1), making this problem worse than you might expect. The alternative of doing (rand() * N) / (RAND_MAX + 1) also doesn't give an even distribution; however, it will be every mth value (for some m) that is more likely to occur, rather than the more likely values all being at the low end of the distribution.

If N is 75% of RAND_MAX, then the values in the bottom third of your distribution are twice as likely as the values in the top two thirds (as this is where the extra values map to).
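To make that pigeonholing concrete, here is a small standalone sketch (my illustration, not part of the original answer) that replays the 12-output example above with a perfect imaginary generator:

#include <stdio.h>

/* Hypothetical illustration: a perfect generator with 12 equally likely
   outputs (0..11, i.e. an imaginary RAND_MAX of 11) reduced modulo N = 9. */
int main(void){
        enum { NUM_OUTPUTS = 12, N = 9 };
        int counts[N] = {0};

        for(int r = 0; r < NUM_OUTPUTS; ++r)
                counts[r % N]++;   /* tally where each raw value lands */

        for(int v = 0; v < N; ++v)
                printf("P(%d) = %d/%d\n", v, counts[v], NUM_OUTPUTS);
        return 0;
}

It prints 2/12 for each of the values 0..2 and 1/12 for each of 3..8: the extra raw values all pile up at the low end of the range.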

The quality of rand() will depend on the implementation on the system that you're on. I believe that some systems have had very poor implementations; OS X's man pages declare rand obsolete. The Debian man page says the following:

The versions of rand() and srand() in the Linux C Library use the same random number generator as random(3) and srandom(3), so the lower-order bits should be as random as the higher-order bits. However, on older rand() implementations, and on current implementations on different systems, the lower-order bits are much less random than the higher-order bits. Do not use this function in applications intended to be portable when good randomness is needed. (Use random(3) instead.)

James Snook
  • What you say is true, *but*, most of the alternatives to doing `%N` also have this problem if N is not small compared to RAND_MAX. And this is not the problem people are usually referring to when they observe that the implementation of `rand()` may be poor. – Steve Summit Apr 17 '18 at 17:34
1

Both approaches have their pitfalls, and your graphs are little more than a pretty verification of the central limit theorem! For a sensible implementation of rand():

  1. % N suffers from a "pigeon-holing" effect if 1u + RAND_MAX is not a multiple of N

  2. /((RAND_MAX + 1u)/N) does not, in general, evenly distribute the return of rand across your range, due to integer truncation effects (see the sketch below).

On balance, if N is small cf. RAND_MAX, I'd plump for % for its tractability. In any case, test your generator to see if it has the appropriate statistical properties for your application.
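To illustrate point 2 (my sketch, not Bathsheba's), push a perfect imaginary 12-output generator through the division form with N = 5; the truncated divisor makes the method overshoot the intended range:

#include <stdio.h>

/* Hypothetical illustration of point 2: a perfect generator with 12 equally
   likely outputs (an imaginary RAND_MAX of 11) pushed through
   x / ((RAND_MAX + 1u) / N) with N = 5. */
int main(void){
        enum { NUM_OUTPUTS = 12, N = 5 };
        unsigned divisor = NUM_OUTPUTS / N;   /* 12/5 truncates to 2 */
        int counts[N + 1] = {0};              /* one extra bin: the method can overshoot */

        for(unsigned x = 0; x < NUM_OUTPUTS; ++x)
                counts[x / divisor]++;

        for(int v = 0; v <= N; ++v)
                printf("value %d: %d/%d%s\n", v, counts[v], NUM_OUTPUTS,
                       v == N ? "  <-- out of range" : "");
        return 0;
}

With these numbers every bin happens to get 2/12, but the last 2/12 lands on the value 5, outside the intended range [0, N); with other combinations of RAND_MAX and N the in-range bins come out uneven instead.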

Bathsheba
  • You can test the generator you're using today, but that doesn't tell you much about the generator your code might be compiled against tomorrow. – Steve Summit Apr 17 '18 at 14:26
  • Rather than a verification of the central limit theorem, I would say it is a simple test to see whether the distributions are comparably uniform. That is in a way my point: of course the distributions differ for small sample numbers, but they also differ between draws of the same system. If the distributions really differed, we should see that at the big sample numbers. – myradio Apr 17 '18 at 14:59
  • Anyway, you raised a very important point here, which is that this might depend on N; my comparison should have been made systematically over different N's to have some value. – myradio Apr 17 '18 at 14:59
  • @SteveSummit, but the test will cover that, right? Unless the instruction was "test once locally, and never again?" – SO_fix_the_vote_sorting_bug Jul 17 '23 at 20:27
  • @SO_fix_the_vote_sorting_bug My point was that "plump for `%` for its tractability" is poor advice. Many `rand` implementations are nicely random in the low-order bits, so it's not surprising that the test may succeed. But many `rand` implementations are *not* random in the low-order bits, so if you later move your code to one of those, it may fail. – Steve Summit Jul 21 '23 at 11:14
1

rand() % N is considered extremely poor not because the distribution is bad, but because the randomness is poor-to-nonexistent. (If anything the distribution will be too good.)

If N is not small with respect to RAND_MAX, both

rand() % N

and

rand() / (RAND_MAX / N + 1)

will have more or less the same, poor distribution -- certain values will occur with significantly higher probability than others.

Looking at distribution histograms won't show you that, for some implementations, rand() % N has a much, much worse problem -- to show that, you'd have to perform some correlations with previous values. (For example, try taking rand() % 2, then subtracting from the previous value you got, and plotting a histogram of the differences. If the difference is never 0, you've got a problem.)
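Here is a minimal sketch of that difference test (my addition, not part of the original answer), using the real rand():

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Histogram the differences between successive values of rand() % 2. */
int main(void){
        enum { SAMPLES = 100000 };
        int diff_counts[3] = {0};   /* indices 0..2 hold differences -1, 0, +1 */

        srand(time(NULL));
        int prev = rand() % 2;
        for(int i = 1; i < SAMPLES; ++i){
                int cur = rand() % 2;
                diff_counts[cur - prev + 1]++;
                prev = cur;
        }
        for(int d = -1; d <= 1; ++d)
                printf("difference %+d: %d\n", d, diff_counts[d + 1]);
        return 0;
}

On a healthy implementation the difference 0 shows up about half the time; on one whose low-order bit strictly alternates 0,1,0,1,... it never shows up at all.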

I would like to say that the implementations for which rand()'s low-order bits aren't random are simply buggy. I'd like to think that all those buggy implementations would have disappeared by now. I'd like to think that programmers shouldn't have to worry about calling rand()%N any more. But, unfortunately, my wishes don't change the fact that this seems to be one of those bugs that never get fixed, meaning that programmers do still have to worry.

See also the C FAQ list, question 13.16.

Steve Summit
  • But that's the reason I plot the PE as well. Shouldn't such correlation at least make a difference there? – myradio Apr 17 '18 at 15:21
  • @myradio Sorry, I don't know what you mean by PE. – Steve Summit Apr 17 '18 at 15:23
  • In any case, it's likely that the version of `rand` you tested didn't have the problem. But I believe that some, alas, still do -- and that there are still enough of them to make a difference. – Steve Summit Apr 17 '18 at 15:24
  • Sorry, I thought I had explained it. PE stands for permutation entropy (I'm editing it in now), which basically checks the occurrences of the different ordering sequences, and it's (-)1 for a uniformly distributed set of occurrences. – myradio Apr 18 '18 at 06:45