123

I was implementing a hashmap in C as part of a project I'm working on and using random inserts to test it. I noticed that rand() on Linux seems to repeat numbers far more often than on Mac. RAND_MAX is 2147483647/0x7FFFFFFF on both platforms. I've reduced it to this test program that makes a byte array RAND_MAX+1-long, generates RAND_MAX random numbers, notes if each is a duplicate, and checks it off the list as seen.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main() {
    size_t size = ((size_t)RAND_MAX) + 1;
    char *randoms = calloc(size, sizeof(char));
    int dups = 0;
    srand(time(0));
    for (int i = 0; i < RAND_MAX; i++) {
        int r = rand();
        if (randoms[r]) {
            // printf("duplicate at %d\n", r);
            dups++;
        }
        randoms[r] = 1;
    }
    printf("duplicates: %d\n", dups);
}

Linux consistently generates around 790 million duplicates. Mac consistently only generates one, so it loops through every random number that it can generate almost without repeating. Can anyone please explain to me how this works? I can't tell anything different from the man pages, can't tell which RNG each is using, and can't find anything online. Thanks!

Theron S
  • 1,343
  • 1
  • 9
  • 13
  • 1
    Try with the same seed, instead of using `time(0)`, to get comparable results. Have you already read those: https://stackoverflow.com/questions/26440252/is-rand-really-that-bad https://stackoverflow.com/questions/52869166/why-is-the-use-of-rand-considered-bad/52881465 and https://crypto.stackexchange.com/questions/15662/how-vulnerable-is-the-c-rand-in-public-cryptography-protocols ? – Bob__ Apr 24 '20 at 15:14
  • 4
    Since rand() returns values from 0..RAND_MAX inclusive, your array needs to be sized RAND_MAX+1 – Blastfurnace Apr 24 '20 at 15:19
  • @Blastfurnace you're right, was being lazy as it wouldn't let me declare an array bigger than RAND_MAX. Fixed, and still getting same results. – Theron S Apr 24 '20 at 15:27
  • 23
    You might have noticed that RAND_MAX/e ~= 790 million. Also the limit of (1-1/n)^n as n approaches infinity is 1/e. – David Schwartz Apr 24 '20 at 15:28
  • @Bob__ I have tried with identical seeds and get the same results as when seeding with `time(0)`. There is much interesting in those links but I don't see that they answer my specific question. – Theron S Apr 24 '20 at 15:31
  • 3
    @DavidSchwartz If I understand you correctly, that may explain why the number on Linux is consistently around 790 million. I guess the question then is: why/how does Mac *not* repeat that many times? – Theron S Apr 24 '20 at 15:33
  • 28
    There is no quality requirement for the PRNG in the runtime library. Only real requirement is repeatability with same seed. *Apparently, the quality of the PRNG in your linux is better than in your Mac.* – pmg Apr 24 '20 at 15:35
  • 1
    "Mac consistently only generates one" Is it the *same* one? Is it somehow related to the seed? It looks more a shuffle than a random generator. – Bob__ Apr 24 '20 at 15:37
  • @Bob__ no, the single duplicated number on Mac changes each time. – Theron S Apr 24 '20 at 15:39
  • 1
    @pmg Do you know how the Mac's PRNG can go through each possible random number without repeating? Or how I can find out which PRNG it is? I know they're different PRNGs, I'm just trying to find out how exactly. – Theron S Apr 24 '20 at 15:42
  • Don't know. Some thing along the lines of `return seed = (seed + ) % (RAND_MAX + 1);` perhaps? :) – pmg Apr 24 '20 at 15:45
  • "Mac consistently only generates one" --> It would be interesting to 1) know which value is repeated (already discussed) and 2) know which value is omitted and 3) if those 2 are related. – chux - Reinstate Monica Apr 24 '20 at 15:52
  • 1
    @chux... my guess... `0` omitted, seed repeated – pmg Apr 24 '20 at 15:55
  • 1
    @pmg Consistently omitting 0 would be _bad_. – chux - Reinstate Monica Apr 24 '20 at 15:56
  • 4
    @chux Yes, but since it's based on multiplication, the state can never be zero or the result (next state) would also be zero. Based on the source code it does check for zero as a special case if seeded with zero, but it doesn't ever produce zero as part of the sequence. – Arkku Apr 24 '20 at 16:16
  • @Arkku That's an assumption about the implementation of rand() - and apparently we are discussing at least two different PRNGs here. `man rand` says "The rand() function returns a pseudo-random integer in the range 0 to RAND_MAX inclusive (i.e., the mathematical range [0, RAND_MAX])" -- Even simple affine linear PRNGs (which Mac seems to use, according to the question) allow $0$ as value – Hagen von Eitzen Apr 26 '20 at 10:59
  • 2
    @HagenvonEitzen I'm speaking specifically about the implementation of the macOS `rand()` referenced in the question. Its source code is in my answer. – Arkku Apr 26 '20 at 11:01
  • @Arkku I stand corrected (but will leave my comment in in case others stumble the same way I did) – Hagen von Eitzen Apr 26 '20 at 11:12

4 Answers4

146

While at first it may sound like the macOS rand() is somehow better for not repeating any numbers, one should note that with this amount of numbers generated it is expected to see plenty of duplicates (in fact, around 790 million, or (231-1)/e). Likewise iterating through the numbers in sequence would also produce no duplicates, but wouldn't be considered very random. So the Linux rand() implementation is in this test indistinguishable from a true random source, whereas the macOS rand() is not.

Another thing that appears surprising at first glance is how the macOS rand() can manage to avoid duplicates so well. Looking at its source code, we find the implementation to be as follows:

/*
 * Compute x = (7^5 * x) mod (2^31 - 1)
 * without overflowing 31 bits:
 *      (2^31 - 1) = 127773 * (7^5) + 2836
 * From "Random number generators: good ones are hard to find",
 * Park and Miller, Communications of the ACM, vol. 31, no. 10,
 * October 1988, p. 1195.
 */
    long hi, lo, x;

    /* Can't be initialized with 0, so use another value. */
    if (*ctx == 0)
        *ctx = 123459876;
    hi = *ctx / 127773;
    lo = *ctx % 127773;
    x = 16807 * lo - 2836 * hi;
    if (x < 0)
        x += 0x7fffffff;
    return ((*ctx = x) % ((unsigned long) RAND_MAX + 1));

This does indeed result in all numbers between 1 and RAND_MAX, inclusive, exactly once, before the sequence repeats again. Since the next state is based on multiplication, the state can never be zero (or all future states would also be zero). Thus the repeated number you see is the first one, and zero is the one that is never returned.

Apple has been promoting the use of better random number generators in their documentation and examples for at least as long as macOS (or OS X) has existed, so the quality of rand() is probably not deemed important, and they've just stuck with one of the simplest pseudorandom generators available. (As you noted, their rand() is even commented with a recommendation to use arc4random() instead.)

On a related note, the simplest pseudorandom number generator I could find that produces decent results in this (and many other) tests for randomness is xorshift*:

uint64_t x = *ctx;
x ^= x >> 12;
x ^= x << 25;
x ^= x >> 27;
*ctx = x;
return (x * 0x2545F4914F6CDD1DUL) >> 33;

This implementation results in almost exactly 790 million duplicates in your test.

Arkku
  • 41,011
  • 10
  • 62
  • 84
  • 6
    A [journal article](https://dl.acm.org/doi/abs/10.1016/0167-6377%2886%2990092-1) published in the 1980's proposed a statistical test for PRNGs based on "birthday problem". – pjs Apr 24 '20 at 17:37
  • 17
    "Apple has been promoting the use of better random number generators in their documentation" --> of course Apple could employ `arc4random()` like code behind `rand()` and get a good `rand()` result. Rather than try to steer programmers to code differently, just create better library functions. "they've just stuck" is their choice. – chux - Reinstate Monica Apr 24 '20 at 18:16
  • 2
    @chux-ReinstateMonica Schrage created a portable LCG implementation with multiplier 16807 back around 1980 which showed how to handle the hi bits and lo bits separately to avoid issues with overflow. It was very popular throughout the 80's, alongside the rising popularity of C, and became known as [MINSTD](https://en.wikipedia.org/wiki/Lehmer_random_number_generator#Parameters_in_common_use). Since `rand` has no explicit required implementation, keeping MINSTD around for backwards compatibility seems reasonable to me given that they also promote better choices. In other words, history! – pjs Apr 24 '20 at 19:49
  • 3
    @chux-ReinstateMonica: Changing `rand()` *now* would break the guarantee that srand(x) leads to the same random sequence every time, on the same system. (If there is such a guarantee. It would at least be a potential compat problem for any programs that relied on it working the same across different Mac systems. That sounds like a pretty silly thing for a program to do, but I wouldn't be surprised if there are some. Or at least for convenience in debugging across systems.) Without an API to seed `rand()` with more than a 32-bit value, it can't be truly good so there's some justification. – Peter Cordes Apr 25 '20 at 04:18
  • 25
    the lack of a constant offset in mac's `rand()` makes it so bad that it isn't useful for practical usage: [Why does rand() % 7 always return 0?](https://stackoverflow.com/q/7866754/995714), [Rand() % 14 only generates the values 6 or 13](https://stackoverflow.com/q/20263187/995714) – phuclv Apr 25 '20 at 05:47
  • 6
    @PeterCordes: There is such a requirement on `rand`, that re-running it with the same seed produce the same sequence. OpenBSD's `rand` is broken and does not obey this contract. – R.. GitHub STOP HELPING ICE Apr 25 '20 at 05:55
  • 9
    @R..GitHubSTOPHELPINGICE Do you see a C requirement that `rand()` with the same seed produce the same sequence on between different versions of the library? Such a guarantee might be useful for regression testing between library versions, yet I find no C requirement for it. – chux - Reinstate Monica Apr 25 '20 at 07:54
  • Of course you only get duplicates from xorshift* because you're using a 64bit generator. If you used a 32bit generator then it wouldn't give any duplicates until the full sequence had been generated, 2^32 - 1 numbers later. – curiousdannii Apr 25 '20 at 10:46
  • @curiousdannii I tried with the 32-bit xorshift from the Wikipedia page, both by discarding the lowest bit and by zeroing the highest bit (in order to get a 31-bit result), and it results in about 536 million duplicates either way. Admittedly I am not an expert in xorshift generators, but my understanding is that the lowest bits are not "as random", so discarding some amount of bits seems like the correct approach anyway. – Arkku Apr 25 '20 at 12:26
  • 1
    @Arkku I don't think there's any bit bias in xorshift generators, but discarding bits will give you duplicates of course. – curiousdannii Apr 25 '20 at 13:18
  • @curiousdannii "XorShift* 64/32 are essentially the exact same generator, yet the former fails the test suite and the latter passes. The difference is that the former markets itself as a 64-bit generator and fails because its low-order bits are weak, whereas the latter only returns the top 32 bits." (https://www.pcg-random.org/pdf/hmc-cs-2014-0905.pdf) – Arkku Apr 25 '20 at 13:22
  • 2
    @Arkku Okay, it may be weaker, but I was referring to an actual 32 but Xorshift generator which won't give you any duplicates. – curiousdannii Apr 25 '20 at 13:46
  • 1
    @Arkku A good 64 bit generator has a cycle length O(2^64). If you only report 32 of the bits, by the pigeonhole principle it's definitely going to repeat individual values even though subsequences won't be repeated until you've gone full cycle. Lack of repeats occurs when a generator reports its entire state space and has maximal cycle length. That applies to both LCGs and xorshifts, and a solution to failing the birthday test is to construct a generator that has a larger state than gets reported. – pjs Apr 26 '20 at 00:42
  • @pjs Yes, so the 31-bit output from a 64- or 32-bit state is indeed just such a solution. =) – Arkku Apr 26 '20 at 07:45
  • plain xorshift should only produce 1 duplicate. – S.S. Anne Apr 26 '20 at 15:33
  • 2
    @S.S.Anne Only if it is a 31-bit plain xorshift. (Anyway, the desired outcome was to produce the expected 790 million duplicates, so the only reason I mentioned xorshift at all was that the truncated 64-bit xorshift* is a simple prng that meets this, and many other, criteria for randomness.) – Arkku Apr 26 '20 at 15:34
  • Yeah. 64-bit plain xorshift should produce about 2^32+2^31 or something, unless you take the high bits instead. – S.S. Anne Apr 26 '20 at 15:37
43

MacOS provides an undocumented rand() function in stdlib. If you leave it unseeded, then the first values it outputs are 16807, 282475249, 1622650073, 984943658 and 1144108930. A quick search will show that this sequence corresponds to a very basic LCG random number generator that iterates the following formula:

xn+1 = 75 · xn (mod 231 − 1)

Since the state of this RNG is described entirely by the value of a single 32-bit integer, its period is not very long. To be precise, it repeats itself every 231 − 2 iterations, outputting every value from 1 to 231 − 2.

I don't think there's a standard implementation of rand() for all versions of Linux, but there is a glibc rand() function that is often used. Instead of a single 32-bit state variable, this uses a pool of over 1000 bits, which to all intents and purposes will never produce a fully repeating sequence. Again, you can probably find out what version you have by printing the first few outputs from this RNG without seeding it first. (The glibc rand() function produces the numbers 1804289383, 846930886, 1681692777, 1714636915 and 1957747793.)

So the reason you're getting more collisions in Linux (and hardly any in MacOS) is that the Linux version of rand() is basically more random.

r3mainer
  • 23,981
  • 3
  • 51
  • 88
  • 5
    an unseeded `rand()` must behave like one with `srand(1);` – pmg Apr 24 '20 at 16:02
  • 5
    The source code for the `rand()` in macOS is available: https://opensource.apple.com/source/Libc/Libc-1353.11.2/stdlib/FreeBSD/rand.c FWIW, I ran the same test against this compiled from the source and it does indeed result in only one duplicate. Apple has been promoting the use of other random number generators (such as `arc4random()` before Swift took over) in their examples and documentation, so the use of `rand()` is probably not very common in native apps on their platforms, which may explain why it's not better. – Arkku Apr 24 '20 at 16:09
  • Thanks for the reply, that answers my question. And a period of (2^31)-2 explains why it would start repeating right at the end like I observed. You (@r3mainer) said `rand()` was undocumented, but @Arkku has provided a link to the apparent source. Do either of you know why I can't find that file on my system, and why I see only `int rand(void) __swift_unavailable("Use arc4random instead.");` in Mac's `stdlib.h`? I suppose the code @Arkku linked to is just compiled into... what library? – Theron S Apr 24 '20 at 16:27
  • 1
    @TheronS It is compiled into the C library, libc, `/usr/lib/libc.dylib`. =) – Arkku Apr 24 '20 at 16:28
  • 5
    Which version of `rand()` a given C program uses is not determined by the "compiler" or the "operating system", but rather the implementation of the C standard library (e.g., `glibc`, `libc.dylib`, `msvcrt*.dll`). – Peter O. Apr 24 '20 at 17:05
14

rand() is defined by the C standard, and the C standard does not specify which algorithm to use. Obviously, Apple is using an inferior algorithm to your GNU/Linux implementation: The Linux one is indistinguishable from a true random source in your test, while the Apple implementation just shuffles the numbers around.

If you want random numbers of any quality, either use a better PRNG that gives at least some guarantees on the quality of the numbers it returns, or simply read from /dev/urandom or similar. The later gives you cryptographic quality numbers, but is slow. Even if it is too slow by itself, /dev/urandom can provide some excellent seeds to some other, faster PRNG.

a2800276
  • 3,272
  • 22
  • 33
cmaster - reinstate monica
  • 38,891
  • 9
  • 62
  • 106
  • Thanks for the reply. I don't actually need a good PRNG, was just concerned that there was some undefined behavior lurking in my hashmap, then got curious when I eliminated that possibility and the platforms still behaved differently. – Theron S Apr 24 '20 at 16:36
  • btw here's an example of a cryptographically secure random number generator: https://github.com/divinity76/phpcpp/commit/a6b6d581017617cc03a2c88a19ae5f0d531bb190 - but it's C++ instead of C and i'm letting the STL implementors do all the heavy lifting.. – hanshenrik Apr 25 '20 at 10:17
  • 3
    @hanshenrik A crypto RNG is generally overkill & too slow for a simple hash table. – PM 2Ring Apr 25 '20 at 20:32
  • 1
    @PM2Ring Absolutely. A hash table hash primarily needs to be fast, not good. However, if you want to develop a hash table algorithm that's not just fast but also decent, I believe it's beneficial to know some of the tricks of cryptographic hash algorithms. It'll help you avoid most of the most glaring mistakes that riddle most fast hash algorithms. Nevertheless, I would not have advertised for a specific implementation here. – cmaster - reinstate monica Apr 25 '20 at 22:47
  • @cmaster True enough. It's certainly a good idea to know a bit about things like [mixing functions](https://nullprogram.com/blog/2018/07/31/) and the [avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect). Fortunately there are non-crypto hash functions with good properties that don't sacrifice too much speed (when implemented correctly), eg xxhash, murmur3, or siphash. – PM 2Ring Apr 25 '20 at 23:03
11

In general, the rand/srand pair has been considered sort of deprecated for a long time due to low-order bits displaying less randomness than high-order bits in the results. This may or may not have anything to do with your results, but I think this is still a good opportunity to remember that even though some rand/srand implementations are now more up to date, older implementations persist and it's better to use random(3). On my Arch Linux box, the following note is still in the man page for rand(3):

  The versions of rand() and srand() in the Linux C Library use the  same
   random number generator as random(3) and srandom(3), so the lower-order
   bits should be as random as the higher-order bits.  However,  on  older
   rand()  implementations,  and  on  current implementations on different
   systems, the lower-order bits are much less random than the  higher-or-
   der bits.  Do not use this function in applications intended to be por-
   table when good randomness is needed.  (Use random(3) instead.)

Just below that, the man page actually gives very short, very simple example implementations of rand and srand that are about the simplest LC RNGs you've ever seen and having a small RAND_MAX. I don't think they match what's in the C standard library, if they ever did. Or at least I hope not.

In general, if you're going to use something from the standard library, use random if you can (the man page lists it as POSIX standard back to POSIX.1-2001, but rand is standard way back before C was even standardized). Or better yet, crack open Numerical Recipes (or look for it online) or Knuth and implement one. They're really easy and you only really need to do it once to have a general purpose RNG with the attributes you most often need and which is of known quality.

Thomas Kammeyer
  • 4,457
  • 21
  • 30
  • Thanks for the context. I don't actually need high-quality randomness, and have implemented MT19937, though in Rust. Was mostly just curious about how to find out why the two platforms behaved differently. – Theron S Apr 24 '20 at 16:34
  • 1
    Sometimes the best questions are asked out of simple interest instead of strict need -- it seems like those are often the ones that beget a suite of good answers from a specific point of curiosity. Yours is one of them. Here's to all the curious people, the real and original hackers. – Thomas Kammeyer Apr 24 '20 at 17:55
  • It's funny that the advice was to "stop using rand()" instead of making rand() better. Nothing in the standard ever says that it has to be a specific generator. – pipe Apr 25 '20 at 12:44
  • 4
    @pipe If making `rand()` ‘better’ would mean making it slower (which it probably would — cryptographically-secure random numbers take a lot of effort), then it's probably better to keep it fast even if marginally more predictable.  Case in point: we had a production application that took ages to start up, which we traced to a RNG whose initialisation needed to wait for sufficient entropy to be generated…  Turned out it didn't need to be so secure, so replacing it with a ‘worse’ RNG was a big improvement. – gidds Apr 25 '20 at 15:26