2

I'm running the following program:

#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;

const int N = 200;          // Number of tests.
const int M = 2000000;      // Number of pseudo-random values generated per test.
const int VALS = 2;         // Number of possible values (values from 0 to VALS-1).
const int ESP = M / VALS;   // Expected number of appearances of each value per test.

int main() {
    for (int i = 0; i < N; ++i) {
        unsigned seed = chrono::system_clock::now().time_since_epoch().count();
        srand(seed);
        vector<int> hist(VALS, 0);
        for (int j = 0; j < M; ++j) ++hist[rand() % VALS];
        int Y = 0;
        for (int j = 0; j < VALS; ++j) Y += abs(hist[j] - ESP);
        cout << Y << endl;
    }
}

This program performs N tests. In each test we generate M numbers between 0 and VALS-1 while we keep counting their appearances in a histogram. Finally, we accumulate in Y the errors, which correspond to the difference between each value of the histogram and the expected value. Since the numbers are generated randomly, each of them would ideally appear M/VALS times per test.

After running my program I analysed the resulting data (i.e., the 200 values of Y) and I realised that some things where happening which I can not explain. I saw that, if the program is compiled with vc++ and given some N and VALS (N = 200 and VALS = 2 in this case), we get different data patterns for different values of M. For some tests the resulting data follows a normal distribution, and for some tests it doesn't. Moreover, this type of results seem to altern as M (the number of pseudo-random values generated in each test) increases:

  • M = 10K, data is not normal:

enter image description here enter image description here

  • M = 100K, data is normal:

enter image description here enter image description here

  • and so on:

enter image description here enter image description here

enter image description here enter image description here

enter image description here enter image description here

As you can see, depending on the value of M the resulting data follows a normal distribution or otherwise follows a non-normal distribution (bimodal, dog food or kind of uniform) in which more extreme values of Y have greater presence.

This diversity of results doesn't occur if we compile the program with other C++ compilers (gcc and clang). In this case, it looks like we always obtain a half-normal distribution of Y values:

enter image description here enter image description here

What are your thoughts on this? What is the explanation?

I carried out the tests through this online compiler: http://rextester.com/l/cpp_online_compiler_visual

Manelicus
  • 61
  • 7
  • 3
    [rand is known to have bad properties that is why it was removed in C++17](https://stackoverflow.com/a/26853142/1708801). How bad it is varies from implementation. – Shafik Yaghmour May 10 '18 at 23:11
  • OK but what about my question. – Manelicus May 10 '18 at 23:13
  • `rand()` on [MSVC uses an extremely poor PRNG](https://stackoverflow.com/questions/6793065/understanding-the-algorithm-of-visual-cs-rand-function). It's a linear congruential generator (very simple but not very random), and outputs only 16 bit numbers. – Daniel May 10 '18 at 23:13
  • 3
    Run this test suit over `std::rand` (MSVC version) https://webhome.phy.duke.edu/~rgb/General/dieharder.php And find that you don't really have a random number generator. So it's a case of garbage in etc... – Richard Critten May 10 '18 at 23:16

1 Answers1

0

The program will generate poorly distributed random numbers (not uniform, independent).

  1. The function rand is a notoriously poor one.
  2. The use of the remainder operator % to bring the numbers into range effectively discards all but the low-order bits.
  3. The RNG is re-seeded every time through the loop.

[edit] I just noticed const int ESP = M / VALS;. You want a floating point number instead.

Try the code below and report back. Using the new &LT;random> is a little tedious. Many people write some small library code to simplify its use.

#include <iostream>
#include <vector>
#include <cmath>
#include <random>
#include <chrono>
using namespace std;

const int N = 200;          // Number of tests.
const int M = 2000000;      // Number of pseudo-random values generated per test.
const int VALS = 2;         // Number of possible values (values from 0 to VALS-1).
const double ESP = (1.0*M)/VALS; // Expected number of appearances of each value per test.

static std::default_random_engine engine;

static void seed() {
    std::random_device rd;
    engine.seed(rd());
}
static int rand_int(int lo, int hi) {
    std::uniform_int_distribution<int> dist (lo, hi - 1);
    return dist(engine);
}
int main() {
    seed();
    for (int i = 0; i < N; ++i) {
        vector<int> hist(VALS, 0);
        for (int j = 0; j < M; ++j) ++hist[rand_int(0, VALS)];
        int Y = 0;
        for (int j = 0; j < VALS; ++j) Y += abs(hist[j] - ESP);
        cout << Y << endl;
    }
}
Jive Dadson
  • 16,680
  • 9
  • 52
  • 65
  • Are you still with us? – Jive Dadson May 11 '18 at 02:21
  • Thanks for your remarks and code. The uniform_int_distribution is lacking its data type declaration. The data produced by your program follows some kind of half-normal distribution (as expected). My question remains still unanswered. Even if the flaws of rand() mattered, It'd still be interesting to know 1. why they matter in this case and how do they effect the data and 2. why the output data depends on the compiler. – Manelicus May 11 '18 at 02:33
  • I does answer the question, "What are your thoughts on this, and what is the explanation?" The explanation is three-part. As to why different compilers give differing results, put that down to different implementations of `rand`. The VC++ implementation is probably worse than most. – Jive Dadson May 11 '18 at 02:39
  • Why the flaws in rand() matter: Garbage-in, garbage out. An in-depth explanation is far too involved to go into here. See this: https://www.scribd.com/document/33641002/Random-Numbers-Fall-Mainly-in-the-Planes The default type for `uniform_int_distribution` is `int`. Take the gift horse. – Jive Dadson May 11 '18 at 02:43
  • BTW, the paper by Marsaglia is more evidence that everything about software engineering was discovered in 1968. – Jive Dadson May 11 '18 at 02:50
  • Are you still here? Do you want my help in the future? – Jive Dadson May 11 '18 at 03:33