1

I am trying to find a fast way to produce normally distributed random floating-point numbers between 0 and 1, but every implementation I have seen simply produces random numbers, which means duplicate values are possible.

My restriction is that I want to produce millions of numbers, let's say 8M or 16M etc., and avoid duplicates.

The C++ std::normal_distribution is a random generator as well.

Is there any implementation of what I seek, or do I have to check every time whether the produced value already exists (something that would really slow down the whole program, as we are talking about millions of numbers)?

I know that a normal distribution implies duplicates; that is why I have opened this question.

BugShotGG
  • If you can't have duplicates then you won't get a normal distribution. – juanchopanza Oct 06 '14 at 09:33
  • What are you trying to achieve? You can have two distinct numbers that differ only by 2^(-50); after a couple of operations that need to round, they may well become identical. Plus, since there are so many `double`s between 0 and 1 (or even just within one standard deviation), you'll probably never see two bit-identical outputs among millions. –  Oct 06 '14 at 09:36
  • Quite apart from juanchopanza's point, it sounds very much like "would really slow down the whole program" is *speculation* preceding profiling... do you realise how many distinct numbers `double` can represent between 0 and 1? So, given only "millions of numbers", the duplicates should be infrequent, and the time to check for them not that big compared to the time to generate random numbers anyway. Try an `unordered_set already_seen` and *then* post a question if it's not fast enough for your needs. – Tony Delroy Oct 06 '14 at 09:40
  • @juanchopanza: If the floating point type has infinite precision, of course one can avoid duplicates and still get normal distributions, but that's not realistic. Either way, I guess OP's requirement is some XY problem, since equality isn't that obvious for floating point numbers. – Benjamin Bannier Oct 06 '14 at 09:50
  • @BenjaminBannier Mathematically, as soon as you ask for no duplicates you lose normality. Of course, we're dealing with pseudo-random number generators and limited-precision FP numbers, so whether this really makes a difference is another issue. – juanchopanza Oct 06 '14 at 09:53
  • @BenjaminBannier You got my point. – BugShotGG Oct 06 '14 at 09:54

4 Answers

2

I'd solve this problem using a std::unordered_set to check for already generated numbers. Since it's based on a hash table, it has expected constant time for both checking and inserting, giving expected linear time in the number N of values to be generated.

Generic solution which works with any distribution:

#include <unordered_set>

template <typename T, typename Dist, typename Gen>
std::unordered_set<T> unique_generate(Dist &&dist, Gen &&generator, size_t N)
{
    std::unordered_set<T> generated;
    while (generated.size() < N)
        generated.insert(dist(generator));
    return generated;
}

Usage with normal_distribution:

std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution<double> d(meanValue, stdDevValue);
int N = 1000000;

auto myNumbers = unique_generate<double>(d, gen, N);

To also enforce the numbers to be in the interval [0, 1], you can wrap the distribution object using a generic wrapper class ("separation of concerns": don't mix the unique generation with the distribution clamping).

A possible (possibly slow*) implementation throws away generated numbers which are out of bounds:

template<typename Dist, typename T>
class ClampedDistribution {
    Dist dist;
    T min, max;
public:
    ClampedDistribution(Dist dist, T min, T max) :
        dist(dist), min(min), max(max)
    {}

    template <typename Gen>
    auto operator()(Gen & generator) -> decltype(dist(generator)) {
        auto value = dist(generator);
        while (value > max || value < min)
            value = dist(generator);
        return value;
    }
};

// type-deducing function:
template<typename Dist, typename T>
ClampedDistribution<Dist,T> clamped(Dist dist, T min, T max) {
    return ClampedDistribution<Dist,T>(dist, min, max);
}

Usage:

// (from above)
std::normal_distribution<double> d(meanValue, stdDevValue);

// clamp it:
auto clamped_dist = clamped(d, 0.0, 1.0);

// and pass this to unique_generate:
auto myNumbers = unique_generate<double>(clamped_dist, gen, N);

*) It's slow if you choose a high standard deviation for your normal distribution. It's reasonably fast for small deviations though, as the numbers chosen by the normal distribution are more likely to be in the range already.

leemes
  • Thanks for the reply. I am trying to test this. Had to update it for C++11, etc. – BugShotGG Oct 06 '14 at 10:19
  • Okay. I assumed you have C++11 available since you already mentioned the `std::normal_distribution` (which was introduced in C++11). – leemes Oct 06 '14 at 19:29
  • Any idea why I get `error C2039: 'unordered_set' : is not a member of 'std'`? I included `` using C++11, but to no avail... – BugShotGG Oct 10 '14 at 14:46
  • @GeoPapas You included the wrong header :) `#include <unordered_set>` (set, no map) – leemes Oct 10 '14 at 22:23
  • In which line do you get the error? Sorry for the missing `;` :) – leemes Oct 12 '14 at 01:37
  • I asked about the line number... Without knowing that, the error message doesn't help me to help you... – leemes Oct 12 '14 at 21:47
  • I think I have a wrong version of C++11 for VS2012. I am using this http://www.microsoft.com/en-us/download/details.aspx?id=35515 since I didn't find any newer version to install. I guess it's a known issue http://stackoverflow.com/questions/7421825/c11-features-in-visual-studio-2012 – BugShotGG Oct 13 '14 at 08:35
1

My math is a little rusty so I hope I didn't make any horrible mistakes.

The approach I took was to step a variable between -1 and +1. I then evaluate the normal distribution curve at each value and compare it against a random number in [0, 1] to decide whether or not to include that value in the output. Thus the closer we get to the mean, the more values get included, with no duplicates.

After generating the numbers and storing them in a std::vector I perform a random shuffle:

#define _USE_MATH_DEFINES // for M_PI and M_E on MSVC; must come before <cmath>
#include <cmath>
#include <random>
#include <vector>
#include <iostream>
#include <algorithm>

double normal(double x, const double mu, const double sigma)
{
    double fac = 1 / (sigma * sqrt(2 * M_PI));
    double exp = pow(x - mu, 2) / (2 * sigma * sigma);
    return fac * pow(M_E, -exp);
}

// res = resolution (distance between samples) [res < 1]
std::vector<double> generate(double res, const double mu, const double sigma)
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0, 1);

    std::vector<double> values;

    for(double x = -1; x < 1; x += res)
        if(dis(gen) < normal(x, mu, sigma))
            values.push_back(x);

    std::shuffle(values.begin(), values.end(), gen);

    return values;
}

int main()
{
    std::vector<double> values = generate(0.000001, 0, 1);

    std::cout << values.size() << '\n';
    for(auto v: values)
        std::cout << v << '\n';
}

EDIT: Improved version:

Added parametrization of the range. Improved yield.

// normal probability density function
double normal_pdf(const double x, const double mu, const double sigma)
{
    double fac = 1 / (sigma * sqrt(2 * M_PI));
    double exp = pow(x - mu, 2) / (2 * sigma * sigma);
    return fac * pow(M_E, -exp);
}

/**
 * Randomly generate unique values between [i0, i1)
 * with a normal distribution.
 *
 * @param i0 The lower, inclusive bound of the range
 * of the generated values.
 *
 * @param i1 The upper, exclusive bound of the range
 * of the generated values.
 *
 * @param res The resolution. The size between samples when
 * calculating the values (< 1).
 *
 * @param mu The mean value of the distribution PDF
 *
 * @param sigma The standard deviation of the PDF.
 *
 * @return A std::vector<double> containing the generated
 * values.
 */
std::vector<double> generate(const double i0, const double i1
    , const double res, const double mu, const double sigma)
{
    std::random_device rd;
    std::mt19937 gen(rd());

    std::vector<double> values;

    double maximum = normal_pdf(mu, mu, sigma);

    std::uniform_real_distribution<> dis(0, maximum);

    for(double x = i0; x < i1; x += res)
        if(dis(gen) < normal_pdf(x, mu, sigma))
            values.push_back(x);

    std::shuffle(values.begin(), values.end(), gen);

    return values;
}

int main()
{
    std::vector<double> values = generate(0, 1, 0.01, 0.5, 1);

    std::cout << values.size() << '\n';
    for(auto v : values)
        std::cout << v << '\n';
}
Galik
  • Your solution seems to work well but I don't see where you define how many points to generate? Does the `res` variable imply it somehow? – BugShotGG Oct 09 '14 at 15:52
  • @GeoPapas There is no way I could think of to generate a fixed number of points (without skewing the distribution). The `res` value will strongly affect the number of points generated, and I am sure a formula could be produced to tell you the expected number of points produced. What you could do is *overproduce* more points than you need and erase everything from the `std::vector` beyond that point (after the random shuffle). – Galik Oct 09 '14 at 19:05
  • It would work to use `generate(1/(N+1), 0, 1);` as a parameter for `res`. In that way it will always produce values from 0 to 1 with a fitting step. And an if check could end the loop. – BugShotGG Oct 09 '14 at 19:34
  • @GeoPapas I don't think you can end the loop without skewing the distribution. That's why I suggested erasing the extra points only after the data has been randomly shuffled. – Galik Oct 09 '14 at 20:15
  • @GeoPapas Also I realized that my example produces numbers between [-1,1] when you asked for [0,1] (the range covered by the for loop). – Galik Oct 09 '14 at 20:19
  • Would it work with a `res` as I described? Wouldn't it give a uniform distribution in that way? – BugShotGG Oct 10 '14 at 10:16
  • @GeoPapas I think `res` would need to be considerably smaller than that to produce enough random samples. There should be a mathematical way of calculating what it would need to be. I did have a brief go but my math is way too rusty. The other option is trial and error, which may produce a reasonable `factor` that would work for the range of values you are wanting (I doubt the relationship is actually linear). – Galik Oct 10 '14 at 10:44
  • When I say "trial and error" I mean run the function with a range of values to help you calculate what `res` needs to be. – Galik Oct 10 '14 at 10:45
  • @GeoPapas Whilst reviewing I just realized a way I can increase the number of generated samples in the range. The normal PDF will likely be lower than 1.0 at its maximum, so my selecting random numbers between [0,1] is wasteful. I can calculate the maximum and choose random numbers between [0,PDF(max)] to give a much better yield. I will post the new code soon. – Galik Oct 10 '14 at 10:54
  • @GeoPapas Just to explain why `res = 1/(N+1)` doesn't give you the quantity you want. That number does cause the function to iterate over the number of samples you need. However those samples are only included on a random basis. So the number of samples that get included will always be less than needed. For this reason you need to choose `res` such that it over-produces. – Galik Oct 10 '14 at 11:20
  • Yes I tried this and it seems not to work. It generates more like a uniform distribution if we use `res = 1/(N+1)`. I can't believe that something so simple would be that hard to implement though. lol – BugShotGG Oct 10 '14 at 12:02
  • @GeoPapas The higher the SD (sigma), the more it looks normal, I think. I seem to be getting a nice normal curve using `sigma = 0.1` & `res = 0.00001`, giving approximately `25150` values. – Galik Oct 10 '14 at 13:04
0

A mechanical answer (leaving aside whether "normal without duplicates" makes sense):

std::mt19937 rng(std::random_device{}());
std::normal_distribution<double> dist(mu, sigma); // mu, sigma: your parameters
std::set<double> data; // std::set rejects duplicates (exact comparison of doubles)
while (data.size() < n) // n: the number of unique values you need
    data.insert(dist(rng));
bobah
  • I guess the downvote was because this has complexity `O(n log n)` instead of the version with `unordered_hash` which is (expected) linear time. – leemes Oct 06 '14 at 19:30
  • worst-case complexity of a single hash map lookup is `O(n)`, which is not always tolerable – bobah Oct 06 '14 at 22:05
  • That's why I said *expected*, which is constant time per lookup / insertion. – leemes Oct 08 '14 at 07:58
-1

A normal distribution means that you may get many duplicates near the expected value. Your constraint makes more sense for a uniform distribution.

gsa
  • "Near each other" and "equal" are different concepts. – Benjamin Bannier Oct 06 '14 at 09:50
  • It is a question of terminology after all. If the experiment is "whether a particular keyboard key has been pressed at least once", then what the author requests may be viable test data. – bobah Oct 06 '14 at 09:52
  • This is a discrete distribution, not a continuous one, so you can get equal numbers; I think that's acceptable. – gsa Oct 06 '14 at 09:57