I'd be tempted to rely on the fact that the gaps between consecutive elements of a sorted set of uniformly distributed variables behave like exponentially distributed variables: normalise the running sums of N+1 exponentials and you get N sorted uniforms. This can be exploited to run in O(N) time rather than O(N*log N).
A quick implementation would do something like:
#include <random>
#include <vector>

template<typename T> void
computeSortedUniform2(std::vector<T>& elements)
{
    std::random_device rd;
    std::mt19937 prng(rd());
    std::exponential_distribution<T> dist(static_cast<T>(1));

    // accumulate exponential gaps: element i ends up holding the sum of the first i+1 gaps
    T sum = 0;
    for (auto& elem : elements) {
        elem = sum += dist(prng);
    }
    // one final gap keeps the largest element strictly below 1 after normalisation
    sum += dist(prng);

    for (auto& elem : elements) {
        elem /= sum;
    }
}
This example is simplified by assuming you want values in Uniform(0, 1), but it should be easy to generalise (e.g. for Uniform(a, b) you'd just rescale each value to a + (b - a) * x). Making this work using OMP isn't quite trivial, but shouldn't be too hard.
If you care about the last ~50% of performance there are some numeric tricks that might speed up generating random deviates (e.g. there are faster and better PRNGs than the MT) as well as converting them to doubles (but recent compilers might know about these tricks). A couple of references: Daniel Lemire's blog and Melissa O'Neill's PCG site.
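As a rough illustration of the sort of trick those references describe (using numpy's PCG64 here purely because it's handy; the specifics are my own sketch rather than anything from those posts):

import numpy as np

# pull raw 64-bit words from a PCG64 generator and keep the top 53 bits as the
# mantissa; multiplying by 2**-53 maps them onto evenly spaced doubles in [0, 1)
bits = np.random.PCG64(12345).random_raw(8)
u = (bits >> np.uint64(11)) * 2.0**-53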
I've just benchmarked the algorithm above and discovered that clang's std::uniform_real_distribution and std::exponential_distribution are both very slow. numpy's Ziggurat-based implementations are 8 times faster, such that I can generate 1e9 doubles in ~10 seconds using a single thread on my laptop (i.e. the std implementations take ~80 seconds). I've not tried the OP's implementation on 1e9 elements, but with 1e8 elements mine is ~15 times faster.
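For reference, a vectorised numpy version of the same algorithm might look something like the sketch below (the function name and use of default_rng are illustrative choices, not the exact code I benchmarked):

import numpy as np

def sorted_uniform_numpy(n, rng=None):
    """O(n) sorted uniforms via normalised cumulative sums of exponential gaps."""
    rng = np.random.default_rng() if rng is None else rng
    gaps = rng.exponential(size=n + 1)  # n+1 gaps give n points strictly inside (0, 1)
    sums = np.cumsum(gaps)
    return sums[:-1] / sums[-1]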
Inspired by AndrisBirkmanis's comments, I had a go at creating a version that would permit streaming and/or parallelisation. The idea is to split the n items up into k chunks. An initial amount of work has to be done (O(k)) to figure out what the values are at these chunk boundaries, and then these chunks can be processed independently. This would mean you could stream chunks back or farm them out via OMP. I wrote the code in Python, but my translation to C++ was more than twice as long so am leaving it out.
import itertools  # pairwise requires Python >= 3.10
import numpy as np

def stream_sorted_uniform(n, *, chunksize=1024):
    "iterate over n sorted numbers drawn uniformly from [0, 1]"
    # this code works by generating chunks, first step is to chunk up the range
    # values stores variates at chunk boundaries, blocks specify indices of variates
    blocks = [0]
    # these are unnormalised for now
    values = [np.random.exponential()]
    for i in range(0, n - 1, chunksize):
        # index of last value in this chunk (stopping at n-1 avoids empty chunks)
        j = min(n - 1, i + chunksize)
        # a gamma with shape=k is equivalent to the sum of k exponentials
        blocks.append(j)
        values.append(np.random.gamma(j - i))
    # normalise so chunk values are in [0, 1] with appropriate gaps at either end
    values = np.cumsum(values)
    values /= values[-1] + np.random.exponential()
    # output first element
    yield values[0]
    # elements in this loop can be done in parallel
    for [i, mn], [j, mx] in itertools.pairwise(zip(blocks, values)):
        # generate subsequent variates for this chunk
        x = np.cumsum(np.random.exponential(size=j - i))
        # rescale to the given range, would be nice if numpy exposed an FMA operator!
        x *= (mx - mn) / x[-1]
        x += mn
        # output elements [i+1..j]
        yield from x
If you're running in a REPL, you could test via:
print(list(stream_sorted_uniform(12, chunksize=4)))
To make sure it's doing the right thing, I ran stream_sorted_uniform(301, chunksize=35) a million times and generated a 2D histogram over the quantiles, comparing it to a naive np.sort(np.random.uniform(size=301)), which produces:

[2D histograms of value vs. order-statistic index for the streaming algorithm and the naive sorted-uniform draw]
I'm actually showing the square root of the histogram counts, as otherwise the scale makes things look very boring. But for reference, the midpoints (i.e. index 150, the 50th percentile) have unscaled counts of 46305 and 46216, which looks like the expected Poisson noise from any randomised algorithm.
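If you want to reproduce that check, something along these lines should work (a rough sketch: quantile_histogram and the choice of 64 bins are illustrative, and a million repetitions will take a while):

import numpy as np

def quantile_histogram(draw, n, reps=1_000_000, bins=64):
    """2D histogram of (order-statistic index, value) pairs over many draws."""
    hist = np.zeros((n, bins), dtype=np.int64)
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = np.arange(n)
    for _ in range(reps):
        x = np.asarray(list(draw(n)))
        # bucket each order statistic's value; clip guards against x == 1.0 exactly
        cols = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        hist[rows, cols] += 1
    return hist

h_stream = quantile_histogram(lambda n: stream_sorted_uniform(n, chunksize=35), 301)
h_naive = quantile_histogram(lambda n: np.sort(np.random.uniform(size=n)), 301)
# plot np.sqrt(h_stream) and np.sqrt(h_naive) side by side to compare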