I'd be tempted to rely on the fact that the gaps between consecutive elements of a sorted set of uniformly distributed variables behave like exponentially distributed variables: normalise the running sums of N+1 exponentials and you get N sorted uniforms. This can be exploited to run in O(N) time rather than O(N*log N).
A quick implementation would do something like:
#include <random>
#include <vector>

template<typename T> void
computeSortedUniform2(std::vector<T>& elements)
{
    std::random_device rd;
    std::mt19937 prng(rd());
    std::exponential_distribution<T> dist(static_cast<T>(1));

    // accumulate exponential gaps: element i ends up holding the sum of the first i+1 gaps
    T sum = 0;
    for (auto& elem : elements) {
        elem = sum += dist(prng);
    }
    // one final gap keeps the largest element strictly below 1 after normalisation
    sum += dist(prng);

    for (auto& elem : elements) {
        elem /= sum;
    }
}
This example is simplified by assuming you want values in Uniform(0, 1), but it should be easy to generalise (e.g. for Uniform(a, b) you'd just rescale each value to a + (b - a) * x). Making this work using OMP isn't quite trivial, but shouldn't be too hard.
If you care about the last ~50% of performance there are some numeric tricks that might speed up generating random deviates (e.g. there are faster and better PRNGs than the MT) as well as converting them to doubles (but recent compilers might know about these tricks). A couple of references: Daniel Lemire's blog and Melissa O'Neill's PCG site.
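As a rough illustration of the sort of trick those references describe (using numpy's PCG64 here purely because it's handy; the specifics are my own sketch rather than anything from those posts):

import numpy as np

# pull raw 64-bit words from a PCG64 generator and keep the top 53 bits as the
# mantissa; multiplying by 2**-53 maps them onto evenly spaced doubles in [0, 1)
bits = np.random.PCG64(12345).random_raw(8)
u = (bits >> np.uint64(11)) * 2.0**-53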
I've just benchmarked the algorithm above and discovered that clang's std::uniform_real_distribution and std::exponential_distribution are both very slow. numpy's Ziggurat-based implementations are 8 times faster, such that I can generate 1e9 doubles in ~10 seconds using a single thread on my laptop (i.e. the std implementations take ~80 seconds). I've not tried the OP's implementation on 1e9 elements, but with 1e8 elements mine is ~15 times faster.
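For reference, a vectorised numpy version of the same algorithm might look something like the sketch below (the function name and use of default_rng are illustrative choices, not the exact code I benchmarked):

import numpy as np

def sorted_uniform_numpy(n, rng=None):
    """O(n) sorted uniforms via normalised cumulative sums of exponential gaps."""
    rng = np.random.default_rng() if rng is None else rng
    gaps = rng.exponential(size=n + 1)  # n+1 gaps give n points strictly inside (0, 1)
    sums = np.cumsum(gaps)
    return sums[:-1] / sums[-1]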
Inspired by AndrisBirkmanis's comments, I had a go at creating a version that would permit streaming and/or parallelisation. The idea is to split the n items up into k chunks. An initial amount of work has to be done (O(k)) to figure out what the values are at these chunk boundaries, and then these chunks can be processed independently. This would mean you could stream chunks back or farm them out via OMP. I wrote the code in Python, but my translation to C++ was more than twice as long so am leaving it out.
import itertools  # pairwise requires Python >= 3.10
import numpy as np

def stream_sorted_uniform(n, *, chunksize=1024):
    "iterate over n sorted numbers drawn uniformly from [0, 1]"
    # this code works by generating chunks, first step is to chunk up the range
    # values stores variates at chunk boundaries, blocks specify indices of variates
    blocks = [0]
    # these are unnormalised for now
    values = [np.random.exponential()]
    for i in range(0, n - 1, chunksize):
        # index of last value in this chunk (stopping at n-1 avoids empty chunks)
        j = min(n - 1, i + chunksize)
        # a gamma with shape=k is equivalent to the sum of k exponentials
        blocks.append(j)
        values.append(np.random.gamma(j - i))
    # normalise so chunk values are in [0, 1] with appropriate gaps at either end
    values = np.cumsum(values)
    values /= values[-1] + np.random.exponential()
    # output first element
    yield values[0]
    # elements in this loop can be done in parallel
    for [i, mn], [j, mx] in itertools.pairwise(zip(blocks, values)):
        # generate subsequent variates for this chunk
        x = np.cumsum(np.random.exponential(size=j - i))
        # rescale to the given range, would be nice if numpy exposed an FMA operator!
        x *= (mx - mn) / x[-1]
        x += mn
        # output elements [i+1..j]
        yield from x
If you're running in a REPL, you could test via:
print(list(stream_sorted_uniform(12, chunksize=4)))
To make sure it's doing the right thing, I ran stream_sorted_uniform(301, chunksize=35) a million times and generated a 2D histogram over the quantiles, comparing it to a naive np.sort(np.random.uniform(size=301)), which produces:

[2D histograms of value vs. order-statistic index for the streaming algorithm and the naive sorted-uniform draw]
I'm actually showing the square root of the histogram counts, as otherwise the scale makes things look very boring. But for reference, the midpoints (i.e. index 150, the 50th percentile) have unscaled counts of 46305 and 46216, which looks like the expected Poisson noise from any randomised algorithm.
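If you want to reproduce that check, something along these lines should work (a rough sketch: quantile_histogram and the choice of 64 bins are illustrative, and a million repetitions will take a while):

import numpy as np

def quantile_histogram(draw, n, reps=1_000_000, bins=64):
    """2D histogram of (order-statistic index, value) pairs over many draws."""
    hist = np.zeros((n, bins), dtype=np.int64)
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = np.arange(n)
    for _ in range(reps):
        x = np.asarray(list(draw(n)))
        # bucket each order statistic's value; clip guards against x == 1.0 exactly
        cols = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        hist[rows, cols] += 1
    return hist

h_stream = quantile_histogram(lambda n: stream_sorted_uniform(n, chunksize=35), 301)
h_naive = quantile_histogram(lambda n: np.sort(np.random.uniform(size=n)), 301)
# plot np.sqrt(h_stream) and np.sqrt(h_naive) side by side to compare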