std::uniform_real_distribution vs. std::uniform_int_distribution performance

Question

I ran into a performance degradation in one of my applications that I pinpointed to the generation of random data. I wrote a simple benchmark that essentially does the same:

#include <chrono>
#include <iostream>
#include <random>

std::mt19937 random_engine{std::random_device()()};

// Generate one million random numbers
template <typename T, typename Distribution>
std::vector<T> generate_random(Distribution distribution) {
  std::vector<T> data(1000000);

  std::generate_n(data.begin(), 1000000, [&]() {
    return static_cast<T>(distribution(random_engine));
  });
  return data;
}

template <typename T>
std::vector<T> create_data() {
  if constexpr (std::is_same_v<T, float>)
    return generate_random<float>(
        std::uniform_real_distribution<float>(-127.0f, 127.0f));
  if constexpr (std::is_same_v<T, int8_t>)
    return generate_random<int8_t>(
        std::uniform_int_distribution<int32_t>(-127, 127));
}

int main() {
  auto start = std::chrono::system_clock::now();
  auto float_data = create_data<float>();
  std::cout << "Time (float): " << (std::chrono::system_clock::now() - start).count()
            << '\n';

  start = std::chrono::system_clock::now();
  auto int8_data = create_data<int8_t>();
  std::cout << "Time (int8): " << (std::chrono::system_clock::now() - start).count()
            << '\n';

  return 0;
}

On my machine this outputs:

〉g++ -v
...
Apple clang version 11.0.3 (clang-1103.0.32.29)
Target: x86_64-apple-darwin19.5.0
...

〉g++ tmp.cpp -std=c++17 -O3 && ./a.out
Time (float): 68033
Time (int8): 172771

Why does sampling from the real distribution take less time than from the int distribution?

UPDATE

libc++ and libstdc++ show completely opposite behaviour. I'm still looking into where the difference in implementation lies. See libc++ vs. libstdc++

Measuring unoptimized code is meaningless. Turn on optimization (-O3 for instance) and see what you get then. — NathanOliver, Jul 07 '20 at 20:14
First rule of benchmarking: Always build with optimizations. There's a large difference between building with or without optimizations enables with your code (once you add the missing `` header include). By the way, I can't get your program to run, are you sure you posted a [mcve] of the actual program you run? (See e.g. [here](https://godbolt.org/z/kW6eYH) for the difference, the unoptimized code is over 2000 lines of assembler code, the optimized around 500 lines). — Some programmer dude, Jul 07 '20 at 20:15
Very true, thanks for picking that up @NathanOliver. Though with -O3 the result is the same. I updated the answer — 137, Jul 07 '20 at 20:26
@Someprogrammerdude It builds fine for me with the algorithm header — 137, Jul 07 '20 at 20:29
Your code has undefined behavior. In `std::uniform_int_distribution(127, -127)` the `-127` should be first. After fixing that you still get the same results, and I'm not sure why so I'll leave that to others. — NathanOliver, Jul 07 '20 at 20:36
@Someprogrammerdude FWIW, you can't use `int8_t` with `std::uniform_int_distribution`. — NathanOliver, Jul 07 '20 at 20:41
Could be a problem of pigeon-hole rejection with the specific limits you've chosen. Try something with a power-of-two range such as (0,128). — Mark Ransom, Jul 07 '20 at 20:46
Is the timing consistent with the number of samples? See e.g. https://quick-bench.com/q/_H_yOA9bmWj88J5ybIy5k9iW7kA or https://quick-bench.com/q/B_rOiqliHlLSO0RGl2OUBeePh5c — Bob__, Jul 07 '20 at 20:49
@Bob__ Looks like it might be a libc++ issue: https://quick-bench.com/q/inw74BXJmo7G9c6WkBdFmWvdro8 — NathanOliver, Jul 07 '20 at 20:53
@NathanOliver Wow, that's a 25x difference between the two implementations. I wonder if the fast one is also "correct". — Bob__, Jul 07 '20 at 21:01
The result seems to depend on many things. I've experimented on the compiler explorer with different variants and different compilers. Using Clang it consistently gave *smaller* numbers for `int8_t`, while GCC did the opposite. Using different implementations (like using specializations instead of `if constexpr`) made no difference *except* when adding a third test using `int32_t` where the `int8_t` case suddenly improved. Lastly, removing that `static_cast` in `generate_random` helped overall. — Some programmer dude, Jul 07 '20 at 21:02

Peter O. · Answer 1 · 2022-04-02T22:15:02.507

Recall that the C++ standard does not specify a particular algorithm for random number distributions, including uniform_int_distribution and uniform_real_distribution.

You will thus have to investigate your particular implementation of the C++ standard library (which will generally be easy for the Clang compiler, since it tends to use the open-source library libstdc++). However, there are differences between generating a floating-point number (such as float) in the interval [a, b) and generating an integer in the same interval:

Floating-point numbers: In most practical cases, there are more floating-point numbers in a given interval than there are integers in that interval. An implementation can generate "uniform" pseudo-random floating-point numbers in a range by producing a "uniform" pseudo-random floating-point number in [0, 1) (such as by using generate_canonical, whose specification is unfortunately flawed as of now), then scaling that number to fit the range given by uniform_real_distribution. This can involve the use of floating-point multiplication, division, or other operations.
Integers: Generating integers in a range usually involves generating enough random bits to fit the range, then using modulo reduction or rejection sampling (and the latter is unbiased). The process will tend not to use floating-point operations (which are relatively slow compared to integer operations), which could explain the performance difference you have discovered.

Thanks for the link. It is indeed most likely due to the libc++ implementations of the distributions. Your second point indicates that the integer distribution should be faster to generate from, which is not the case here. In fact libstdc++ and libc++ show completely opposite behaviour, so it might not be due to floating point ops being used or not during generation. I also tested the performance of double and it's slower than float but still faster than any of the integer types. — 137, Jul 08 '20 at 00:43

std::uniform_real_distribution vs. std::uniform_int_distribution performance

1 Answers1