
Using C++11's &lt;random&gt; library, I encountered an odd performance drop when using std::mt19937 (32- and 64-bit versions) in combination with a uniform_real_distribution (float or double, doesn't matter). Compared to a g++ build, it's more than an order of magnitude slower!

The culprit isn't just the mt generator, as it's fast with a uniform_int_distribution. And it isn't a general flaw in the uniform_real_distribution, since that's fast with other generators like default_random_engine. Just that specific combination is oddly slow.

I'm not very familiar with the internals, but the Mersenne Twister algorithm is more or less strictly defined, so I wouldn't expect a difference in implementation to account for this. The measurement program follows below; first, my results for clang 3.4 and gcc 4.8.1 on a 64-bit Linux machine:

gcc 4.8.1
runtime_int_default: 185.6
runtime_int_mt: 179.198
runtime_int_mt_64: 175.195
runtime_float_default: 45.375
runtime_float_mt: 58.144
runtime_float_mt_64: 94.188

clang 3.4
runtime_int_default: 215.096
runtime_int_mt: 201.064
runtime_int_mt_64: 199.836
runtime_float_default: 55.143
runtime_float_mt: 744.072  <--- this and
runtime_float_mt_64: 783.293 <- this is slow

The program used to generate these numbers, to try out yourself:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>

template< typename T_rng, typename T_dist>
double time_rngs(T_rng& rng, T_dist& dist, int n){
    std::vector< typename T_dist::result_type > vec(n, 0);
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)
        vec[i] = dist(rng);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto runtime = std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000.0;
    auto sum = vec[0]; // touch the data so the compiler cannot elide the loop
    (void)sum;         // silence the unused-variable warning
    return runtime;
}

int main(){
    const int n = 10000000;
    unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();
    std::default_random_engine rng_default(seed);
    std::mt19937 rng_mt (seed);
    std::mt19937_64 rng_mt_64 (seed);
    std::uniform_int_distribution<int> dist_int(0,1000);
    std::uniform_real_distribution<float> dist_float(0.0, 1.0);

    // print max values
    std::cout << "rng_default_random.max(): " << rng_default.max() << std::endl;
    std::cout << "rng_mt.max(): " << rng_mt.max() << std::endl;
    std::cout << "rng_mt_64.max(): " << rng_mt_64.max() << std::endl << std::endl;

    std::cout << "runtime_int_default: " << time_rngs(rng_default, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt: " << time_rngs(rng_mt, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt_64: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;
    std::cout << "runtime_float_default: " << time_rngs(rng_default, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt: " << time_rngs(rng_mt, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt_64: " << time_rngs(rng_mt_64, dist_float, n) << std::endl;
}

Compile via `clang++ -O3 -std=c++11 random.cpp`, or with g++ respectively. Any ideas?

edit: Finally, Matthieu M. had a great idea: the culprit is inlining, or rather a lack thereof. Raising clang's inlining threshold eliminated the performance penalty. That actually solved a number of performance oddities I had encountered. Thanks, I learned something new.
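For reference, these are the compile lines being compared (the threshold value 10000 comes from Matthieu M.'s comment below; the exact number is arbitrary, anything large enough to force aggressive inlining works):

```shell
# Baseline builds, as in the post:
g++     -O3 -std=c++11 random.cpp -o bench_gcc
clang++ -O3 -std=c++11 random.cpp -o bench_clang

# Workaround: raise clang's inlining threshold (its default is lower than gcc's):
clang++ -O3 -std=c++11 -mllvm -inline-threshold=10000 random.cpp -o bench_clang_inlined
```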

Basti
  • Maybe you want to profile things a bit (e.g. with callgrind) and compare the generated assembler... – PlasmaHH Apr 23 '14 at 10:12
  • I can only reproduce this for the `float_mt` case, not for `float_mt_64`. I used your code with clang 3.4 on Fedora 20 64-bit. – Baum mit Augen Apr 23 '14 at 12:28
  • Was going to say post a bug report, but I see you already did: http://llvm.org/bugs/show_bug.cgi?id=19542 – pyCthon Apr 24 '14 at 23:27
  • What about your optimizer settings? They may have a different influence... – Mario May 06 '14 at 10:40
  • -O3 for both, listed at the bottom of the post – Basti May 06 '14 at 14:41
  • @Basti: do you know if both use libstdc++, or does Clang use libc++? A change of the standard library implementation would have huge effects, of course. As another point of comparison, you might want to try raising the inlining level on Clang and see what happens: `-mllvm -inline-threshold=10000` (for example). I seem to remember that Clang has a lower inlining threshold than gcc by default, and this may impact further optimizations (constant propagation notably). – Matthieu M. May 10 '14 at 14:07
  • I don't know about the libs. But that inlining fixed it! Wow, thanks – Basti May 10 '14 at 17:25
  • Be aware that increasing the inlining limit may cause issues elsewhere due to increased icache usage. – Good Person May 10 '14 at 18:28
  • If you enable link time optimization with `-flto`, the time for clang falls back to the expected level even without the inlining changes. Performance tuning is non-trivial. – jepio Oct 05 '15 at 11:45

1 Answer


As already stated in the comments, the problem is caused by the fact that gcc inlines more aggressively than clang. If we make clang inline very aggressively, the effect disappears:

Compiling your code with g++ -O3 yields

runtime_int_default: 3000.32
runtime_int_mt: 3112.11
runtime_int_mt_64: 3069.48
runtime_float_default: 859.14
runtime_float_mt: 1027.05
runtime_float_mt_64: 1777.48

while clang++ -O3 -mllvm -inline-threshold=10000 yields

runtime_int_default: 3623.89
runtime_int_mt: 751.484
runtime_int_mt_64: 751.132
runtime_float_default: 1072.53
runtime_float_mt: 968.967
runtime_float_mt_64: 1781.34

Apparently, clang now out-inlines gcc in the int_mt cases, but all of the other runtimes are in the same order of magnitude. I used gcc 4.8.3 and clang 3.4 on Fedora 20 64-bit.

Baum mit Augen