clang performance drop when using uniform_real_distribution

Question

The following code results in very different times for g++ and clang++ when using uniform_real_distribution.

#include <iostream>
#include <sstream>
#include <fstream>

#include <chrono>
#include <random>


std::mt19937::result_type seed = 0;
std::mt19937 gen(seed);
// std::uniform_int_distribution<size_t> distr(0, 1);
std::uniform_real_distribution<double> distr(0.0,1.0);

int main()
{
    auto t_start = std::chrono::steady_clock::now();
    for (auto i = 1; i <= 1000000; ++i)
    {
        distr(gen);
    }
    auto t_end = std::chrono::steady_clock::now();
    std::cout << "elapsed time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_end - t_start).count()  << " ns\n" << std::endl;

    return 0;
}

Compiled with the following commands:

clang++ -std=c++17 -O3 -flto -march=native -mllvm -inline-threshold=10000000 rng.cpp -o rng
g++ -std=c++17 -O3 -march=native rng.cpp -o rng

This results in the following times:

clang:  272929774 ns

gcc:    12054635 ns

When using the commented-out distribution instead, the times are:

clang:  48155862 ns

gcc:    50226810 ns

I found a fairly old question that deals with the same problem, but none of the proposed solutions worked in my case:

Clang performance drop for specific C++ random number generation

Does anyone have an idea what is going on here?

`-O3` is dangerous especially when floating point is used! If you are playing with compilation flags [watch this](https://youtu.be/w5Z4JlMJ1VQ). — Marek R, Oct 08 '19 at 09:22
`-O2` leads to the same picture. Unfortunately, `distr(1.0, 2.0);` does not help either. — Satas, Oct 08 '19 at 09:35
When performing these microbenchmarks, _always_ verify that your code isn't just optimized away! — Max Langhof, Oct 08 '19 at 09:46
Yes, as mentioned below, I would actually be glad if the code were optimized away whenever that is possible, so the question here should be: why can't clang do that? — Satas, Oct 08 '19 at 10:42

Marek R · Accepted Answer · 2019-10-08T11:47:36.247

Take a look at the compiler output on godbolt.

With gcc, the compiler threw away distr(gen); entirely:

.L27:
        dec     esi
        je      .L25

This is a for loop that does nothing!

With clang, the compiler was not smart enough to do the same:

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     edi, offset gen
        call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)
        dec     ebx
        jne     .LBB0_1

And generate_canonical was actually called.

Basically, you must use the result of distr(gen); in a way that affects the program's outcome; otherwise the compiler is free to remove that code.

The simplest fix is to accumulate the results of distr(gen); and print the sum.

Now, when you look at the assembly, you can see that clang calls the function std::generate_canonical<double, 53ul, std::mersenne_twister_engine< .... >>, whereas gcc simply placed the respective code inline.

Most probably this difference is caused by a different organization of the standard library. Clang used a version built into the standard library, whereas gcc used the template from the header file to generate the code directly in the assembly it produced. When the compiler reaches external code from a library, it can't tell exactly what that code does, so it is unable to optimize it away (since some side effects could be hidden in the library).

edited Oct 08 '19 at 11:47

answered Oct 08 '19 at 09:36

Marek R


Hello, could you think of a good way to do so? However, in the "real" code where the problem first occurred, I definitely use this value, so this is probably not the only reason here... – Satas Oct 08 '19 at 09:49
By the way, it would be fantastic if I could find out why clang is **not** able to optimize it away. I am very interested in this optimization. – Satas Oct 08 '19 at 10:19
>> Most probably this difference is caused by different organization of standard library. Yes, that is the point. It is probably even a bug, as it really hurts performance here. When performing simulations you essentially use random numbers all the time, so my actual simulation became twice as slow when using clang... Thanks for the advice and explanations. – Satas Oct 08 '19 at 12:11
Did you do measurements AFTER the fix? I doubt the difference is so big when `distr(gen);` isn't discarded by gcc. – Marek R Oct 08 '19 at 12:26
Yes, I have performed the measurements with your proposed version and the results are: clang: acc: 1.50021e+06, elapsed time: 328040263 ns; gcc: acc: 1.50021e+06, elapsed time: 64385650 ns. – Satas Oct 08 '19 at 13:04
So clang is still about 5 times slower. – Satas Oct 08 '19 at 13:05
