I'm after the fastest 256-bit integer library (which isn't a nightmare to integrate). As part of this I'm trying to get a rough idea of the performance comparison between Clang's _BitInt(256) and Boost.Multiprecision's int256_t.

I've currently got this for Clang's _BitInt(256):
#include <cstdint>
#include <iostream>
#include <x86intrin.h> // for __rdtsc()

using int256_t = signed _BitInt(256);

int main()
{
    for (int i = 0; i < 200; ++i)
    {
        // Using __rdtsc() for something non-deterministic
        const int256_t a = __rdtsc() * __rdtsc() * __rdtsc() * __rdtsc() * __rdtsc() * __rdtsc();
        const int256_t b = __rdtsc() * __rdtsc() * __rdtsc();

        const uint64_t start = __rdtsc();
        const int256_t c = a / b;
        const uint64_t finish = __rdtsc();

        // ostream doesn't support _BitInt(256), so truncate to 64 bits just to keep c "used"
        std::cout << finish - start << " " << static_cast<int64_t>(c) << std::endl;
    }
}
https://godbolt.org/z/9M9TG16ax
but it looks like the divide is getting completely optimized out? I've tried to introduce some randomness into the 256-bit division using __rdtsc(). I usually print the calculated value to prevent dead-code elimination, but ostream isn't supported for _BitInt(256), so I had to do a hacky static_cast.
Could anyone suggest how I could profile this?
Or is there a faster, header-only 256-bit integer library?
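For reference, the Boost.Multiprecision version I'd be comparing against would look roughly like this (just a sketch mirroring the _BitInt(256) code above; int256_t here is boost::multiprecision::int256_t from <boost/multiprecision/cpp_int.hpp>):

#include <cstdint>
#include <iostream>
#include <x86intrin.h>                      // for __rdtsc()
#include <boost/multiprecision/cpp_int.hpp> // header-only cpp_int backend

using boost::multiprecision::int256_t;

int main()
{
    for (int i = 0; i < 200; ++i)
    {
        // Same __rdtsc() trick for non-deterministic inputs
        const int256_t a = __rdtsc() * __rdtsc() * __rdtsc() * __rdtsc() * __rdtsc() * __rdtsc();
        const int256_t b = __rdtsc() * __rdtsc() * __rdtsc();

        const uint64_t start = __rdtsc();
        const int256_t c = a / b;
        const uint64_t finish = __rdtsc();

        // cpp_int supports operator<<, so c can be streamed directly here
        std::cout << finish - start << " " << c << std::endl;
    }
}

At least on the Boost side the result can be printed as-is, so the static_cast hack isn't needed there.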