
OS: Linux (Debian 10)

CC: GCC 8.3

CPU: i7-5775C

There are unsigned __int128/__int128 types in GCC, but is there any way to have a uint256_t/int256_t in GCC?

I have read about a __m256i, which seems to come from Intel. Is there a header I can include to get it?

Is it as usable as a hypothetical unsigned __int256? I mean, can you assign from/to it, compare values, do bitwise operations, etc.?

What is its signed equivalent (if any)?


EDIT 1:

I achieved this:

#include <immintrin.h>
typedef __m256i uint256_t;

and it compiled. If I can do some operations with it, I'll update here.


EDIT 2:

Issues found:

uint256_t   m;
int         l = 5;

m = ~((uint256_t)1 << l);

output:

error: can’t convert a value of type ‘int’ to vector type ‘__vector(4) long long int’ which has different size
  m = ~((uint256_t)1 << l);
    of course you can't just use `__m256i` as an integer type because it isn't an integer type but a vector, as mentioned in the error output. See [Is it possible to use SSE and SSE2 to make a 128-bit wide integer?](https://stackoverflow.com/q/12200698/995714), [Integer SIMD Instruction AVX in C](https://stackoverflow.com/q/24399754/995714), [practical BigNum AVX/SSE possible?](https://stackoverflow.com/q/27923192/995714) – phuclv Apr 23 '19 at 03:43
    if you just want a 256-bit int type then there are a lot of duplicates [128/256-bit fixed size integer types](https://stackoverflow.com/q/5242819/995714), [Representing 128-bit numbers in C++](https://stackoverflow.com/q/1188939/995714), [C++: How do I store a 256 bit number, and how do I convert it to hex?](https://stackoverflow.com/q/5344049/995714)... – phuclv Apr 23 '19 at 03:44
    @phuclv All those questions are C++. I'll have a look at them to see if something is useful in C though. – alx - recommends codidact Apr 23 '19 at 10:52

3 Answers


Clang had _ExtInt extended integers that support operations other than division, but SIMD isn't useful for that because of carry between elements (footnote 1). Other mainstream x86-64 compilers don't even have that; you need a library or something similar to define a custom type and use the same add-with-carry instructions clang will use (or a less efficient emulation in pure C, footnote 2).

This has now been renamed _BitInt(n), and will be part of ISO C23. (clang -std=gnu2x).
As an extension, clang also accepts _BitInt in C++, regardless of revision, even with -std=c++11 rather than -std=gnu++11. Also in earlier C revisions, like -std=gnu11 or -std=c11.


typedef unsigned _BitInt(256) u256;
typedef /*signed*/ _BitInt(256) i256;

// works with clang 16
// but not GCC yet (as of April 2023)

int lt0(i256 a){
    return a < 0;  // just the sign bit in the top limb
}

// -march=broadwell allows mulx which clang uses, and adox/adcx which it doesn't
u256 mul256(u256 a, u256 b){
    return a*b;
}

Godbolt with clang -std=gnu2x - works even with -m32, where it's 8x 32-bit limbs instead of just 4x 64-bit. Multiply and divide expand inline to a large amount of code, not calling helper functions, so use carefully.

  • Clang 11 and 12 supported _ExtInt(256), except for divide wider than 128. Not _BitInt.
    a<0 required explicit casts like a < (i256)0.
  • Clang 13 added implicit conversion from int to _ExtInt types. Still no divide support for integers wider than 128-bit.
  • Clang 14 and 15 support _BitInt(n), but only for sizes up to _BitInt(128), so all supported sizes support division.
  • Clang 16 and later accept unsigned _BitInt(256) bar;, including mul and div (but it's expanded inline, not a helper function, so code-size is large for those ops.)
  • GCC 12 and pre-13 trunk don't yet support _BitInt at all.

SIMD 256-bit vectors aren't 256-bit scalar integers

__m256i is an AVX2 SIMD vector of 4x uint64_t (or a narrower element size like 8x uint32_t). It's not a 256-bit scalar integer type, so you can't use it for scalar operations; __m256i var = 1 won't even compile. There is no x86 SIMD support for integers wider than 64 bits, and the Intel intrinsic types like __m128i and __m256i are purely for SIMD. You can do bitwise boolean ops with them, though.
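A sketch of that distinction using GCC's generic vector extension (the type and function names here are mine; the Intel __m256i type from <immintrin.h> behaves the same way, and GCC emits AVX2 code if enabled, or splits the work into SSE ops otherwise):

```c
#include <stdint.h>

/* GCC generic vector type: 4 lanes of uint64_t = 256 bits total. */
typedef uint64_t v4u64 __attribute__((vector_size(32)));

v4u64 vand(v4u64 a, v4u64 b) { return a & b; }  /* element boundaries irrelevant */
v4u64 vnot(v4u64 a)          { return ~a; }
v4u64 vshl(v4u64 a, int n)   { return a << n; } /* per-lane: no carry between lanes */
```

Bitwise AND/OR/XOR/NOT give the same result a true uint256_t would, but vshl shifts each 64-bit lane independently: bits shifted past bit 63 of a lane are simply lost, not carried into the next lane.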

GCC's __int128 / unsigned __int128 typically uses scalar add/adc, and/or scalar mul / imul, because AVX2 is generally not helpful for extended precision, unless you use a partial-word storage format so you can defer carry. (SIMD is helpful for stuff like bitwise AND/OR/XOR where element boundaries are irrelevant.)
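By contrast, a minimal example (function names are mine) of __int128 behaving as a genuine scalar integer type, with carry propagating across the full width:

```c
/* unsigned __int128 is a real scalar integer type: GCC compiles + to
 * add/adc, and a 64x64 => 128-bit product to a single mul on x86-64. */
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;   /* carry propagates across all 128 bits */
}

unsigned __int128 mul64x64(unsigned long long a, unsigned long long b)
{
    return (unsigned __int128)a * b;   /* full-width product, no truncation */
}
```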


Footnote 1: There actually is some scope for using SIMD for BigInteger types, but only with a specialized format. And more importantly, you have to manually choose when to re-normalize (propagate carry) so your calculations have to be designed around it; it's not a drop-in replacement. See Mysticial's answer on Can long integer routines benefit from SSE?

Footnote 2: Unfortunately, C does not provide access to the carry-out from addition/subtraction, so it's not even convenient to write in C. sum = a+b / carry = sum<a works for the carry-out when there's no carry-in, but it's much harder to write a full adder in C. And compilers typically make poor asm that doesn't just use native add-with-carry instructions on machines where they're available. Extended-precision libraries for very big integers, like GMP, are typically written in asm.
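To make footnote 2 concrete, here is a sketch of such a pure-C full adder over 64-bit limbs (names are mine, not from any particular library):

```c
#include <stdint.h>

/* Portable full adder: the carry has to be reconstructed with
 * comparisons because C exposes no carry flag. */
static uint64_t adc64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    uint64_t s = a + b;
    unsigned c = (s < a);    /* carry out of a + b */
    s += cin;
    c |= (s < cin);          /* carry out of adding the carry-in */
    *cout = c;
    return s;
}

/* d = x + y over 4 little-endian 64-bit limbs (256 bits); returns the final carry. */
unsigned add256(uint64_t d[4], const uint64_t x[4], const uint64_t y[4])
{
    unsigned c = 0;
    for (int i = 0; i < 4; i++)
        d[i] = adc64(x[i], y[i], c, &c);
    return c;
}
```

Compilers sometimes recognize this comparison idiom in simple cases, but often fail to chain adc cleanly across all four limbs; on x86 the _addcarry_u64 intrinsic from <immintrin.h> is a more direct route.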

    But can't I use that `__m256i` in my code? I mean like `__m256i var1 = 0x7u; __m256i var2 = 0x8u; __m256i var3 = var2 & var1;` – alx - recommends codidact Apr 22 '19 at 23:37
    @CacahueteFrito no, `__m256i` is meant for AVX2, which isn't a single 256-bit integer – phuclv Apr 23 '19 at 02:14
  • @CacahueteFrito __m256i is basically 8 x 32bit integers (and some other variants). Those 8 integers are basically separate variables. Mixing and matching those ints (as you'd need for add-with-carry) comes with a performance penalty. Communication across the 2x 128bit lanes in that type is even more expensive. You will need to find a C++ library that handles large ints using the standard 64bit/32bit integer types. – robthebloke Apr 23 '19 at 06:34
  • "You need a library that uses add-with-carry." You need a library, but how it is implemented is not that relevant. IIRC RISCV does not have add-with-carry, but you can still emulate a 256-bit integer type on it. – Marc Glisse Apr 23 '19 at 07:01
    @MarcGlisse: if it does not have add-with-carry natively, it will have to emulate it. But you do need add-with-carry, unless you have a native 256 bit type. – Rudy Velthuis Apr 23 '19 at 09:39
  • @MarcGlisse: MIPS doesn't have add-with-carry, and neither does any other ISA without a flags register. That might include RISC-V, I forget. This was an x86-64 question (which of course does have `adc`), but I expanded the wording anyway since I was phrasing it in a more ISA-neutral way. – Peter Cordes Apr 23 '19 at 22:58
  • Does clang's `_ExtInt()` have a limit on the integer size? I didn't find that in the documentation. – alx - recommends codidact Oct 27 '21 at 19:38
  • Answering my own question after some tests I did, yes, `_ExtInt()` is effectively a bignum type. Kudos to Clang. And the code is quite optimized :) – alx - recommends codidact Oct 27 '21 at 19:56
    Clang seems to have dropped the support for the bit-length being greater than 128 starting from clang-14 (see [PR44994](https://github.com/llvm/llvm-project/issues/44994)). – Ignat Loskutov Nov 17 '22 at 10:01

I needed "uint256_t" only while computing "f(x) = (x^2+a) mod n" in the Pollard rho algorithm. All variables outside function "f" are of the builtin type __uint128_t.

I implemented uint256_t for that purpose simply as:

typedef __uint128_t uint256_t[2];

And then I implemented the functions needed for computing "f()":

__uint128_t set_128(unsigned long h, unsigned long l);
void set_256(uint256_t d, __uint128_t l, __uint128_t h);
void add_128(uint256_t d, uint256_t x, __uint128_t a);
void add_256(uint256_t d, uint256_t x, uint256_t a);
void shl_256(uint256_t d, long s);
void sqr_128(uint256_t d, __uint128_t x);
/* several print functions and macros for printing 128-bit and 256-bit numbers */
__uint128_t mod_256(uint256_t x, __uint128_t n);
__uint128_t f(__uint128_t x);
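Not the gist's actual code, but a sketch of how two of these functions might look, assuming little-endian limb order ([0] = low 128 bits):

```c
typedef __uint128_t uint256_t[2];   /* [0] = low 128 bits, [1] = high */

/* d = x + a; the carry out of the low half is recovered by comparison. */
void add_256(uint256_t d, uint256_t x, uint256_t a)
{
    __uint128_t lo = x[0] + a[0];
    d[1] = x[1] + a[1] + (lo < x[0]);   /* carry into the high half */
    d[0] = lo;
}

/* d = x * x as a 256-bit result, schoolbook-style on 64-bit halves:
 * (l + h*2^64)^2 = l^2 + 2*l*h*2^64 + h^2*2^128 */
void sqr_128(uint256_t d, __uint128_t x)
{
    __uint128_t xl = (unsigned long long)x;        /* low 64 bits */
    __uint128_t xh = x >> 64;                      /* high 64 bits */
    __uint128_t cross = xl * xh;                   /* fits in 128 bits */
    uint256_t mid = { cross << 64, cross >> 64 };  /* cross * 2^64 as 256-bit */
    d[0] = xl * xl;
    d[1] = xh * xh;
    add_256(d, d, mid);   /* add the cross term twice for the factor of 2 */
    add_256(d, d, mid);
}
```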

Find the implementation in this gist:
https://gist.github.com/Hermann-SW/a20af17ee6666467fe0b5c573dae701d

I benchmarked my code against gmplib functions and (after a lot of work) achieved a speedup over gmplib for all of them; for details:
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=311893&p=1873552#p1873552

Runtimes in nanoseconds for 1 million executions of a function: (benchmark results image)

    For your benchmarks that show factors of 155k speedup for plain `__uint128_t`, it's likely that most of the work optimized away or was hoisted out of the loop, but the gmp function calls were opaque to the optimizer. It's impossible for a modern Intel/AMD CPU to loop faster than 1 iteration per clock cycle, or to do more than 1 scalar 64x64 => 128-bit `mul` per clock. If you're finding your code ran faster than that, it actually optimized away. (147 ns at 4GHz is only 588 cycles, totally implausible to do a million of anything without huge algorithmic optimization, i.e. not doing all the work) – Peter Cordes Jun 04 '21 at 21:08
  • See for example stuff like Google Benchmark's `DoNotOptimize()` function to make the compiler forget what it knows about a value. Q&As: [Preventing compiler optimizations while benchmarking](https://stackoverflow.com/q/40122141) / [Google Benchmark Frameworks DoNotOptimize](https://stackoverflow.com/q/66795357) – Peter Cordes Jun 04 '21 at 21:09
  • Your speedups over GMP for your `uint256_t` do look more plausible, though: especially for squaring, three multiplies and a few adds aren't much work vs. GMP's general case with loops for variable length. I'm still surprised it's as big as 20, though; perhaps *some* of the work still optimized away, or maybe there's more GMP overhead if you're not directly using `mpn_` functions. – Peter Cordes Jun 04 '21 at 21:13
    You are right, the compiler completely optimized away the multiplications: https://godbolt.org/z/584f4e9o4 I prefixed the summation variable "s" with volatile, and the assembler code changes completely: https://godbolt.org/z/WdnG458E1 The speedup of __uint128_t multiplication over gmplib "mpz_mul()" is now 14694553/1497909 = 9.8 times. – HermannSW Jun 05 '21 at 01:25
  • I prefixed __uint128_t variable "a" with volatile and did run benchmark again. Now speedup of uint256_t "sqr()" function over gmp_lib "mpz_mul()" is 17598169 / 3265127 = 5.39, so speedup 20 was wrong as well. – HermannSW Jun 05 '21 at 01:36
    `volatile` might be hurting *more* than necessary, forcing extra loads as well as stopping the compiler from noticing that `a_high * a_low` is the same calculation as `a_low * a_high`, if you don't special-case it but instead rely on the compiler to notice the redundancy when inlining a normal multiply function with both args the same. Or just general all around extra work from all the extra loads every time you mention `a`. Microbenchmarking is hard: you have to check the asm to see if the work in the loop is exactly what you want to measure. (Although DoNotOptimize can help a lot.) – Peter Cordes Jun 05 '21 at 01:43
  • I forgot to mention that "mpz_mul()" does special case handling identical operands as well (in mpz/mul.c): ... if (up == vp) { mpn_sqr (wp, up, usize); cy_limb = wp[wsize - 1]; } else ... – HermannSW Jun 05 '21 at 01:48
  • GMP is an arbitrary-precision library, so obviously it's much slower than fixed-precision ones like [boost::multiprecision::uint256_t](https://www.boost.org/doc/libs/1_76_0/libs/multiprecision/doc/html/boost_multiprecision/tut/ints/cpp_int.html), https://github.com/calccrypto/uint256_t or [ttmath:UInt<4>](https://www.ttmath.org/). You should compare with fixed-width libraries instead – phuclv Jun 05 '21 at 04:43
  • Peter, you are right again with "volatile might be hurting more than necessary"; at least it depends on where it is applied. I made the summation variable s volatile, and the speedup shrank to 9.8 times. Instead I made variables a and b volatile, and the speedup is now 13.2. That "s" works as expected can be verified by comparing the sums reported at the end. Btw, what a clobber does (create a potential use), I always do for summation variables by explicitly using them in a print statement AFTER the runtime has been determined, so that the print has no effect on the measured runtime. – HermannSW Jun 05 '21 at 15:00
  • phuclv, I extended my performance test (summing 1 million squares of 128-bit numbers) with ttmath. Compile instructions as well as test results are in the sqr.cpp gist: https://gist.github.com/Hermann-SW/83c8ab9e10a0bb64d770af543ed08445 To my surprise, ttmath is slightly faster than gmplib, with speedup 16054087 / 14803914 = 1.08. But I will go with my own uint256_t because that has speedup 16054087 / 2786120 = 5.76 over gmplib! – HermannSW Jun 06 '21 at 07:51
  • Because I run on a 64-bit i7, ttmath::UInt<4> is a 256-bit number. – HermannSW Jun 06 '21 at 07:58
    @HermannSW Can you please update your table with the new found "reasonable" speedup? It would be likely useful to use [table markdown](https://meta.stackexchange.com/questions/356997/new-feature-table-support) for that, too. – Simon Sobisch Sep 21 '21 at 08:13

You can see a little about this in the Bitcoin source code; it uses a C library for this. For Delphi/Pascal you can use Fundamentals5 (https://github.com/fundamentalslib/fundamentals5). With some searching, it is possible to find more third-party (including paid) libraries. There are also solutions for more than 256 bits.