
I'm looking for a good SIMD (AVX2, AVX-512) library with a C/C++ interface (C preferred) to process big arrays of signed and unsigned big integers (mainly 128-, 256-, and 512-bit wide).

SIMD parallelization must obviously work at the array level, not within a particular big-integer value, where it makes little sense due to the limb dependency created by carry propagation.

I want to process big-integer arrays in 4-element chunks in a so-called "deinterleaved" format, e.g.:

#include <immintrin.h>

// 4-element vector of 128-bit integers
// a2lo means the third (index=2) sequential i128 element's low (lo) qword.
//
// Sequential (low-high parts interleaved) format:
// ymm0: a1hi a1lo a0hi a0lo
// ymm1: a3hi a3lo a2hi a2lo
//
// Deinterleaved format:
// ymm0: a3lo a2lo a1lo a0lo // low qwords
// ymm1: a3hi a2hi a1hi a0hi // high qwords
struct sbi_i128v4_t {
  __m256i y[2];
};

// 4-element vector of 256-bit integers
// a2[0] means the third (index=2) sequential i256 element's low ([index=0]) qword.
//
// Deinterleaved format:
// ymm0: a3[0] a2[0] a1[0] a0[0] // low qwords
//       .......................
// ymm3: a3[3] a2[3] a1[3] a0[3] // high qwords
union sbi_i256v4_t {
  __m256i y[4];
  struct { sbi_i128v4_t lo, hi; };
};

My rough estimate shows that SIMD processing of this kind is worthwhile, especially with AVX-512.

Does anyone know of any candidates? Thanks in advance.

Akon
  • AFAIK there is no 64-bit multiplication in SIMD, so it is hard to gain much, since you first lose a lot by going to 32 bits. – Marc Glisse Jan 22 '23 at 09:18
  • You are partially right: there are no 64-bit hardware multiplication units in SIMD (the widest exposed is 52-bit with AVX512IFMA). But if you process, say, 4 bigint values, SIMD code with a 32x32=64 base block will outperform GPR code based on MUL r64. – Akon Jan 22 '23 at 09:50
  • FWIW with AVX512 it is reasonably possible to implement 512-bit carry propagation based on mask trickery. But it's probably still best to work on several different numbers. – harold Jan 22 '23 at 12:17
  • Modern Intel desktop CPUs can execute two 512-bit uops or three 256-bit uops per cycle. That is mainly what I had in mind when I said that AVX512 has bigger potential than AVX2. I see no carry-propagation facilities in the AVX512 mask instructions. They are just additional blending; they don't save CPU cycles when an operation is masked out. As I formulated it, I want to process 4 big ints simultaneously with SIMD. – Akon Jan 22 '23 at 12:36
  • I'm also not considering AVX512IFMA with a 52-bit radix for big integers, because I plan to use power-of-two-sized big integers (128, 256, 512, ...), although for, say, an int208, AVX512IFMA would be the best implementation choice. – Akon Jan 22 '23 at 12:42
  • @Akon the mask trick is described [here](http://www.numberworld.org/y-cruncher/internals/addition.html), it doesn't really matter since you're not going to use it, but let's at least not deny its existence – harold Jan 22 '23 at 13:14
  • If you hope to actually get much or any speedup from SIMD for BigInt stuff, see [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) - Mysticial (author of y-cruncher) describes why partial-word (e.g. using 60 of 64 bits) gives a speedup by allowing carry-propagation to be deferred. This means users of the bigint functions have to know about deferred carry to not let it overflow. – Peter Cordes Jan 22 '23 at 14:21
  • Note that a 64x64 => 128-bit scalar multiply does 4x as much work as one element of `vpmuludq` 32x32 => 64 (or 64x64 => 64-bit like with AVX-512 `vpmullq`, which runs as 3 uops on Intel CPUs, 1 uop on Zen 4 even for ZMM, but not useful since you don't get the high half). See [this](//stackoverflow.com/a/28811226) - 512-bit `vpmuludq` doing eight 32x32 => 64-bit products per uop produces twice as much useful work as one `mul` or `mulx`, but requires a lot of extra shuffling and adding to use those results. It can run 2/clock on Skylake-X with two FMA units if you didn't need to add results. – Peter Cordes Jan 22 '23 at 14:35
  • But that was mostly looking at one *large* integer. If you have lots of 128-bit integers, it's possible you could gain something from SIMD, maybe in 3 chunks with AVX512IFMA52? Semi-related: arrays of 64x64 => 64-bit per-element multiply without AVX-512, it is possible to beat scalar code with AVX2 on Haswell. [Fastest way to multiply an array of int64\_t?](https://stackoverflow.com/q/37296289) But there's no equivalent for `vpmuludq` that does 64x64 => 128-bit products, so widening it will require more than twice as many instructions using 32-bit building blocks. – Peter Cordes Jan 22 '23 at 14:38
  • Thanks all for your tips! – Akon Jan 22 '23 at 16:24
  • @Peter From earlier work I have a custom 64x64=128 "schoolbook" SIMD mul based on VPMUL[U]DQ that consists of 20 instructions (AVX2) in the unsigned case with a throughput of 10 cycles (SKX), i.e. 2.5 cycles per pair (GPR MUL/IMUL r64 has a throughput of 1 cycle, but it loses its advantage on loading/storing data to/from GPRs). All the mul-specific SIMD permutations are included. It outputs the products deinterleaved: the low 64 bits in one register and the high 64 bits in another... – Akon Jan 22 '23 at 16:27
  • ... In the question I mentioned that deinterleaved format, so this 64x64=128 block is very suitable as a building block for big-int mul with radix 2^64. – Akon Jan 22 '23 at 16:27

0 Answers