
I'm looking for a good SIMD (AVX2, AVX-512) library with a C/C++ interface (C preferred) to process big arrays of signed and unsigned big integers (mainly 128-, 256-, and 512-bit wide).

SIMD parallelization must obviously work at the array level, not within a particular big-integer value, where it makes little sense due to the limb dependency created by carry propagation.

I want to process big-integer arrays in 4-element chunks in a so-called "deinterleaved" format, e.g.:

#include <immintrin.h>

// 4-element vector of 128-bit integers
// a2lo means the third (index=2) sequential i128 element's low (lo) qword.
//
// Sequential (low-high parts interleaved) format:
// ymm0: a1hi a1lo a0hi a0lo
// ymm1: a3hi a3lo a2hi a2lo
//
// Deinterleaved format:
// ymm0: a3lo a2lo a1lo a0lo // low qwords
// ymm1: a3hi a2hi a1hi a0hi // high qwords
struct sbi_i128v4_t {
  __m256i y[2];
};

// 4-element vector of 256-bit integers
// a2[0] means the third (index=2) sequential i256 element's low ([index=0]) qword.
//
// Deinterleaved format:
// ymm0: a3[0] a2[0] a1[0] a0[0] // low qwords
//       .......................
// ymm3: a3[3] a2[3] a1[3] a0[3] // high qwords
union sbi_i256v4_t {
  __m256i y[4];
  struct { sbi_i128v4_t lo, hi; };
};

My rough estimate shows that SIMD processing of this kind is worthwhile, especially with AVX-512.

Does anyone know of any candidates? Thanks in advance.

Akon
  • AFAIK there is no 64-bit multiplication in SIMD, so it is hard to gain much, since you first lose a lot by going to 32 bits. – Marc Glisse Jan 22 '23 at 09:18
  • You are partially right: there are no 64-bit hardware multiplication units in SIMD (the widest exposed is 52-bit with AVX512IFMA). But if you process, say, 4 bigint values, SIMD code with a 32x32=64 base block will outperform GPR code based on MUL r64. – Akon Jan 22 '23 at 09:50
  • FWIW with AVX512 it is reasonably possible to implement 512-bit carry propagation based on mask trickery. But it's probably still best to work on several different numbers. – harold Jan 22 '23 at 12:17
  • Modern Intel desktop CPUs can execute two 512-bit uops or three 256-bit uops per cycle. That is mainly what I had in mind when I said that AVX512 has bigger potential than AVX2. I see no carry-propagation facilities in the AVX512 mask instructions. They are just additional blending; they don't save CPU cycles when an operation is masked out. As I formulated it, I want to process 4 big ints simultaneously with SIMD. – Akon Jan 22 '23 at 12:36
  • I'm also not considering AVX512IFMA with a 52-bit radix for big integers, because I plan to use power-of-two-sized big integers (128, 256, 512, ...), although for, say, an int208, AVX512IFMA would be the best implementation choice. – Akon Jan 22 '23 at 12:42
  • @Akon the mask trick is described [here](http://www.numberworld.org/y-cruncher/internals/addition.html), it doesn't really matter since you're not going to use it, but let's at least not deny its existence – harold Jan 22 '23 at 13:14
  • If you hope to actually get much or any speedup from SIMD for BigInt stuff, see [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) - Mysticial (author of y-cruncher) describes why partial-word (e.g. using 60 of 64 bits) gives a speedup by allowing carry-propagation to be deferred. This means users of the bigint functions have to know about deferred carry to not let it overflow. – Peter Cordes Jan 22 '23 at 14:21
  • Note that a 64x64 => 128-bit scalar multiply does 4x as much work as one element of `vpmuludq` 32x32 => 64 (or 64x64 => 64-bit like with AVX-512 `vpmullq`, which runs as 3 uops on Intel CPUs, 1 uop on Zen 4 even for ZMM, but not useful since you don't get the high half). See [this](//stackoverflow.com/a/28811226) - 512-bit `vpmuludq` doing eight 32x32 => 64-bit products per uop produces twice as much useful work as one `mul` or `mulx`, but requires a lot of extra shuffling and adding to use those results. It can run 2/clock on Skylake-X with two FMA units if you didn't need to add results. – Peter Cordes Jan 22 '23 at 14:35
  • But that was mostly looking at one *large* integer. If you have lots of 128-bit integers, it's possible you could gain something from SIMD, maybe in 3 chunks with AVX512IFMA52? Semi-related: arrays of 64x64 => 64-bit per-element multiply without AVX-512, it is possible to beat scalar code with AVX2 on Haswell. [Fastest way to multiply an array of int64\_t?](https://stackoverflow.com/q/37296289) But there's no equivalent for `vpmuludq` that does 64x64 => 128-bit products, so widening it will require more than twice as many instructions using 32-bit building blocks. – Peter Cordes Jan 22 '23 at 14:38
  • Thanks all for your tips! – Akon Jan 22 '23 at 16:24
  • @Peter From earlier work I have a custom 64x64=128 "schoolbook" SIMD mul based on VPMUL[U]DQ that consists of 20 instructions (AVX2) in the unsigned case with a throughput of 10 cycles (SKX), i.e. 2.5 cycles per pair (GPR MUL/IMUL r64 has a throughput of 1 cycle, but it loses its advantage on loading/storing data to/from GPRs). All the mul-specific SIMD permutations are included. It outputs the products deinterleaved: the low 64 bits in one register and the high 64 bits in another... – Akon Jan 22 '23 at 16:27
  • ... In the question I mentioned that deinterleaved format, so this 64x64=128 block is very suitable as a building block for big-int mul with radix 2^64. – Akon Jan 22 '23 at 16:27

0 Answers