I'm looking for a good SIMD (AVX2, AVX-512) library with a C/C++ interface (C preferred) for processing large arrays of signed and unsigned big integers (mainly 128, 256, and 512 bits wide).
Obviously, the SIMD parallelization must work at the array level (across elements), not within a single big integer, where it makes little sense because carry propagation creates a dependency between limbs.
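To make the limb dependency concrete, here is a minimal scalar sketch of multi-limb addition. The function name `sbi_add_scalar` and the little-endian limb order are assumptions made for illustration only. Each iteration needs the carry produced by the previous one, which is why vectorizing inside a single value does not pay off.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: r = a + b for two n-limb integers stored
// least-significant limb first. Returns the carry out of the top limb.
static unsigned sbi_add_scalar(uint64_t *r, const uint64_t *a,
                               const uint64_t *b, size_t n) {
    unsigned carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t s  = a[i] + b[i];
        unsigned c1 = s < a[i];      // a[i] + b[i] wrapped around
        uint64_t t  = s + carry;     // needs the carry from limb i-1
        unsigned c2 = t < s;         // adding the carry-in wrapped around
        r[i] = t;
        carry = c1 | c2;             // at most one of c1, c2 can be set
    }
    return carry;
}
```

The loop-carried `carry` is the serial chain; across independent array elements no such chain exists.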
I want to process arrays of big integers in 4-element chunks, in a so-called "deinterleaved" format, e.g.:
#include <immintrin.h>

// 4-element vector of 128-bit integers
// a2lo means third (index=2) sequential i128 element's low (lo) qword.
//
// Sequential (low-high parts interleaved) format:
// ymm0: a1hi a1lo a0hi a0lo
// ymm1: a3hi a3lo a2hi a2lo
//
// Deinterleaved format:
// ymm0: a3lo a2lo a1lo a0lo // low qwords
// ymm1: a3hi a2hi a1hi a0hi // high qwords
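typedef struct sbi_i128v4_t {
    __m256i y[2]; // y[0]: low qwords, y[1]: high qwords
} sbi_i128v4_t;   // typedef so the bare type name is also valid in C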
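As an illustration of the two layouts above, the following is a rough AVX2 sketch of converting four sequential 128-bit integers to the deinterleaved form and back. The helper names `sbi_i128v4_load_seq` and `sbi_i128v4_store_seq` and the `SBI_AVX2` macro are hypothetical, not taken from any existing library. The functions take the `y` array directly, and the lane comments use the same high-to-low notation as above.

```c
#include <immintrin.h>
#include <stdint.h>

// Assumption: GCC/Clang. Enables AVX2 per function; alternatively build with -mavx2.
#if defined(__GNUC__) || defined(__clang__)
#define SBI_AVX2 __attribute__((target("avx2")))
#else
#define SBI_AVX2
#endif

// Load 4 sequential i128 values (memory order: a0lo a0hi a1lo a1hi ...)
// into y[0] = low qwords and y[1] = high qwords.
SBI_AVX2 static void sbi_i128v4_load_seq(__m256i y[2], const uint64_t src[8]) {
    __m256i s0 = _mm256_loadu_si256((const __m256i *)src);        // a1hi a1lo a0hi a0lo
    __m256i s1 = _mm256_loadu_si256((const __m256i *)(src + 4));  // a3hi a3lo a2hi a2lo
    __m256i lo = _mm256_unpacklo_epi64(s0, s1);                   // a3lo a1lo a2lo a0lo
    __m256i hi = _mm256_unpackhi_epi64(s0, s1);                   // a3hi a1hi a2hi a0hi
    y[0] = _mm256_permute4x64_epi64(lo, _MM_SHUFFLE(3, 1, 2, 0)); // a3lo a2lo a1lo a0lo
    y[1] = _mm256_permute4x64_epi64(hi, _MM_SHUFFLE(3, 1, 2, 0)); // a3hi a2hi a1hi a0hi
}

// Inverse: write the deinterleaved vector back in sequential order.
SBI_AVX2 static void sbi_i128v4_store_seq(uint64_t dst[8], const __m256i y[2]) {
    __m256i lo = _mm256_permute4x64_epi64(y[0], _MM_SHUFFLE(3, 1, 2, 0)); // a3lo a1lo a2lo a0lo
    __m256i hi = _mm256_permute4x64_epi64(y[1], _MM_SHUFFLE(3, 1, 2, 0)); // a3hi a1hi a2hi a0hi
    _mm256_storeu_si256((__m256i *)dst,       _mm256_unpacklo_epi64(lo, hi)); // a1hi a1lo a0hi a0lo
    _mm256_storeu_si256((__m256i *)(dst + 4), _mm256_unpackhi_epi64(lo, hi)); // a3hi a3lo a2hi a2lo
}
```

Each direction costs two unpacks and two cross-lane permutes per four elements, so it is probably best done once at the boundaries of a longer computation rather than around every operation.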
// 4-element vector of 256-bit integers
// a2[0] means third (index=2) sequential i256 element's low ([index=0]) qword.
//
// Deinterleaved format:
// ymm0: a3[0] a2[0] a1[0] a0[0] // low qwords
// .......................
// ymm3: a3[3] a2[3] a1[3] a0[3] // high qwords
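typedef union sbi_i256v4_t {
    __m256i y[4];                    // y[i]: qword i of each of the 4 elements
    struct { sbi_i128v4_t lo, hi; }; // lo = y[0..1], hi = y[2..3]; anonymous struct (C11, common C++ extension)
} sbi_i256v4_t;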
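To show the kind of operation I have in mind on this layout, here is a minimal AVX2 sketch of addition for four independent n-limb integers (n = 2, 4, 8 for 128, 256, 512 bits). The names `sbi_add_v4`, `sbi_cmplt_epu64`, and `SBI_AVX2` are hypothetical. AVX2 has no unsigned 64-bit compare, so the sketch flips the sign bit and uses the signed one. The carry chain still runs limb by limb, but all four lanes advance together.

```c
#include <immintrin.h>
#include <stdint.h>

// Assumption: GCC/Clang. Enables AVX2 per function; alternatively build with -mavx2.
#if defined(__GNUC__) || defined(__clang__)
#define SBI_AVX2 __attribute__((target("avx2")))
#else
#define SBI_AVX2
#endif

// All-ones in each 64-bit lane where x < y (unsigned), zero elsewhere.
SBI_AVX2 static inline __m256i sbi_cmplt_epu64(__m256i x, __m256i y) {
    const __m256i bias = _mm256_set1_epi64x(INT64_MIN); // flips the sign bit
    return _mm256_cmpgt_epi64(_mm256_xor_si256(y, bias),
                              _mm256_xor_si256(x, bias));
}

// r = a + b (mod 2^(64*n)) for 4 integers at once; a[i] holds limb i of
// all four integers (deinterleaved layout). The final carry is discarded.
SBI_AVX2 static void sbi_add_v4(__m256i *r, const __m256i *a,
                                const __m256i *b, int n) {
    __m256i carry = _mm256_setzero_si256();      // per lane: 0 or -1 (all-ones)
    for (int i = 0; i < n; ++i) {
        __m256i ai = a[i];
        __m256i s  = _mm256_add_epi64(ai, b[i]);
        __m256i c1 = sbi_cmplt_epu64(s, ai);     // a[i] + b[i] wrapped
        __m256i t  = _mm256_sub_epi64(s, carry); // subtracting -1 adds the carry-in
        __m256i c2 = sbi_cmplt_epu64(t, s);      // adding the carry-in wrapped
        r[i] = t;
        carry = _mm256_or_si256(c1, c2);
    }
}
```

With the types above this could be called as `sbi_add_v4(r.y, a.y, b.y, 4)` for `sbi_i256v4_t`. Since two's-complement addition produces the same bits for signed and unsigned operands, the same routine covers both. With AVX-512, unsigned compares into mask registers (e.g. `_mm512_cmplt_epu64_mask`) should remove the sign-flip workaround and double the lane count.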
My rough estimate suggests that vectorizing this kind of processing is worthwhile, especially with AVX-512.
Does anyone know of any candidates? Thanks in advance.