3

I have a function:

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    for (size_t i = 0; i < size; ++i)
        c[i] = (a[i]*b[i])%p;
}

This function performs many modulo multiplications on arrays of integers. All the integers are positive, and I need to improve its performance.

I thought about SSE and AVX, but they don't have an operation to vectorize modulo multiplication. Or maybe I'm wrong?

Does anybody know of any possibility to solve this problem?

Michael
  • It's a prime integer. – Michael Oct 17 '17 at 12:58
  • 2
    Are you using the same `p` repeatedly? How large is `size` typically? If it's large enough or repeated often enough with the same `p`, it might even be worth JIT-compiling a loop using hard-coded vector shifts along with [the fixed-point multiplicative inverse](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi). Or use http://libdivide.com/ to use multiplicative inverses without JIT, but that has more overhead (`psrlq` with imm8 count is best). It might only have an SSE2 version, not AVX2 or AVX512, though. – Peter Cordes Oct 17 '17 at 13:35
  • 1
    Does `a[i]*b[i]` ever overflow? If yes, would that be ok, or do you want the result of the 64bit result mod `p`? – chtz Oct 18 '17 at 07:02
  • @chtz: I think we can assume that's the existing working+tested source, where signed overflow would be UB (or in practice would wrap at 32 bits). – Peter Cordes Oct 18 '17 at 15:30
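Following up on the multiplicative-inverse / libdivide suggestion in the comments above, here is a minimal scalar sketch using libdivide's C API (the function name FuncLibdivide is hypothetical, and it assumes the same p is reused for the whole array so the divider setup cost is paid only once):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h" /* https://libdivide.com/ */

void FuncLibdivide(const int32_t * a, const int32_t * b, size_t size, int32_t p, int32_t * c)
{
    /* Precompute the multiplicative-inverse "magic numbers" for p once. */
    struct libdivide_s32_t fast_p = libdivide_s32_gen(p);
    for (size_t i = 0; i < size; ++i)
    {
        int32_t d = a[i] * b[i];
        int32_t q = libdivide_s32_do(d, &fast_p); /* d / p via multiply-high and shift */
        c[i] = d - q * p;                         /* remainder */
    }
}

libdivide also offers SIMD division routines (the comment above mentions its SSE2 path), so the same precomputed divider can be extended to whole vectors of products.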

2 Answers

10

First, I want to note that the modulo operation can be implemented using floating-point numbers:

d % p = d - int(float(d)/float(p))*p.

Although the right-hand side involves more operations than the left, it is preferable because it can be vectorized with SSE/AVX.
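As a plain scalar sketch of this identity (illustration only, not part of the original answer), with the reciprocal 1.0f/p precomputed outside the loop; as with the vector code below, the float quotient can round the wrong way for large products, so the result may then be off by p:

#include <stddef.h>

void FuncScalarFloat(const int * a, const int * b, size_t size, int p, int * c)
{
    float k = 1.0f / p;                  /* precomputed reciprocal of p */
    for (size_t i = 0; i < size; ++i)
    {
        int d = a[i] * b[i];
        int e = (int)((float)d * k);     /* int(float(d)/float(p)), truncated */
        c[i] = d - e * p;                /* d % p when the quotient is exact */
    }
}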

An implementation with SSE4.1 for 32x32 => 32-bit integer multiplication. Note that conversion from FP back to integer is done with round-to-nearest; use truncation toward zero (cvttps_epi32) if you want semantics like C float->integer conversions.

#include <smmintrin.h> // SSE4.1 intrinsics
#include <stddef.h>    // size_t

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    __m128 _k = _mm_set1_ps(1.0f / p);      // reciprocal of p
    __m128i _p = _mm_set1_epi32(p);
    for (size_t i = 0; i < size; i += 4)    // assumes size is a multiple of 4
    {
        __m128i _a = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i _b = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i _d = _mm_mullo_epi32(_a, _b);                              // d = a*b (low 32 bits)
        __m128i _e = _mm_cvtps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(_d), _k)); // e = int(float(d)/float(p))
        __m128i _c = _mm_sub_epi32(_d, _mm_mullo_epi32(_e, _p));           // c = d - e*p
        _mm_storeu_si128((__m128i*)(c + i), _c);
    }
}

An implementation using AVX2 (the 256-bit integer operations require AVX2):

#include <immintrin.h> // AVX2 intrinsics
#include <stddef.h>    // size_t

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    __m256 _k = _mm256_set1_ps(1.0f / p);   // reciprocal of p
    __m256i _p = _mm256_set1_epi32(p);
    for (size_t i = 0; i < size; i += 8)    // assumes size is a multiple of 8
    {
        __m256i _a = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i _b = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i _d = _mm256_mullo_epi32(_a, _b);                                    // d = a*b (low 32 bits)
        __m256i _e = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_cvtepi32_ps(_d), _k)); // e = int(float(d)/float(p))
        __m256i _c = _mm256_sub_epi32(_d, _mm256_mullo_epi32(_e, _p));              // c = d - e*p
        _mm256_storeu_si256((__m256i*)(c + i), _c);
    }
}
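Note that both loops assume size is a multiple of the vector width (4 or 8); leftover elements can be handled with a scalar tail such as the sketch below (not part of the original code). Also see the comments: float has only a 24-bit mantissa, so very large products can make the computed quotient, and hence the remainder, wrong.

// Hypothetical tail handling for the SSE4.1 version (vector width 4):
size_t vec_end = size - (size % 4);
// ... vectorized loop over [0, vec_end) as above ...
for (size_t i = vec_end; i < size; ++i)
    c[i] = (a[i] * b[i]) % p;        // scalar fallback for the last elements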
Peter Cordes
ErmIg
  • Possibly worth converting the integer inputs to float before multiplying, on CPUs where `pmulld` is 2 uops (e.g. HSW and SKL). But `cvtdq2ps` runs on the add unit, so it has limited throughput itself. Actually it's probably break-even on HSW/SKL. – Peter Cordes Oct 17 '17 at 13:42
  • @ErmIg Out of curiosity, did you perhaps do some benchmarking on the different implementations, if yes, how do they compare? – CJCombrink Oct 17 '17 at 14:42
  • Unfortunately I have just written this code. I haven't checked its performance. – ErmIg Oct 17 '17 at 14:51
  • 2
    Do you mean `_mm_mullo_epi32(_e, _p)` instead of ` _mm_mul_ps(_e, _p)`? Also, I guess you will have some wrong results for big `a`, `b`, `p`, since you lose 8 bit when converting to float (I did not actually test your code) – chtz Oct 18 '17 at 07:00
0

Actually there is an intrinsic that performs this operation: _mm256_irem_epi32

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_irem_epi32
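A sketch of how it could be used (illustration only; _mm256_irem_epi32 is an SVML library function, so this needs a compiler that provides SVML, such as the Intel compiler, plus AVX2 for the 256-bit integer multiply):

#include <immintrin.h>
#include <stddef.h>

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    __m256i _p = _mm256_set1_epi32(p);
    for (size_t i = 0; i < size; i += 8)             // assumes size is a multiple of 8
    {
        __m256i _a = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i _b = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i _d = _mm256_mullo_epi32(_a, _b);     // d = a*b (low 32 bits, AVX2)
        __m256i _r = _mm256_irem_epi32(_d, _p);      // element-wise d % p (SVML)
        _mm256_storeu_si256((__m256i*)(c + i), _r);
    }
}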

  • That's an Intel SVML library function, not a single machine instruction. Also, for a constant `p`, you'd want to optimize by finding a multiplicative inverse once and reusing it. (Like https://libdivide.com/) – Peter Cordes Dec 10 '20 at 13:52