3

I have a function:

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    for (size_t i = 0; i < size; ++i)
        c[i] = (a[i]*b[i])%p;
}

This function performs many modulo multiplications on arrays of integers. All the integers are positive, and I need to improve its performance.

I thought about SSE and AVX, but they don't have an operation to vectorize modulo multiplication. Or maybe I'm wrong?

Does anybody know of any possibility to solve this problem?

Michael
  • It's a prime integer. – Michael Oct 17 '17 at 12:58
  • 2
    Are you using the same `p` repeatedly? How large is `size` typically? If it's large enough or repeated often enough with the same `p`, it might even be worth JIT-compiling a loop using hard-coded vector shifts along with [the fixed-point multiplicative inverse](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi). Or use http://libdivide.com/ to use multiplicative inverses without JIT, but that has more overhead (`psrlq` with imm8 count is best). It might only have an SSE2 version, not AVX2 or AVX512, though. – Peter Cordes Oct 17 '17 at 13:35
  • 1
    Does `a[i]*b[i]` ever overflow? If yes, would that be ok, or do you want the result of the 64bit result mod `p`? – chtz Oct 18 '17 at 07:02
  • @chtz: I think we can assume that's the existing working+tested source, where signed overflow would be UB (or in practice would wrap at 32 bits). – Peter Cordes Oct 18 '17 at 15:30
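Following up on the multiplicative-inverse / libdivide suggestion in the comments above, here is a minimal scalar sketch using libdivide's C API (the function name FuncLibdivide is hypothetical, and it assumes the same p is reused for the whole array so the divider setup cost is paid only once):

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h" /* https://libdivide.com/ */

void FuncLibdivide(const int32_t * a, const int32_t * b, size_t size, int32_t p, int32_t * c)
{
    /* Precompute the multiplicative-inverse "magic numbers" for p once. */
    struct libdivide_s32_t fast_p = libdivide_s32_gen(p);
    for (size_t i = 0; i < size; ++i)
    {
        int32_t d = a[i] * b[i];
        int32_t q = libdivide_s32_do(d, &fast_p); /* d / p via multiply-high and shift */
        c[i] = d - q * p;                         /* remainder */
    }
}

libdivide also offers SIMD division routines (the comment above mentions its SSE2 path), so the same precomputed divider can be extended to whole vectors of products.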

2 Answers

10

First, I want to note that the modulo operation can be implemented using floating-point numbers:

d % p = d - int(float(d)/float(p))*p.

Although the right-hand side involves more operations than the left, it is preferable because it can be vectorized with SSE/AVX.
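As a plain scalar sketch of this identity (illustration only, not part of the original answer), with the reciprocal 1.0f/p precomputed outside the loop; as with the vector code below, the float quotient can round the wrong way for large products, so the result may then be off by p:

#include <stddef.h>

void FuncScalarFloat(const int * a, const int * b, size_t size, int p, int * c)
{
    float k = 1.0f / p;                  /* precomputed reciprocal of p */
    for (size_t i = 0; i < size; ++i)
    {
        int d = a[i] * b[i];
        int e = (int)((float)d * k);     /* int(float(d)/float(p)), truncated */
        c[i] = d - e * p;                /* d % p when the quotient is exact */
    }
}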

An implementation with SSE4.1 for 32x32 => 32-bit integer multiplication. Note that conversion from FP back to integer is done with round-to-nearest; use truncation toward zero (cvttps_epi32) if you want semantics like C float->integer conversions.

#include <smmintrin.h> // SSE4.1 intrinsics
#include <stddef.h>    // size_t

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    __m128 _k = _mm_set1_ps(1.0f / p);      // reciprocal of p
    __m128i _p = _mm_set1_epi32(p);
    for (size_t i = 0; i < size; i += 4)    // assumes size is a multiple of 4
    {
        __m128i _a = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i _b = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i _d = _mm_mullo_epi32(_a, _b);                              // d = a*b (low 32 bits)
        __m128i _e = _mm_cvtps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(_d), _k)); // e = int(float(d)/float(p))
        __m128i _c = _mm_sub_epi32(_d, _mm_mullo_epi32(_e, _p));           // c = d - e*p
        _mm_storeu_si128((__m128i*)(c + i), _c);
    }
}

An implementation using AVX2 (the 256-bit integer operations require AVX2):

#include <immintrin.h> // AVX2 intrinsics
#include <stddef.h>    // size_t

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    __m256 _k = _mm256_set1_ps(1.0f / p);   // reciprocal of p
    __m256i _p = _mm256_set1_epi32(p);
    for (size_t i = 0; i < size; i += 8)    // assumes size is a multiple of 8
    {
        __m256i _a = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i _b = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i _d = _mm256_mullo_epi32(_a, _b);                                    // d = a*b (low 32 bits)
        __m256i _e = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_cvtepi32_ps(_d), _k)); // e = int(float(d)/float(p))
        __m256i _c = _mm256_sub_epi32(_d, _mm256_mullo_epi32(_e, _p));              // c = d - e*p
        _mm256_storeu_si256((__m256i*)(c + i), _c);
    }
}
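Note that both loops assume size is a multiple of the vector width (4 or 8); leftover elements can be handled with a scalar tail such as the sketch below (not part of the original code). Also see the comments: float has only a 24-bit mantissa, so very large products can make the computed quotient, and hence the remainder, wrong.

// Hypothetical tail handling for the SSE4.1 version (vector width 4):
size_t vec_end = size - (size % 4);
// ... vectorized loop over [0, vec_end) as above ...
for (size_t i = vec_end; i < size; ++i)
    c[i] = (a[i] * b[i]) % p;        // scalar fallback for the last elements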
Peter Cordes
ErmIg
  • Possibly worth converting the integer inputs to float before multiplying, on CPUs where `pmulld` is 2 uops (e.g. HSW and SKL). But `cvtdq2ps` runs on the add unit, so it has limited throughput itself. Actually it's probably break-even on HSW/SKL. – Peter Cordes Oct 17 '17 at 13:42
  • @ErmIg Out of curiosity, did you perhaps do some benchmarking on the different implementations, if yes, how do they compare? – CJCombrink Oct 17 '17 at 14:42
  • Unfortunately I have just written this code. I haven't checked its performance. – ErmIg Oct 17 '17 at 14:51
  • 2
    Do you mean `_mm_mullo_epi32(_e, _p)` instead of ` _mm_mul_ps(_e, _p)`? Also, I guess you will have some wrong results for big `a`, `b`, `p`, since you lose 8 bit when converting to float (I did not actually test your code) – chtz Oct 18 '17 at 07:00
0

Actually there is an intrinsic that performs this operation: _mm256_irem_epi32

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_irem_epi32
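A sketch of how it could be used (illustration only; _mm256_irem_epi32 is an SVML library function, so this needs a compiler that provides SVML, such as the Intel compiler, plus AVX2 for the 256-bit integer multiply):

#include <immintrin.h>
#include <stddef.h>

void Func(const int * a, const int * b, size_t size, int p, int * c)
{
    __m256i _p = _mm256_set1_epi32(p);
    for (size_t i = 0; i < size; i += 8)             // assumes size is a multiple of 8
    {
        __m256i _a = _mm256_loadu_si256((const __m256i*)(a + i));
        __m256i _b = _mm256_loadu_si256((const __m256i*)(b + i));
        __m256i _d = _mm256_mullo_epi32(_a, _b);     // d = a*b (low 32 bits, AVX2)
        __m256i _r = _mm256_irem_epi32(_d, _p);      // element-wise d % p (SVML)
        _mm256_storeu_si256((__m256i*)(c + i), _r);
    }
}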

  • That's an Intel SVML library function, not a single machine instruction. Also, for a constant `p`, you'd want to optimize by finding a multiplicative inverse once and reusing it. (Like https://libdivide.com/) – Peter Cordes Dec 10 '20 at 13:52