I want to divide an AVX2 vector by a constant. I have visited this question and many other pages. I saw something that might help (fixed-point arithmetic), but I didn't understand it. The problem is that this division is the bottleneck. I tried two ways:

First, casting to float and doing the operation with AVX instructions:

//outside the bottleneck:
__m256i veci16; // contains 16 x 16-bit integer numbers
__m256 div_v = _mm256_set1_ps(div);
__m256 vecps;

//inside the bottleneck
//some calculations which produce veci16
vecps = _mm256_castsi256_ps(veci16);  // note: the cast only reinterprets the bits, it is not a conversion
vecps = _mm256_div_ps(vecps, div_v);
veci16 = _mm256_castps_si256(vecps);
_mm256_storeu_si256((__m256i *)&output[i][j], veci16);

The problem with the first method is that without the division the elapsed time is 5 ns, and with it the elapsed time is about 60 ns.

Second, I stored the vector to an array, divided the elements one by one, and loaded it back, like this:

alignas(32) int16_t t[16]; // 32-byte-aligned buffer of 16-bit elements
inline __m256i _mm256_div_epi16 (__m256i a , int b){

    _mm256_store_si256((__m256i *)&t[0] , a);
    t[0]/=b; t[1]/=b; t[2]/=b; t[3]/=b; t[4]/=b; t[5]/=b; t[6]/=b; t[7]/=b;
    t[8]/=b; t[9]/=b; t[10]/=b; t[11]/=b; t[12]/=b; t[13]/=b; t[14]/=b; t[15]/=b;
    return _mm256_load_si256((__m256i *)&t[0]);
}

Well, it was better, but the elapsed time is still 17 ns. The surrounding calculations are too extensive to show here.

The question is: is there a faster way to implement this inline function?

Amiri

1 Answer

You can do this with _mm256_mulhrs_epi16. This does a fixed-point multiply, so you just set the multiplicand vector to 32768 / b:

inline __m256i _mm256_div_epi16 (const __m256i va, const int b)
{
    // 32768/b is 1/b in Q15 fixed point (32768 represents +1.0)
    __m256i vb = _mm256_set1_epi16(32768 / b);
    return _mm256_mulhrs_epi16(va, vb);
}

Note that this assumes b > 1 (32768 itself does not fit in a signed 16-bit element).
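
For reference, a minimal usage sketch of this approach; the `div_epi16_by_const` name, the test values, and the `main` harness are illustrative, not part of the original answer. Compile with AVX2 enabled (e.g. -mavx2 on GCC/Clang).

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

static inline __m256i div_epi16_by_const(__m256i a, int b)
{
    // 32768/b is roughly 1/b in Q15 fixed point; mulhrs multiplies,
    // then shifts right by 15 with rounding, giving an approximate a/b.
    __m256i vb = _mm256_set1_epi16((short)(32768 / b));
    return _mm256_mulhrs_epi16(a, vb);
}

int main(void)
{
    int16_t in[16], out[16];
    for (int i = 0; i < 16; i++) in[i] = (int16_t)(i * 100);

    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i q = div_epi16_by_const(v, 7);
    _mm256_storeu_si256((__m256i *)out, q);

    // The result rounds to nearest, so it can differ by 1 from the
    // truncating integer division.
    for (int i = 0; i < 16; i++)
        printf("%d / 7 ~= %d (exact %d)\n", in[i], out[i], in[i] / 7);
    return 0;
}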

Paul R
  • This answer was amazing; there is no violation, and the elapsed time is 5 ns. – Amiri Feb 24 '17 at 15:42
  • If you need the exact result for all inputs (instead of this fast approximation with rounding error), you can use `_mm256_mulhi_epi16` and a shift, using [the same multiplicative inverse trick that compilers use](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-). In fact, you can get gcc to do that for you on a `__m256i`, with GNU C native vectors; see [`vec_store_digit_and_space` in this answer](https://unix.stackexchange.com/questions/323845/whats-the-fastest-way-to-generate-a-1-gb-text-file-containing-random-digits/324520#324520) and the sketch after this comment thread. – Peter Cordes Mar 23 '18 at 04:04
  • But those techniques require significant calculation to get the constant and shift count for a given divisor, so they are most efficient with compile-time-constant divisors. For divisors that are reused multiple times (e.g. runtime-variable but loop-invariant), see http://libdivide.com/ – Peter Cordes Mar 23 '18 at 04:08
  • Can you explain why you use `32768`? As stated here https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3728&text=_mm256_mulhrs_epi16: `tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1`. We can rewrite it as `result = ((a*b) >> 14) + 1 = (a*b)/16384 + 1`. So for the division `a/b` we get `result = ((a*32768/b) >> 14) + 1 = (a*32768/b)/16384 + 1 = (a*2/b) + 1`, but it isn't equal to `a/b`. – Alex Apr 27 '18 at 18:34
  • @Alex: you missed the part where it says (on the next line): `dst[i+15:i] := tmp[16:1]`, which, as well as extracting the required bits, also contains a shift of one more bit (this is part of the rounding operation). Note that for signed 16-bit fixed point, 32768 = +1.0. – Paul R Apr 27 '18 at 18:38
  • Thanks! Yes, I missed the 2nd line. Also, it seems there is such a function, `__m256i _mm256_div_epi16 (__m256i a, __m256i b)`, for AVX in `immintrin.h`, but I cannot find it in MSVC 2015: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3728,2052&text=_mm256_div_epi16 – Alex Apr 27 '18 at 19:02
  • That's an SVML library function rather than an intrinsic; you would need to be using Intel's ICC compiler to take advantage of it. And of course it will be much slower, but possibly more accurate than the above solution. – Paul R Apr 27 '18 at 19:34
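
For completeness, a minimal sketch of the GNU C native-vector approach Peter Cordes mentions above, assuming GCC or Clang (this extension is not available in MSVC). The `v16i16` typedef and the function name are illustrative, and the divisor must be a compile-time constant for the compiler to emit the exact multiply-high + shift sequence:

#include <immintrin.h>
#include <stdint.h>

// GNU C native vector of 16 x int16_t, the same size as __m256i.
typedef int16_t v16i16 __attribute__((vector_size(32)));

static inline __m256i div16_exact_by10(__m256i a)
{
    // Casting between vector types of the same size just reinterprets the bits.
    // With a constant divisor, GCC/Clang replace the division with a
    // multiplicative-inverse sequence (exact result, truncation toward zero).
    v16i16 v = (v16i16)a;
    v = v / 10;
    return (__m256i)v;
}

For divisors that are only known at run time, libdivide (linked above) precomputes the equivalent multiplier and shift once so they can be reused across many divisions.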