AVX __m256i integer division for signed 32-bit elements

Question

I am trying to do a SIMD division on an AVX machine and am getting a compilation error.

Here is my code:

    __m256i  help;
    int arr[8];
    int arr2[8];
    help = _mm256_load_si256((__m256i*)arr);
    __m256i temp;
    temp = _mm256_load_si256((__m256i*)arr2);
    __m256i result;
    _mm256_div_ps(temp,help);

And here is the error:

error: cannot convert ‘__m256i {aka __vector(4) long long int}’ to ‘__m256 {aka __vector(8) float}’ for argument ‘1’ to ‘__m256 _mm256_div_ps(__m256, __m256)’ _mm256_div_ps(temp,help);

`_mm256_div_ps`, as the `ps` in the name suggests, divides **P**acked **S**ingle-precision floats, not integers. If you want to approximately divide the integers, convert them to float, divide them, and convert the result back. (For a better result, convert to double; you then need to split the array into two halves.) — chtz, Feb 26 '19 at 16:43
And in case your compiler supports SVML, you could use [`_mm256_div_epi32`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SVML&cats=Arithmetic&expand=1437,2101). — chtz, Feb 26 '19 at 16:57
It sadly doesn't support SVML, but if I wanted integer division, could I just do float division and cast the result to an integer? Would that give me the correct result in all cases? — OgiciBumKacar, Feb 26 '19 at 20:47
If you do `float` (i.e., single-precision) division, you will get only an approximation. With `double` you should get exact results, but calculation will take about twice as long. — chtz, Feb 26 '19 at 21:44
No one seems to be explicitly mentioning it, but intel doesn't support any native simd integer division. It needs to be emulated in some way, either by falling all the way back to idiv, or by casting to and from a floating point type — Steve Cox, Feb 26 '19 at 21:51
@SteveCox: *converting* not *casting* to float. Treating integers as FP bit patterns is not useful for `divps`. Also, those aren't the only options. For repeated division by the same value, a multiplicative inverse for multiply/shift is useful. http://libdivide.com/ — Peter Cordes, Feb 27 '19 at 10:42
@VictorGubin: Can you link an example on https://godbolt.org/ where a compiler is able to auto-vectorize a loop containing integer division by a runtime variable? https://godbolt.org/z/4wAlbz shows ICC will auto-vectorize with an SVML function, but the other 3 x86 compilers just go scalar. — Peter Cordes, Feb 27 '19 at 20:14
@Peter Cordes Strange but works only with Clang :) (I think rest of the compilers will work locally as expected if openmp installed). Command line options for clang `-O3 -march=skylake -fopenmp` and the method signature `void divarr(unsigned int* __restrict__ arr,const int divisor)`. Don't forget to add `-fopenmp` compiler option, otherwise it will not be used. — Victor Gubin, Feb 27 '19 at 21:36
@VictorGubin: Ok yes, clang trunk makes different code, but it does a vector load and then scalar extract / insert to use `div`! That doesn't really count as vectorization, just a missed optimization in this case (because there's no other work using `arr[i]` in the loop). — Peter Cordes, Feb 27 '19 at 21:45
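
To make the float-versus-double point from the comments concrete, here is a small scalar sketch. The helper names `div_via_float` and `div_via_double` are invented for this illustration and do not come from any library. Both assume a nonzero divisor and a quotient that fits in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: integer division routed through single precision.
// float has a 24-bit significand, so int32 values above 2^24 in magnitude
// may already be rounded by the conversion, before the division happens.
std::int32_t div_via_float(std::int32_t a, std::int32_t b) {
    assert(b != 0);
    return static_cast<std::int32_t>(static_cast<float>(a) / static_cast<float>(b));
}

// Hypothetical helper: the same idea routed through double precision.
// double has a 53-bit significand, so every int32 converts exactly.
std::int32_t div_via_double(std::int32_t a, std::int32_t b) {
    assert(b != 0);
    return static_cast<std::int32_t>(static_cast<double>(a) / static_cast<double>(b));
}
```

With the default rounding mode, `div_via_float(16777217, 1)` gives 16777216, because 2^24 + 1 is not representable as a float, while `div_via_double(16777217, 1)` gives 16777217. The SIMD versions (`cvtdq2ps`/`divps` versus `cvtdq2pd`/`divpd`) should show the same behaviour in each lane.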

Maxim Egorushkin · Answer 1 · 2019-02-27T11:11:15.547

3

I would suggest using the Vc library ("portable, zero-overhead C++ types for explicitly data-parallel programming") for SIMD. I hear it is targeted for inclusion in the C++ standard. Code written with it is easier to write and easier to read.

Example:

    #include <iostream>
    #include <Vc/Vc>

    int main() {
        using A = Vc::SimdArray<int, 8>;
        A arr1 = A::Random();
        A arr2 = A::Random();
        std::cout << arr1 << '\n';
        std::cout << arr2 << '\n';
        std::cout << arr1 / arr2 << '\n';
    }

Outputs:

    <1513634383 -963914658 1763536262 -1285037745 | -695608406 -35372374 1025922083 444041308>
    <824703811 1962744590 1568022524 -293901648 | 549806324 248334095 1663905340 641164273>
    [1, 0, 1, 4, -1, 0, 0, 0]

The following function

    #include <Vc/Vc>

    using A = Vc::SimdArray<int, 8>;

    // noinline keeps the function body visible as a separate symbol in the assembly.
    __attribute__((noinline)) A f(A a0, A a1) {
        return a0 / a1;
    }

With g++-8.2 -O3 -march=skylake translates into the following assembly:

    f(Vc_1::SimdArray<int, 8ul, Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ul>, Vc_1::SimdArray<int, 8ul, Vc_1::Vector<int, Vc_1::VectorAbi::Avx>, 8ul>):
        vcvtdq2pd   ymm3, xmm1
        vcvtdq2pd   ymm2, xmm0
        vextracti128    xmm1, ymm1, 0x1
        vextracti128    xmm0, ymm0, 0x1
        vcvtdq2pd   ymm1, xmm1
        vdivpd  ymm2, ymm2, ymm3
        vcvtdq2pd   ymm0, xmm0
        vdivpd  ymm0, ymm0, ymm1
        vcvttpd2dq  xmm2, ymm2
        vcvttpd2dq  xmm0, ymm0
        vinserti128 ymm0, ymm2, xmm0, 0x1
        ret

Note that the x86 instruction set has no SIMD instructions for integer division.
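
For a hand-written version without Vc, the assembly above maps fairly directly onto intrinsics. The following is a minimal sketch, not a drop-in implementation. The function name `div8_epi32_via_pd` is invented for this illustration, and the `target` attribute is a GCC/Clang extension (with MSVC or with `-mavx` it can be dropped). The code loads the two 128-bit halves separately instead of using `vextracti128`/`vinserti128`, so it should need only AVX rather than AVX2.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Hypothetical helper: out[i] = a[i] / b[i] for 8 signed 32-bit lanes,
// truncating toward zero like scalar integer division.
// Assumption: the CPU supports AVX at run time.
__attribute__((target("avx")))
void div8_epi32_via_pd(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    assert(a && b && out);

    // Unaligned 128-bit loads, so plain int arrays work without alignas(32).
    __m128i a_lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i a_hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 4));
    __m128i b_lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i b_hi = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + 4));

    // Widen 4 x int32 to 4 x double (vcvtdq2pd); this conversion is exact.
    __m256d q_lo = _mm256_div_pd(_mm256_cvtepi32_pd(a_lo), _mm256_cvtepi32_pd(b_lo));
    __m256d q_hi = _mm256_div_pd(_mm256_cvtepi32_pd(a_hi), _mm256_cvtepi32_pd(b_hi));

    // Convert back with truncation (vcvttpd2dq) and store both halves.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm256_cvttpd_epi32(q_lo));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4), _mm256_cvttpd_epi32(q_hi));
}
```

For the question's arrays, a call such as `div8_epi32_via_pd(arr2, arr, out)` would take the place of the `_mm256_div_ps(temp, help)` line. Because every 32-bit integer is exactly representable as a double, the truncated quotient should match scalar `/` wherever the scalar result is defined. As far as I know, division by zero and `INT_MIN / -1` do not trap on this path; those lanes come back as `0x80000000`.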

edited Feb 27 '19 at 11:11

answered Feb 26 '19 at 18:15

Nice and easy. It would be nice to see (in the answer) what instructions this generates. Also, a 256bit register can hold 8 int32 (but it would be confusing, IMO, if the IO-format depended on the target architecture) – chtz Feb 26 '19 at 19:51
Thank you very much, I will try using this in the future. But the above question is for homework, and the teacher wants me to use the "normal" way. Do you know how I can fix the above case? – OgiciBumKacar Feb 26 '19 at 20:45
@OgiciBumKacar If you need to write this "manually", just look up the intrinsic to every instruction. If this is homework, make sure to properly cite this answer. – chtz Feb 26 '19 at 21:47
@Maxim Generated code looks nice. Thanks for showing that library! – chtz Feb 26 '19 at 21:49
I’ve made another zero-overhead library with different design goals: https://github.com/Const-me/IntelIntrinsics This one doesn’t try to encapsulate these vectors into classes, only provides wrappers for intrinsics. Integration with other libraries is easier this way, e.g. quite often I combine it with Microsoft’s DirectXMath. – Soonts Feb 26 '19 at 23:58
It does pack 8 int32 into one 256bit register. But it has to split the register when converting to `double` (and when rejoining the result), because of course only 4 doubles fit into one 256bit register (that's what the `vextracti128` and `vinserti128` instructions are doing) – chtz Feb 27 '19 at 12:32
@chtz You are quite right, it does pass the 2 arrays in `ymm0` and `ymm1` correspondingly. – Maxim Egorushkin Feb 27 '19 at 12:38
@MaximEgorushkin Documentation is 99% the same as for the Intel intrinsics. My library does very little besides stripping prefixes into namespaces; it is in fact an auto-generated library. Here's docs for Intel: https://github.com/Const-me/IntelIntrinsics/releases/download/3.4.1b/Intrinsics.3.4.1.chm Examples https://github.com/Const-me/IntelIntrinsics/tree/master/CppDemo – Soonts Feb 27 '19 at 14:13
