
I am trying to run SIMD instructions over the data types int, float and double. I need multiply, add and load operations.

For float and double I successfully managed to make those instructions work:

_mm256_add_ps, _mm256_mul_ps and _mm256_load_ps (with the *pd suffix for double). (A direct FMADD operation isn't supported.)

But for integers I couldn't find a working instruction. Everything shown in the Intel AVX manual gives a similar error with GCC 4.7, like "‘_mm256_mul_epu32’ was not declared in this scope".

For loading integers I use _mm256_set_epi32, and GCC accepts that. I don't know why the other instructions aren't defined. Do I need to update something?

I am including all of these headers: <pmmintrin.h>, <immintrin.h> and <x86intrin.h>.

My processor is an Intel Core i5 3570K (Ivy Bridge).



256-bit integer operations were only added in AVX2, so if you only have AVX1 you'll have to use 128-bit `__m128i` vectors for integer intrinsics.

AVX1 does have integer loads/stores, and intrinsics like _mm256_set_epi32 can be implemented with FP shuffles or a simple load of a compile-time constant.

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2

Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions,[2] is an expansion of the AVX instruction set introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:

  • expansion of most vector integer SSE and AVX instructions to 256 bits
  • three-operand general-purpose bit manipulation and multiply
  • three-operand fused multiply-accumulate support (FMA3)
  • Gather support, enabling vector elements to be loaded from non-contiguous memory locations
  • DWORD- and QWORD-granularity any-to-any permutes
  • vector shifts.

FMA3 is actually a separate feature; AMD Piledriver/Steamroller have it but not AVX2.

Nevertheless, if the int value range fits in 24 bits then you can use float instead. However, note that if you need the exact result or the low bits of the result, then you'll have to convert float to double, because a 24x24-bit multiplication produces a 48-bit result which can only be stored exactly in a double. At that point you still only have 4 elements per vector, and might have been better off with XMM vectors of int32. (But note that FMA throughput is typically better than integer multiply throughput.)

AVX1 has VEX encodings of 128-bit integer operations so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls. (In C you generally don't have to worry about that; your compiler will take care of using vzeroupper where needed.)

You could try to simulate integer addition with AVX bitwise instructions like VANDPS and VXORPS, but without a bitwise left shift for ymm vectors it can't work.

If you're sure FTZ / DAZ are not set, you can use small integers as denormal / subnormal float values, where the bits outside the mantissa are all zero. Then FP addition and integer addition are the same bitwise operation. (And VADDPS doesn't need a microcode assist on Intel hardware when the inputs and result are both denormal.)

  • Addition [needs the carry-out left shifted by 1 bit](https://stackoverflow.com/questions/9070937/adding-two-numbers-without-operator-clarification), so even with a LOT of VANDPS and VXORPS YMM instructions you still can't simulate `vpaddb`. – Peter Cordes Jun 16 '18 at 12:31