256-bit integer operations were only added in AVX2, so if you only have AVX1 you'll have to use 128-bit __m128i
vectors for integer intrinsics.
AVX1 does have 256-bit integer loads/stores, and intrinsics like _mm256_set_epi32
can be implemented with FP shuffles or a simple load of a compile-time constant.
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions,[2] is an expansion of the AVX instruction set introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
- expansion of most vector integer SSE and AVX instructions to 256 bits
- three-operand general-purpose bit manipulation and multiply
- three-operand fused multiply-accumulate support (FMA3)
- Gather support, enabling vector elements to be loaded from non-contiguous memory locations
- DWORD- and QWORD-granularity any-to-any permutes
- vector shifts.
FMA3 is actually a separate feature; AMD Piledriver/Steamroller have it but not AVX2.
Nevertheless, if the int value range fits in 24 bits then you can use float instead. However, note that if you need the exact result or the low bits of the result then you'll have to convert float to double, because a 24x24 multiplication will produce a 48-bit result which can only be stored exactly in a double. At that point you still only have 4 elements per vector, and might have been better off with XMM vectors of int32. (But note that FMA throughput is typically better than integer multiply throughput.)
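To make the precision argument concrete, here is a small scalar sketch (not SIMD code) that multiplies two 24-bit values through float and through double. The function names mul24_via_float and mul24_via_double are made up for this illustration.

```c
#include <stdint.h>

/* Exact: the inputs fit in 24 bits, so the product needs at most 48 bits,
   which fits in the 53-bit significand of a double. */
int64_t mul24_via_double(int32_t a, int32_t b)
{
    return (int64_t)((double)a * (double)b);
}

/* Generally inexact: a float has only a 24-bit significand, so the low
   bits of a 48-bit product are rounded away. */
int64_t mul24_via_float(int32_t a, int32_t b)
{
    float p = (float)a * (float)b;
    return (int64_t)p;
}
```

For example, with a = b = 16777215 (2^24 - 1) the exact product is 281474943156225, which the double version returns. Assuming float arithmetic is evaluated in single precision, the float version rounds to the nearest representable value, 281474943156224, losing the low bit.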
AVX1 has VEX encodings of 128-bit integer operations so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls. (In C you generally don't have to worry about that; your compiler will take care of using vzeroupper where needed.)
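As a rough sketch of what that looks like with only AVX1 available, the function below adds eight 32-bit integers by splitting each 256-bit vector into two 128-bit halves, adding the halves with the 128-bit integer intrinsic, and recombining them. The function name add_epi32_avx1 is invented for this example, and the target attribute is a GCC/Clang extension used here so the snippet compiles without -mavx; with MSVC or other compilers you would use the appropriate compiler option instead.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for 8 x int32, AVX1 only.
   The target attribute (GCC/Clang) enables AVX for this function alone. */
__attribute__((target("avx")))
void add_epi32_avx1(const int32_t *a, const int32_t *b, int32_t *out)
{
    /* 256-bit integer loads are available in AVX1. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Low halves: the cast is free, no instruction is needed. */
    __m128i lo = _mm_add_epi32(_mm256_castsi256_si128(va),
                               _mm256_castsi256_si128(vb));

    /* High halves: extracted with vextractf128. */
    __m128i hi = _mm_add_epi32(_mm256_extractf128_si256(va, 1),
                               _mm256_extractf128_si256(vb, 1));

    /* Recombine with vinsertf128 and store as one 256-bit vector. */
    __m256i sum = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_si256((__m256i *)out, sum);
}
```

If the data is only ever used for integer work, it is usually simpler to load and store 128 bits at a time and skip the extract/insert steps entirely; the split is mainly useful when the same vectors also feed 256-bit FP code.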
You could try to simulate an integer addition with AVX bitwise instructions like VANDPS and VXORPS, but without a bitwise left shift for ymm vectors (needed to propagate the carries) it won't work.
If you're sure FTZ / DAZ are not set, you can use small integers as denormal / subnormal float values, where the bits outside the mantissa are all zero. Then FP addition and integer addition are the same bitwise operation. (And VADDPS doesn't need a microcode assist on Intel hardware when the inputs and result are both denormal.)
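A minimal sketch of the denormal trick, assuming FTZ and DAZ are clear (the default unless something like -ffast-math set them) and that all inputs and sums are non-negative and below 2^23, so every bit pattern involved is a denormal float. The function name add_small_ints_as_denormals is hypothetical, and the target attribute is a GCC/Clang extension.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for 8 small non-negative ints,
   computed with a single 256-bit VADDPS.
   Assumes 0 <= a[i], b[i] and a[i] + b[i] < 2^23, and FTZ/DAZ not set. */
__attribute__((target("avx")))
void add_small_ints_as_denormals(const int32_t *a, const int32_t *b,
                                 int32_t *out)
{
    /* Reinterpret the integer bits as floats; no conversion happens. */
    __m256 fa = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)a));
    __m256 fb = _mm256_castsi256_ps(_mm256_loadu_si256((const __m256i *)b));

    /* Denormal n represents n * 2^-149, so adding two denormals adds
       their mantissa fields exactly like an integer add. */
    __m256 sum = _mm256_add_ps(fa, fb);

    /* Reinterpret the result bits back as integers. */
    _mm256_storeu_si256((__m256i *)out, _mm256_castps_si256(sum));
}
```

This only covers addition of small non-negative values; if FTZ or DAZ is enabled anywhere in the process, the inputs or results are flushed to zero and the trick silently produces wrong answers.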