I'm looking for the fastest way to divide an __m256i
of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2.
As far as I know, my options are:
- Drop down to SSE2
- Something like AVX __m256i integer division for signed 32-bit elements
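
For reference, here is a minimal sketch of what I understand the first option to look like: split the 256-bit vector into two 128-bit halves, shift each half with SSE2, and recombine. The function names (`srai1_epi32_avx1`, `div2_epi32_avx1`, `shift_right1_i32x8`, `div2_i32x8`) and the `AVX_FN` macro are made up for illustration, and the `target("avx")` attribute assumes GCC or Clang. Note that for signed elements an arithmetic shift rounds toward negative infinity, while C's `x / 2` rounds toward zero, so the sketch shows both variants. It is not benchmarked.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX for these functions only, so the file
   builds without -mavx. With other compilers, enable AVX globally instead. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Arithmetic shift right by 1 on eight packed int32, i.e. floor(x / 2). */
AVX_FN static inline __m256i srai1_epi32_avx1(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);      /* low half, cast only */
    __m128i hi = _mm256_extractf128_si256(v, 1); /* high half */
    lo = _mm_srai_epi32(lo, 1);
    hi = _mm_srai_epi32(hi, 1);
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

/* Truncating division by 2 (matches C's x / 2): add the sign bit
   before shifting so that negative odd values round toward zero. */
AVX_FN static inline __m256i div2_epi32_avx1(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extractf128_si256(v, 1);
    lo = _mm_srai_epi32(_mm_add_epi32(lo, _mm_srli_epi32(lo, 31)), 1);
    hi = _mm_srai_epi32(_mm_add_epi32(hi, _mm_srli_epi32(hi, 31)), 1);
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

/* Pointer-based wrappers: process 8 int32 values, unaligned access allowed. */
AVX_FN void shift_right1_i32x8(const int32_t *src, int32_t *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, srai1_epi32_avx1(v));
}

AVX_FN void div2_i32x8(const int32_t *src, int32_t *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    _mm256_storeu_si256((__m256i *)dst, div2_epi32_avx1(v));
}
```

If the elements were unsigned, `_mm_srli_epi32` would replace `_mm_srai_epi32` and the sign-bit correction would not be needed.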
If I have to drop down to SSE2, I'd appreciate the best SSE2 implementation. If the second option is the way to go, I'd like to know which intrinsics to use, and whether there's a more optimized implementation specifically for dividing by 2. Thanks!