
I would like to vectorize an equality test in which all elements in a vector are compared against the same value, and the results are written to an array of 8-bit words. Each 8-bit word in the resulting array should be zero or one. (This is a little wasteful, but bit-packing the booleans is not an important detail in this problem.) This function can be written as:

#include <stdint.h>

void vecEq (uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
  for(int i = 0; i < len; i++) {
    results[i] = numbers[i] == target;
  }
}

If we knew that both vectors were 256-bit aligned, we could start by broadcasting target into an AVX register and then using the AVX2 intrinsic _mm256_cmpeq_epi8 to perform 32 equality tests at a time. However, in the setting I'm working in, both numbers and results have been allocated by a runtime (the GHC runtime, but this is irrelevant). They are both guaranteed to be 64-bit aligned. Is there any way to vectorize this operation, preferably without using AVX registers?

The approach I've considered is broadcasting the 8-bit word into a 64-bit word up front and then XORing it with 8 elements at a time. This doesn't work, though, because I cannot find a vectorized way to convert the result of the XOR (zero means equal, anything else means unequal) into the equality-test result I need (0 means unequal, 1 means equal, nothing else should ever exist). Roughly, the sketch I have is:

void vecEq (uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
  uint64_t targetA = (uint64_t)target;
  uint64_t targetB = targetA<<56 | targetA<<48 | targetA<<40 | targetA<<32 | targetA<<24 | targetA<<16 | targetA<<8 | targetA;
  for(int i = 0; i < len; i++) {
    uint64_t tmp = numbers[i] ^ targetB;
    results[i] = ... something with tmp ...;
  }
}
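
For what it's worth, a classic SWAR bit trick does exist for exactly this "which bytes are zero" step. The sketch below shows one way it could slot into the loop above; the helper name bytesEqZero, the multiply-based broadcast, and the assumption that len counts 64-bit words are my additions, not part of the original question:

#include <stdint.h>

/* Sketch only: maps each byte of x to 0x01 if that byte is zero, else 0x00.
   Adding 0x7F to the low 7 bits of each byte sets that byte's high bit iff
   those low bits are nonzero; OR-ing in x itself catches bytes whose only
   set bit is the high bit; inverting and shifting down leaves 1 per zero byte. */
static inline uint64_t bytesEqZero (uint64_t x) {
  uint64_t t = (x & 0x7F7F7F7F7F7F7F7FULL) + 0x7F7F7F7F7F7F7F7FULL;
  t = ~(t | x | 0x7F7F7F7F7F7F7F7FULL);  /* high bit set only in bytes that were zero */
  return t >> 7;                         /* 0x01 per zero byte, 0x00 otherwise */
}

/* Here len counts 64-bit words, i.e. the byte length divided by 8. */
void vecEq64 (const uint64_t* numbers, uint64_t* results, int len, uint8_t target) {
  uint64_t targetB = (uint64_t)target * 0x0101010101010101ULL;  /* same broadcast as the shifts above */
  for (int i = 0; i < len; i++) {
    results[i] = bytesEqZero(numbers[i] ^ targetB);
  }
}
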
  • AVX doesn't support unaligned loads like SSE does? – Shawn Jan 05 '20 at 00:35
  • What compiler are you using and what platform are you targeting? Current versions of all three major compilers already vectorize your code as is: https://godbolt.org/z/-p_MxP… – Michael Kenzel Jan 05 '20 at 00:56
  • You can cast `uint64_t*`s to `uint8_t*`s and do the original loop, but that doesn’t change the generated (already vectorized, like Michael Kenzel said) code when I try it. Make sure to specify `restrict` if applicable to your case, though (i.e. if `numbers` and `results` can’t overlap). – Ry- Jan 05 '20 at 01:13

1 Answer


Further to the comments above (the code will vectorise just fine): if you are using AVX, the best strategy is usually just to use the unaligned load/store intrinsics. They have no extra cost if your data does happen to be aligned, and are as cheap as the HW can make them when it is misaligned. (On Intel CPUs, there's still a penalty for loads/stores that span two cache lines, aka a cache-line split.)

Ideally you can still align your buffers by 32, but if your data has to come from L2 or L3 or RAM, misalignment often doesn't make a measurable difference. And the best strategy for dealing with possible misalignment is usually just to let the HW handle it, instead of running a scalar loop up to an alignment boundary like you might with SSE, or with AVX512, where alignment matters again (with 512-bit vectors, any misalignment means every load/store is a cache-line split).

Just use _mm256_loadu_si256 / _mm256_storeu_si256 and forget about it.
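
For instance, here is a minimal sketch of the loop from the question using the unaligned intrinsics (the function name, the AND that turns the 0xFF compare mask into 0/1 bytes, and the scalar tail for lengths that aren't a multiple of 32 are my additions; build with the -march flags discussed below):

#include <immintrin.h>
#include <stdint.h>

/* 32 bytes per iteration; _mm256_cmpeq_epi8 gives 0xFF per equal byte,
   so AND with a vector of 1s to get the 0/1 bytes the question wants. */
void vecEqAvx2 (const uint8_t* numbers, uint8_t* results, int len, uint8_t target) {
  __m256i vtarget = _mm256_set1_epi8((char)target);   /* broadcast target to all 32 lanes */
  __m256i vone    = _mm256_set1_epi8(1);
  int i = 0;
  for (; i + 32 <= len; i += 32) {
    __m256i chunk = _mm256_loadu_si256((const __m256i*)(numbers + i));  /* no alignment required */
    __m256i mask  = _mm256_cmpeq_epi8(chunk, vtarget);                  /* 0xFF / 0x00 per byte */
    _mm256_storeu_si256((__m256i*)(results + i), _mm256_and_si256(mask, vone));
  }
  for (; i < len; i++)   /* 0..31 leftover bytes */
    results[i] = (numbers[i] == target);
}
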

As an interesting aside, Visual C++ will no longer emit aligned loads or stores, even if you request them (it emits e.g. vmovups instead of vmovaps): https://godbolt.org/z/pL9nw9

If compiling with GCC, you probably want to use -march=haswell or -march=znver1, not just -mavx2, or at least -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store, so that 256-bit unaligned loads/stores compile to single instructions. The CPUs that benefit from those tune=generic defaults (for example Sandybridge and Piledriver) don't support AVX2 anyway.

robthebloke
  • Technically, whether there is a performance impact of unaligned loads is a feature of the processor model, not of AVX. And while unaligned loads often execute at the same pace as aligned loads, I believe (from memory from some time ago) they may nonetheless use more processor resources, and this can have visible performance effects in certain workloads. (Definitely there are scenarios where this is true—traversing arrays column-wise will sometimes use twice as many cache lines for unaligned loads as for aligned loads, and that can seriously impair performance.) – Eric Postpischil Jan 06 '20 at 01:05
  • Cache-line split loads are still somewhat more expensive! Unaligned loads *on aligned data* have no extra cost. So ideally you do align your buffers by 32. But anyway, yes the best strategy here is still `_mm256_loadu_si256` even when you can't guarantee alignment. See also [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](//stackoverflow.com/q/52626726) - make sure you use `gcc -march=haswell` not just `-mavx2` or it will optimize for Sandybridge / Bulldozer and split `_mm256_loadu_si256` into `vmovdqu xmm` / `vinserti128`, even though SnB can't run AVX2 code. – Peter Cordes Jan 06 '20 at 11:55
  • @EricPostpischil: On Intel CPUs, there's literally zero extra cost for misaligned as long as you don't span a cache-line boundary. But when that is the case, the load uop has to get replayed to load from the other cache line. So besides the cache footprint and extra latency, it uses more back-end cycles on the load/store ports. ([How can I accurately benchmark unaligned access speed on x86\_64](//stackoverflow.com/q/45128763) summarizes some of the effects; they're the same for scalar and SIMD loads). – Peter Cordes Jan 06 '20 at 12:09