How to write code that compiler can optimize to SIMD compare?

Question

std::array<int, 4> a = {1, 1, 1, 1};
std::array<int, 4> b = { 1, 2, 3, 4 };
std::array<int, 4> c;
bool res = false;
for (int i = 0; i < a.size(); i++) {
    a[i] = rand() % 10;
}

for (int i = 0; i < 4; i++) {
    c[i] = a[i] + b[i];
}

Smart compiler can compile above to SIMD well. But how to write the compare one like below can compile to SIMD well too;

res = a[0] <= b[0] && a[1] <= b[1] && a[2] <= b[2] && a[3] <= b[3]; // not compile to SIMD

maybe [`__attribute__ ((vector_size (16)))`](https://stackoverflow.com/q/7843597/995714) if you're using gcc? — phuclv, Mar 03 '18 at 09:38
Visual Studio 2015, x64,Full Optimization (/Ox),Advanced Vector Extensions 2 (/arch:AVX2),@BasileStarynkevitch — bitnick, Mar 03 '18 at 09:38
I wouldn't hope for much, this requires `movmskps`-ing the result of the comparison out of the vector and scalar-comparing it, that's not something I've ever seen MSVC do on its own. — harold, Mar 03 '18 at 09:42
[Auto-Vectorize comparison](https://stackoverflow.com/q/41002949/995714), [Auto-vectorization of loop containing comparisons](https://stackoverflow.com/q/45339998/995714), [how to auto vectorization array comparison function](https://stackoverflow.com/q/42146125/995714) — phuclv, Mar 03 '18 at 09:42
Meassure the Performance. Orherwise you can not tell if you did a good Job. Google Benchmark is a good Tool. I think openmp as suggested by @nemequ will Take longer than serial execution, due to the Overhead — schorsch312, Mar 03 '18 at 15:54

nemequ · Answer 1 · 2018-03-03T15:55:08.160

0

How about something like this:

int res = 0;
#pragma omp simd reduction(+:res)
for (int i = 0 ; i < 4 ; i++) {
  res += a[i] < b[i];
}

?

If you can get your input to be properly aligned (and add an aligned clause to the openmp pragma), it should be pretty quick. Especially if your input is really longer than 4 elements.

res would be 0-4 instead of 0 or 1, but that's probably not a problem. SIMD instructions tend to handle horizontal adds but not horizontal bitwise and.

edited Mar 03 '18 at 15:55

answered Mar 03 '18 at 15:43

nemequ

16,623
1
43
62

1

*SIMD instructions tend to handle horizontal adds* Uh, not x86. An `hadd` instruction exists, but it only does horizontal pairs, not a whole reduction. And more importantly, it's slower than separate shuffle + add instructions. https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86. Anyway, on x86 the best asm for what the OP wants is `pcmpgtd xmm0, xmm1` (a,b) / `pmovmskb eax, xmm0` / `cmp eax, 0xffff` / `je condition_true` (i.e. check that each element of `a` compared greater than the corresponding element of `b`, so the compare-mask is all-one) – Peter Cordes Mar 04 '18 at 04:47
So what you *really* want is a smart compiler that knows how to use SIMD compares itself, otherwise I'm not sure there's a way to hand-hold it into emitting that asm. And BTW, a horizontal AND on x86 is just as easy as horizontal add, for non-boolean vectors where you can't just extract the vector to an integer bitmap. Also BTW, NEON doesn't have an equivalent to `pmovmskb`, so turning a boolean vector into an integer bitmap takes more work. – Peter Cordes Mar 04 '18 at 04:51
1

instead of `res += a[i] < b[i];` I think it's better to use `res &= a[i] < b[i];` to get the boolean result instead of total sum – phuclv Mar 04 '18 at 12:43
Peter, I completely agree, but the question is about writing code the compiler can auto-vectorize. What I wrote is the closest I can think of without a lot of testing :( – nemequ Mar 04 '18 at 17:27
Lưu Vĩnh Phúc, `&=` won't work. Perhaps you meant `|=`? As I mentioned in my answer, I decided to use `+=` instead for a reason, but `|=` would work… it *may* just be a bit slower. – nemequ Mar 04 '18 at 17:30

How to write code that compiler can optimize to SIMD compare?

1 Answers1