SIMD min slower than normal scalar

Question

I'm trying to find the minimum of an array which has exactly 4 elements. Each element is a signed int type, but only non-negative numbers are used, and -1 is used to represent an invalid value.

The instructions generated for the 2nd version is using SSE which uses SIMD shuffle and compare correctly. I expected this to run faster on Broadwell and Skylake, but when microbenchmarking it, SIMD version runs slower around 3.5ns on Skylake and 2.7ns on Broadwell.

Could you help me explaining why?

int example(int* values) {
  int min_value = 0x7FFFFFFF;
  for (int n = 0; n < 4; ++n) {
    if (values[n] != -1 &&
        values[n] < min_value) {
      min_value = values[n];
    }
  }
  return min_value;
}

https://gcc.godbolt.org/z/iWC26S

int example(int* values) {
  uint min_0 = (uint)values[0] < (uint)values[1] ? (uint)values[0] : (uint)values[1];
  uint min_1 = (uint)values[2] < (uint)values[3] ? (uint)values[2] : (uint)values[3];

  return min_0 < min_1 ? min_0 : min_1;
}

https://gcc.godbolt.org/z/b7JhNZ

My whole program https://gcc.godbolt.org/z/bUXBqd

How did you time an interval that short? Possibly you're bottlenecking on store-forwarding stalls, if you do scalar stores to create the input for the SIMD load? Once the value is loaded into a register, that sequence of pshufd/ pminud should have 4 cycle latency, and throughput of 2 cycles per iteration. (Not counting the `mov eax, xmm0`). Either way it should be faster than the cmp/cmov sequence, unless that loop (which does more work to ignore `-1`) optimizes differently when inlined... Can you provide a [mcve] of how you tested? — Peter Cordes, Jul 04 '20 at 03:04
@PeterCordes I run within 5s and count how many iterations the function executed, using https://github.com/google/benchmark. Will port my code out to give a runnable test. — daole, Jul 04 '20 at 03:09
Can you link the full version on Godbolt, including the necessary libraries? Godbolt has Google Benchmark, but it's version of Abseil doesn't define `int32`. https://gcc.godbolt.org/z/HaNR77. IDK what the point of that is vs. `int32_t` from `stdint.h` / `cstdint`. — Peter Cordes, Jul 04 '20 at 03:21
Yeah, store-forwarding is a possible explanation. If you look at the asm, it does 4 scalar dword stores right before calling `GetMinUint` which does a vector load. [The fast-case for store forwarding only applies when a load takes all its data from a *single* store](//stackoverflow.com/q/46135766). You're only testing throughput (not latency; ignoring the `GetMinUint` return value instead of using it as a dependency for next iteration) but perhaps the extra latency from slow store-forwarding could still be a problem for OoO exec. — Peter Cordes, Jul 04 '20 at 04:32
I'd test with a SIMD XorShift128+ or something ([AVX/SSE version of xorshift128+](https://stackoverflow.com/q/24001930)) because 1: that's much faster and the PRNG is timed loop, and 2: you'll do a vector store of 4 int32_t at once. If your real use-case involves recently-generated separate numbers, you're going to have to more accurately simulate your real conditions and allow inlining; call/ret overhead could be non-negligible here. — Peter Cordes, Jul 04 '20 at 04:37
Could you clarify how the scalar code isn't affected by the dword stores? — daole, Jul 04 '20 at 05:27
It's doing scalar dword loads that exactly line up with the dword stores, so the fast-case of store forwarding works. — Peter Cordes, Jul 04 '20 at 05:31
I forgot to say, if your real use-case is only sensitive to throughput (and OoO exec can hide the extra latency from slow-path store forwarding, aka store forwarding stall), it would still be better. Performance of something as short as 5 instructions can't be characterized by a single number. More like 3 factors: latency, front-end uops, and back-end ports. [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) — Peter Cordes, Jul 04 '20 at 05:34

SIMD min slower than normal scalar

0 Answers0