
I'm an R developer who uses C for algorithmic purposes, and I have a question about why a C loop that seems like it should be slower is actually faster than the alternative approach.

In R, our Boolean type can actually have three values: true, false, and na. We represent this using an int at the C level.

I'm looking into a vectorized && operation (yes, we have this in R already, but bear with me) that also handles the na case. The scalar results would look like this:

 F && F == F
 F && T == F
 F && N == F

 T && F == F
 T && T == T
 T && N == N

 N && F == F
 N && T == N
 N && N == N

Note that it works like && in C, except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.

Now to the implementation. Assume we have two vectors, v_out and v_x, and we'd like to perform the vectorized && on them. We are allowed to overwrite v_out with the result. One option is:

// Option 1
for (int i = 0; i < size; ++i) {
  int elt_out = v_out[i];
  int elt_x = v_x[i];

  if (elt_out == 0) {
    // Done
  } else if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}

And another option is:

// Option 2
for (int i = 0; i < size; ++i) {
  int elt_out = v_out[i];

  if (elt_out == 0) {
    continue;
  }

  int elt_x = v_x[i];

  if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}

I sort of expected the second option to be faster, because it avoids accessing v_x[i] when it isn't required. But in fact it was twice as slow when compiled with -O2!

With the reproducible script below, I get the following timing results. Note that I am on a Mac and compile with Clang.

The results seem reasonable at -O0: the two options are about the same. But at -O2, Option 1 is twice as fast!

Option 1, `clang -O0`
0.110560

Option 2, `clang -O0`
0.107710

Option 1, `clang -O2`
0.032223

Option 2, `clang -O2`
0.070557

What is going on here? My best guess is that it has something to do with the fact that in Option 1 v_x[i] is always being accessed linearly, which is extremely fast. But in Option 2, v_x[i] is essentially being accessed randomly (sort of), because it might access v_x[10], but then not need another element from v_x until v_x[120], and because that access isn't linear, it is probably much slower.

Reproducible script:

#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>

int main() {
  srand(123);

  int size = 1e7;
  int na = INT_MIN;

  int* v_out = (int*) malloc(size * sizeof(int));
  int* v_x = (int*) malloc(size * sizeof(int));

  // Generate random numbers between 1-3
  // 1 -> false
  // 2 -> true
  // 3 -> na
  for (int i = 0; i < size; ++i) {
    int elt_out = rand() % 3 + 1;

    if (elt_out == 1) {
      v_out[i] = 0;
    } else if (elt_out == 2) {
      v_out[i] = 1;
    } else {
      v_out[i] = na;
    }

    int elt_x = rand() % 3 + 1;

    if (elt_x == 1) {
      v_x[i] = 0;
    } else if (elt_x == 2) {
      v_x[i] = 1;
    } else {
      v_x[i] = na;
    }
  }

  clock_t start = clock();

  // Option 1
  for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
      // Done
    } else if (elt_x == 0) {
      v_out[i] = 0;
    } else if (elt_out == na) {
      // Done
    } else if (elt_x == na) {
      v_out[i] = na;
    }
  }

  // // Option 2
  // for (int i = 0; i < size; ++i) {
  //   int elt_out = v_out[i];
  //
  //   if (elt_out == 0) {
  //     continue;
  //   }
  //
  //   int elt_x = v_x[i];
  //
  //   if (elt_x == 0) {
  //     v_out[i] = 0;
  //   } else if (elt_out == na) {
  //     // Done
  //   } else if (elt_x == na) {
  //     v_out[i] = na;
  //   }
  // }

  clock_t end = clock();
  double time = (double) (end - start) / CLOCKS_PER_SEC;

  free(v_out);
  free(v_x);

  printf("%f\n", time);
  return 0;
}

Based on a few questions in the comments, here are a few points of clarification for future readers:

  • I am on a 2018 15-inch MacBook Pro with a 2.9 GHz 6-core Intel Core i9-8950HK (Coffee Lake).

  • The Clang version I tested with is Apple clang version 13.1.6 (clang-1316.0.21.2.5), targeting x86_64-apple-darwin21.6.0.

  • I am restricted by R to use int as the data type (even though there are more efficient options) and the following coding: false = 0, true = 1, na = INT_MIN. The reproducible example that I provided respects this.

  • The original question was not actually a request to make the code run faster. I just wanted to get an idea of what the difference was between my two if/else approaches. That said, some answers have shown that branchless approaches can be much faster, and I really appreciate the explanations that those users have provided! That has greatly influenced the final version of the implementation I am working on.

Davis Vaughan

4 Answers


If you want fast vectorized code, don't do short-circuit evaluation, and don't branch in general. You want the compiler to be able to do 16 or 32 elements at once with SIMD operations, using 8-bit elements. (Compilers can optimize ifs to branchless code if it's safe to do the work unconditionally, including dereferences, and there aren't side-effects. This is called if-conversion, and is typically necessary for auto-vectorization of code like this.)
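As a minimal illustration of what if-conversion means (hypothetical helper functions, not from the question's code):

// Branchy form: the assignment only happens on one path.
int clamp_branchy(int x) {
    if (x < 0)
        x = 0;
    return x;
}

// What if-conversion turns it into, conceptually: an unconditional select,
// which typically compiles to a cmov or a SIMD blend with no branch at all.
int clamp_branchless(int x) {
    return (x < 0) ? 0 : x;
}

That rewrite is always legal here because nothing is loaded or stored conditionally; the catch comes when the conditional work involves memory, as discussed next.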

And you don't want the compiler to worry that it's not allowed to touch some memory because the C abstract machine doesn't touch it. E.g., if all the v_out[i] elements are false, v_x might be a NULL pointer without causing UB! So the compiler can't invent read accesses to objects that the C logic doesn't read at all.

If v_x were truly an array, not just a pointer, the compiler would know that it's readable and would be allowed to invent accesses to it by doing if-conversion of the short-circuit logic to branchless. But if its cost heuristics don't see a really big benefit (like auto-vectorization), it might choose not to. Branchy code will often in practice be slower with a random mix of trues and falses (and NAs).

As you can see in the compiler's assembly output (Clang 15 -O2 on Compiler Explorer), option 1 auto-vectorizes with SIMD, branchlessly handling 4 optional-bools in parallel (with just SSE2, more with -march=native). (Thanks to Richard in comments for cooking up the Compiler Explorer link; it's probably reflective of what Apple Clang will do to your real code in main.)


Your 3-state bool that supports an NA state can be implemented with 2 bits, in a way that lets bitwise AND implement your && operation. You can store arrays of it as one value per unsigned char, or packed 4 per char to quadruple your throughput for vectorized operations, at the cost of slower access. (Or in general CHAR_BIT/2 per char, but on mainstream C implementations for x86 that's 4.)

  • F = 00
  • N = 10 (in binary, so C 0b10 aka 2)
  • T = 11
  • conversion to bool with val & 1.
  • conversion from bool with 0b11 * b or something to broadcast the low bit to both positions.

F & anything = 0 because F is all-zero bits. N&N == N; that's trivially true for any bit-pattern. The "clever" part is that N&T = T&N = N, since the set bits in T are a superset of those in N.

This also works for || with bitwise |: F|N == N and F|T == T because 0|x == x. Also x|x == x for any same input so we're still fine there.

N = 0b10 won't set the low bit when ORing, but will clear it when ANDing.
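Concretely, a minimal plain-C sketch of that encoding (the constant and function names are mine, purely illustrative):

#include <stddef.h>

// 2-bit encoding, one value per byte: F = 0b00, N = 0b10, T = 0b11.
enum { NB_F = 0, NB_N = 2, NB_T = 3 };

// The whole vectorized && is just bitwise AND over the encoded bytes;
// swap &= for |= and the same loop does the three-valued ||.
void nb_and_arrays(unsigned char *out, const unsigned char *x, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] &= x[i];
}

// Conversions described above: to bool with val & 1, from bool with 3 * b.
int nb_to_bool(unsigned char v) { return v & 1; }
unsigned char nb_from_bool(int b) { return (unsigned char)(3 * (b != 0)); }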


I forgot you said C instead of C++, so this bare-bones class wrapper (only sufficient to demo a few test callers) might not be relevant, but a loop doing c1[i] &= c2[i]; in plain C for unsigned char *c1, *c2 will auto-vectorize exactly the same way.

struct NBool{ // Nullable bool, should probably rename to optional bool
    unsigned char val;
    static const unsigned char F = 0b00;
    static const unsigned char T = 0b11;
    static const unsigned char N = 0b10;  // N&T = N;  N&N = N;  N&F = F

    auto operator &=(NBool rhs){   // define && the same way if you want, as non-short-circuiting
        val &= rhs.val;
        return *this;
    }
    operator bool() { return val & 1; }

    constexpr NBool(unsigned char x) : val(x) {};
    constexpr NBool& operator=(const NBool &) = default;

};

#include <stdint.h>
#include <stdlib.h>

bool test(NBool a){
    return a;
}

bool test2(NBool a){
    NBool b = NBool::F;
    return a &= b;   // return false
}


void foo(size_t len, NBool *a1, NBool *a2 )
{
    for (std::size_t i = 0 ; i < len ; i++){
        a1[i] &= a2[i];
    }
}

(I think "Nullable" isn't really correct terminology for something that can have be NaN / NA; it's always safe to read, and it's not a reference in the first place. Maybe optional_bool, like C++ std::optional which is a value that may or may not be present.)

This compiles on Compiler Explorer with GCC and Clang. Clang auto-vectorizes fairly nicely with an unrolled loop doing vandps. (A bit of a weird choice by clang; on -march=haswell, vpand has better throughput.) But it is still limited by 1/clock store and 2/clock load anyway; this very much bottlenecks on load/store with such low computational intensity, even if the data is hot in L1d cache.

(Intel's optimization manual says that even though Skylake's peak L1d bandwidth is 2 loads + 1 store per clock (96 bytes with 32-byte vectors), the sustained bandwidth is more like 84 bytes per clock.)

It can still come relatively close to 32 bytes ANDed per clock cycle, with AVX. So that's 32 NBool & operations, or 128 per clock if you pack 4 NBools per byte.

Packing NBools down to a packed bitmap of 1-bit bools could be done with pslld xmm, 7 / pmovmskb to extract the low bit of each byte (after shifting it to the high bit).
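In intrinsics, that packing step might look something like this (a sketch; the function name is mine):

#include <emmintrin.h>   // SSE2

// Pack 16 byte-sized encoded values into a 16-bit bitmap of plain bools:
// bit i = low bit of byte i, so T becomes 1 while F and N both become 0.
unsigned pack16_to_bools(const unsigned char *p) {
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    v = _mm_slli_epi32(v, 7);                  // pslld: bit 0 of each byte -> bit 7
    return (unsigned)_mm_movemask_epi8(v);     // pmovmskb: collect bit 7 of every byte
}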

If stored 4 per byte, some SIMD bit-manipulation is in order to pack to bools, perhaps vpshufb as a 4-bit LUT to pack pairs of NBools down to a pair of bools in the bottom of a nibble, then combine? Or use scalar BMI2 pext to extract every other bit from 64 bits, if you're on Zen 3 or Haswell or later, for fast pext.

Peter Cordes
  • @KarlKnechtel: Cheers, thanks. Unfortunately I missed the detail in comments under yours that they had a storage format dictated by R, that they're still calling this on R data structures, not doing the whole computation in C, so 2-bit choices may not work as easily. If they have multiple steps of array processing before returning from a C function, I guess they could maybe pack down to bytes, maybe with unsigned saturation for the last step (`vpackuswb`) so INT_MIN becomes UCHAR_MAX, all-ones, then maybe transform with AND and `vpshufb` to map to/from this. – Peter Cordes Sep 23 '22 at 06:47
    @Lundin: I thought about that right after posting but decided not to. The answer to the actual question asked is the first half. The question didn't ask for code at all, and the idea is fully trivial in C, just `&` on `unsigned char` elements, however you want to do it, so that's already clear from the text. As my answer says, it will optimize the same if you write something equivalent in C. This might not even be useful for the OP if they're using R's format which has huge 32-bit bools with a fixed format, 0x80000000 for NA, and the usual `bool` object representation in the low byte. – Peter Cordes Sep 23 '22 at 06:49
    "Option 2 can't be vectorized" is the correct answer. Observe in the ASM: https://godbolt.org/z/dd7aaKxTY – Richard Sep 23 '22 at 10:33
  • @Richard: Thanks for checking, I was curious but didn't get around to looking at those myself, since I started writing this as a comment about a way to represent the 3 states efficiently, and only expanded on answering the real question after seeing that other answers didn't identify the real problem with conditional dereference. – Peter Cordes Sep 23 '22 at 10:36
    @PeterCordes this is a great answer, thanks for taking the time to write this up. Sorry for not making it clear that I'm stuck using `int` with `true = 1`, `false = 0`, and `na = INT_MIN`. Nevertheless, I learned a lot from reading this! – Davis Vaughan Sep 23 '22 at 16:00
    Option 2 can't be vectorized is only true when the code is compiled without AVX. When compiled targeting an AVX architecture clang can use the `vpmaskmovd` instruction, which enables it to vectorize the second function without running into the issue of touching memory because `vpmaskmovd` conditionally touches memory. So if AVX is enabled the performance story becomes very different. – user1937198 Sep 24 '22 at 03:53
  • @user1937198: Good point, yes that'll help on uarches where it's available and fast even for stores, i.e. Intel but not AMD. (e.g. one per 12 cycles for `vpmaskmovd mem, ymm` on Zen 2: https://uops.info/, for the normal case (all-true mask)). Note that integer `vpmaskmovd` requires AVX2, but `vmaskmovps` is available in AVX1. The really necessary part of AVX2 for 256-bit vectorization is `vpcmpgtd ymm` to turn `1` into `-1` via `x | cmpgt(0,x)`, producing the same 11, 10, 00 situation I suggested in my answer; at least I *hope* that's what clang is doing with it, haven't looked yet. – Peter Cordes Sep 27 '22 at 03:26
  • (Anyway yes, clang `-O2 -march=skylake` does use `vpmaskmovd`, if you don't also use `-mno-avx2`. https://godbolt.org/z/6T6bMhEs5. But since it doesn't know that `1` and `na` are the only non-`0` values possible, it does more work than a manually vectorized version would, just like in option 1.) – Peter Cordes Sep 27 '22 at 04:12
  • Why not compare execution time with OP's post? – darda Oct 04 '22 at 11:17
  • @darda: A good implementation of this should just bottleneck on cache bandwidth unless the data is hot in L1d cache, or maybe L2. And we can [count uops](https://stackoverflow.com/questions/51607391/what-considerations-go-into-predicting-latency-for-operations-on-supersca) per vector of work just looking at the compiler-generated asm (and https://agner.org/optimize/ / https://uops.info/), to see how "friendly" to the other hyper-thread it is, if it's sharing a physical core. I find it much more useful to look at the asm to see what the compiler did than to actually time it on a specific CPU. – Peter Cordes Oct 04 '22 at 12:22
  • Plus, if/when I get back to updating this, it'd be more useful to spend my time on writing a version of this that auto-vectorizes more efficiently, not on the details of a microbenchmark. – Peter Cordes Oct 04 '22 at 12:24
  • What is *"[NA](https://en.wikipedia.org/wiki/Three-valued_logic)"* / *[NAs](https://en.wiktionary.org/wiki/NA#Noun)*? *[Not available](https://en.wikipedia.org/wiki/Na#Other_uses)*? Do you have a reference for it? – Peter Mortensen Jan 12 '23 at 18:03
  • @PeterMortensen: Yes, not-available, as in `std::optional.has_value() == false`, similar concepts to FP Not-a-number ([NaN](https://en.wikipedia.org/wiki/NaN)) – Peter Cordes Jan 12 '23 at 21:44

Why is this seemingly slower C loop actually twice as fast as the other way?

At a high level, it is a quirk of the compiler and execution environment you are using. Unless array v_x is declared volatile, the compiler is free to interpret the two variations on your code exactly the same way.

I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required.

And if the compiler's optimizer judged that to be true, then it could use that judgement to conditionally avoid reading v_x[i] in the first version of the code, too.


But at a lower level, if the compiler generates code that indeed does conditionally avoid reading v_x[i] in option 2 but not in option 1, then you are probably observing the effects of branch misprediction in the option 2 case. It is entirely plausible that it is cheaper on average to read v_x[i] unconditionally than to suffer large numbers of branch misprediction penalties involving whether it should be read.

One of the takeaways is that on modern hardware, branches can be a lot more expensive than one might expect, especially when they are difficult for the CPU to predict. Where the same computation can be performed with a branchless approach, that may yield a performance win in practice, usually at a cost in source-code clarity. @KarlKnechtel's answer presents a possible branchless variation (branchless apart from the for-loop condition, which is quite predictable) on the computation you are trying to perform.
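To give a flavor of what that can look like with R's 0 / 1 / INT_MIN coding, here is one branchless formulation of the loop body, adapted from an expression worked out in the comment discussion under Karl Knechtel's answer below (shown only as an illustrative sketch, not a tuned implementation):

#include <limits.h>

// Branchless three-valued AND for false = 0, true = 1, na = INT_MIN.
// If either side is false, the (a && b) factor zeroes the result;
// otherwise the result is 1 for true && true and INT_MIN when na is involved.
void and_branchless(int *v_out, const int *v_x, int size) {
  for (int i = 0; i < size; ++i) {
    int a = v_out[i];
    int b = v_x[i];
    v_out[i] = (a && b) * ((a & b) + !(a & b) * INT_MIN);
  }
}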

John Bollinger
    I've accepted this answer because I think it gets at the ethos of the question, but I also appreciate your and @KarlKnechtel's additional comments in his answer about how to optimize this code further by removing branches altogether! Thanks! – Davis Vaughan Sep 22 '22 at 17:17
  • "the compiler is free to interpret the two variations on your code exactly the same way." that is only completely true if the compiler has intimate knowledge of malloc. If it doesn't have intimate knowledge of malloc then it can remove useless memory accesses but it can't add them. – plugwash Sep 22 '22 at 23:56
  • And the compiler -- or more properly the C implementation -- could have that knowledge, @plugwash, if indeed it were required. The remark is about C language semantics, not a specific implementation. The point is that accessing a non-`volatile` object is not part of the observable behavior of the program. Therefore, such accesses can be *both* added and omitted as long as the resulting observable behavior of the program (as that is defined by the spec) is the same as that of the abstract machine running the same program. – John Bollinger Sep 23 '22 at 03:21
    @JohnBollinger If v_out were all 0s and v_x was 0-length, wouldn't introducing a read to v_x also introduce undefined behavior? – Aaron Dufour Sep 23 '22 at 23:44
  • No, @AaronDufour. Undefined behavior is an abstract-machine concept. The compiler cannot introduce UB because it effectively *implements* the program's behavior. Now, a buggy compiler can produce an executable that exhibits non-conforming behavior, but it is not necessary that the situation you describe would combine with the optimization I described to have that result. On the other hand, neither is it necessary that any *particular* compiler would be able to implement that optimization. – John Bollinger Sep 24 '22 at 14:24
    @JohnBollinger The lecture on the exact meaning of undefined behavior was a great way of completely missing my point. In your comment you claimed that whether an object is accessed is not part of the observable behavior, but surely an out-of-bounds read is observable? I now suspect you meant only that _removing_ reads is okay, but that's not what you said. – Aaron Dufour Sep 25 '22 at 13:25
  • @AaronDufour, "*surely an out-of-bounds read is observable?*" -- in what way? Or to put it a different way, in what sense is the read out of bounds? I think you are not fully appreciating the distinction between abstract machine semantics and physical machine behavior, which, to be sure, is a bit subtle. – John Bollinger Sep 25 '22 at 13:30
    Or to put it a third way, no, an out-of-bounds read is *not* an observable behavior of the program. Refer to paragraph 5.1.2.3/6 of the C17 language specification for the (quite short) list of observable behaviors. If a program has *abstract-machine* semantics that produce an out of bounds access then the program's behavior is undefined, but that is irrelevant because it is not the case here. – John Bollinger Sep 25 '22 at 15:19
  • @AaronDufour: No, "out-of-bounds" reads within the same page as known-readable data are always safe and won't fault; that's how hand-written asm for `strlen` works. ([Is it safe to read past the end of a buffer within the same page on x86 and x64?](//stackoverflow.com/q/37800739)). But yes, compilers must avoid deref of a pointer if they can't prove it's safe to deref, e.g. if the C abstract machine hasn't dereferenced it. That's why compilers can't auto-vectorize the OP's option 2, as my answer explains. (Not without AVX masked load/store, like AVX2 `vpmaskmovd`.) – Peter Cordes Sep 27 '22 at 20:44

Note that it works like && in C except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.

Rather than represent the values as a strict enumeration, allow a numeric value of either 2 or 3 to represent na (you can check for this when displaying, or have a normalization step after all the number-crunching). This way, no conditional logic (and thus no costly branch misprediction) is needed: we simply bitwise-OR the bit in the 2s place (regardless of the operator), and bitwise-AND (or whichever operator) the bit in the 1s place.

int is_na(int value) {
    return value & 2;
}

void r_and_into(unsigned* v_out, unsigned* v_x, int size) {
  for (int i = 0; i < size; ++i) {
    unsigned elt_out = v_out[i];
    unsigned elt_x = v_x[i];
    // this can probably be micro-optimized somehow....
    v_out[i] = (elt_out & elt_x & 1) | ((elt_out | elt_x) & 2);
  }
}

If we are forced to use INT_MIN to represent the N/A value, we can start by observing what that looks like in two's complement: it has exactly one bit set (the sign bit, which is the most significant bit when the value is reinterpreted as unsigned). Thus, we can use that bit instead of the 2s bit, with the same kind of unconditional logic, and then correct any (INT_MIN | 1) results to INT_MIN:

const unsigned MSB_FLAG = (unsigned)INT_MIN;

void r_and_into(int* v_out, int* v_x, int size) {
  for (int i = 0; i < size; ++i) {
    unsigned elt_out = (unsigned)v_out[i];
    unsigned elt_x = (unsigned)v_x[i];
    elt_out = (elt_out & elt_x & 1) | ((elt_out | elt_x) & MSB_FLAG);
    // if the high bit is set, clear the low bit
    // I.E.: AND the low bit with the negation of the high bit.
    v_out[i] = (int)(elt_out & ~(elt_out >> 31));
  }
}

(All these casts might not be necessary, but I think it is good practice to use unsigned types for bitwise manipulations. They should all get optimized away anyway.)
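If you want to sanity-check a candidate mapping against the truth table in the question, a small harness along these lines (hypothetical; it assumes the second r_and_into above is in the same file) makes the nine cases easy to eyeball:

#include <stdio.h>
#include <limits.h>

int main(void) {
  int vals[3] = {0, 1, INT_MIN};           // F, T, N in R's coding
  const char *names[3] = {"F", "T", "N"};

  for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 3; ++j) {
      int out = vals[i];
      int x = vals[j];
      r_and_into(&out, &x, 1);             // run the kernel on a single element
      printf("%s && %s -> %d\n", names[i], names[j], out);
    }
  }
  return 0;
}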

Karl Knechtel
    I'm forced (by R) to use `0 = false`, `1 = true`, `INT_MIN = na`. Can this still work? – Davis Vaughan Sep 22 '22 at 15:37
  • @DavisVaughan, in that particular case, you could try this variation: `v_out[i] = ((elt_out && elt_x) ? (elt_out | elt_x) : 0);`. Or, a bit more obscurely: `v_out[i] = (elt_out && elt_x) * (elt_out | elt_x);`. Note in particular that the ternary operator doesn't necessarily involve any branching at the CPU level, but if it does in this case then the multiplicative version definitely should not. – John Bollinger Sep 22 '22 at 15:51
  • Note also that I really do mean logical and (`&&`), not bitwise and (`&`), especially in the multiplicative version. – John Bollinger Sep 22 '22 at 16:01
  • John, hmm is that right? `INT_MIN | 1` gives me 1 more than `INT_MIN` (I think) – Davis Vaughan Sep 22 '22 at 16:09
  • @DavisVaughan, no, it's not right. Give me a minute. – John Bollinger Sep 22 '22 at 16:14
    Ok, @DavisVaughan, what I wrote before was for NA represented as -1 (or any odd number other than 1, actually). This messier version should work for NA represented as `INT_MIN`, as is the actual case: `v_out[i] = (elt_out && elt_x) * ((elt_out & elt_x) + !(elt_out & elt_x) * INT_MIN);`. – John Bollinger Sep 22 '22 at 16:30
  • That perfectly exhibits the "cost in source code clarity" that I describe in my answer as often accompanying branchless code. – John Bollinger Sep 22 '22 at 16:43
  • I edited to try to accommodate this. – Karl Knechtel Sep 22 '22 at 16:44
  • Karl it looks like `na` with `false` is producing `na`, but it should produce `false`. `na` only propagates when it is mixed with another `na` or `true`. – Davis Vaughan Sep 22 '22 at 16:49
  • Oh? I skimmed over the truth table too quickly then. That way is probably not as easy to solve, but the same kind of reasoning might be applicable. - Yes, that should be about as easy, actually. Rather than treating the N-indicating bit as primary, we should treat the F bit that way. I'm leaving that as an exercise for now, because I need more caffeine. – Karl Knechtel Sep 22 '22 at 16:50
  • John I _think_ your solution is right, and it is extremely fast! (Note that the point of the question was really to understand what the compiler was doing in the specific case I mentioned, but I really appreciate the optimization too). Update: which means that your actual "answer" you posted is actually my favorite so far – Davis Vaughan Sep 22 '22 at 16:54
    @KarlKnechtel: I think `F=00`, `T=0b11`, `N=0b10` is even more efficient, allowing a single bitwise AND. I just posted an answer with that. It also works for `||` as `|`, with `F|N == N` and `F|T == T` and commutative, and `x|x == x` for any same input. Also, it's a huge waste of space (thus memory bandwidth and SIMD ALU throughput) to store one bool in an `int`. – Peter Cordes Sep 22 '22 at 22:11

Let’s take a look at what these code samples compile to, on Clang 15.0.0 with -std=c17 -O3 -march=x86-64-v3. Other compilers will generate slightly different code; it’s finicky.

Factoring out your code snippets into functions, we get

#include <limits.h>
#include <stddef.h>

#define na INT_MIN

int* filter1( const size_t size,
              int v_out[size],
              const int v_x[size]
            )
{
  for ( size_t i = 0; i < size; ++i) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
      // Done
    } else if (elt_x == 0) {
      v_out[i] = 0;
    } else if (elt_out == na) {
      // Done
    } else if (elt_x == na) {
      v_out[i] = na;
    }
  }
  return v_out;
}


int* filter2( const size_t size,
              int v_out[size],
              const int v_x[size]
            )
{
  for (int i = 0; i < size; ++i) {
    int elt_out = v_out[i];

    if (elt_out == 0) {
      continue;
    }

    int elt_x = v_x[i];

    if (elt_x == 0) {
      v_out[i] = 0;
    } else if (elt_out == na) {
      // Done
    } else if (elt_x == na) {
      v_out[i] = na;
    }
  }
  return v_out;
}

Your option 1, filter1 here, compiles to a vectorized loop on Clang 15 (GCC 12 has trouble with it). The loop body compiles to:

.LBB0_8:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm3, ymmword ptr [r10 + 4*rsi - 32]
        vmovdqu ymm4, ymmword ptr [rdx + 4*rsi]
        vpcmpeqd        ymm5, ymm3, ymm0
        vpcmpeqd        ymm6, ymm4, ymm0
        vpxor   ymm7, ymm6, ymm1
        vpcmpgtd        ymm3, ymm3, ymm2
        vpcmpeqd        ymm4, ymm4, ymm2
        vpand   ymm3, ymm3, ymm4
        vpandn  ymm4, ymm5, ymm6
        vpandn  ymm5, ymm5, ymm7
        vpand   ymm3, ymm5, ymm3
        vpand   ymm5, ymm3, ymm2
        vpor    ymm3, ymm3, ymm4
        vpmaskmovd      ymmword ptr [r10 + 4*rsi - 32], ymm3, ymm5
        vmovdqu ymm3, ymmword ptr [r10 + 4*rsi]
        vmovdqu ymm4, ymmword ptr [rdx + 4*rsi + 32]
        vpcmpeqd        ymm5, ymm3, ymm0
        vpcmpeqd        ymm6, ymm4, ymm0
        vpxor   ymm7, ymm6, ymm1
        vpcmpgtd        ymm3, ymm3, ymm2
        vpcmpeqd        ymm4, ymm4, ymm2
        vpand   ymm3, ymm3, ymm4
        vpandn  ymm4, ymm5, ymm6
        vpandn  ymm5, ymm5, ymm7
        vpand   ymm3, ymm5, ymm3
        vpand   ymm5, ymm3, ymm2
        vpor    ymm3, ymm3, ymm4
        vpmaskmovd      ymmword ptr [r10 + 4*rsi], ymm3, ymm5
        add     rsi, 16
        add     r9, -2
        jne     .LBB0_8

So we see the compiler optimized the loop to a series of SIMD comparisons (vpcmpeqd instructions) to generate a bitmask that it will then use to do conditional moves with vpmaskmovd. This looks more complex than it really is, because it’s partly unrolled to do two consecutive updates per iteration.

You will note that there are no branches, other than the test at the bottom of the loop for whether we are at the end of the array. However, because of conditional moves, we can sometimes get a cache miss on a load or a store. That is what I think sometimes happened in my tests.

Now let’s look at option 2:

.LBB1_8:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm3, ymmword ptr [r10 + 4*rsi - 32]
        vpcmpeqd        ymm4, ymm3, ymm0
        vpxor   ymm5, ymm4, ymm1
        vpmaskmovd      ymm5, ymm5, ymmword ptr [r11 + 4*rsi - 32]
        vpcmpeqd        ymm6, ymm5, ymm0
        vpxor   ymm7, ymm6, ymm1
        vpcmpgtd        ymm3, ymm3, ymm2
        vpcmpeqd        ymm5, ymm5, ymm2
        vpand   ymm3, ymm3, ymm5
        vpandn  ymm5, ymm4, ymm6
        vpandn  ymm4, ymm4, ymm7
        vpand   ymm3, ymm4, ymm3
        vpand   ymm4, ymm3, ymm2
        vpor    ymm3, ymm3, ymm5
        vpmaskmovd      ymmword ptr [r10 + 4*rsi - 32], ymm3, ymm4
        vmovdqu ymm3, ymmword ptr [r10 + 4*rsi]
        vpcmpeqd        ymm4, ymm3, ymm0
        vpxor   ymm5, ymm4, ymm1
        vpmaskmovd      ymm5, ymm5, ymmword ptr [r11 + 4*rsi]
        vpcmpeqd        ymm6, ymm5, ymm0
        vpxor   ymm7, ymm6, ymm1
        vpcmpgtd        ymm3, ymm3, ymm2
        vpcmpeqd        ymm5, ymm5, ymm2
        vpand   ymm3, ymm3, ymm5
        vpandn  ymm5, ymm4, ymm6
        vpandn  ymm4, ymm4, ymm7
        vpand   ymm3, ymm4, ymm3
        vpand   ymm4, ymm3, ymm2
        vpor    ymm3, ymm3, ymm5
        vpmaskmovd      ymmword ptr [r10 + 4*rsi], ymm3, ymm4
        add     rsi, 16
        add     r9, -2
        jne     .LBB1_8

Similar code on this compiler, but slightly longer. One difference is a masked (conditional) load from the v_x vector.

However, that is with -march=x86-64-v3. If you don’t tell it it’s allowed to use AVX2 instructions, such as vpmaskmovd, Clang 15.0.0 will give up on vectorizing this version of the algorithm at all.

For comparison, we could refactor this code, taking advantage of the fact that the updated value of v_out[i] will always be equal to either v_out[i] or v_x[i]:

int* filter3( const size_t size,
              int v_out[size],
              const int v_x[size]
            )
{
  for ( size_t i = 0; i < size; ++i) {
    const int elt_out = v_out[i];
    const int elt_x = v_x[i];

    v_out[i] = (elt_out == 0)  ? elt_out :
               (elt_x == 0)    ? elt_x :
               (elt_out == na) ? elt_out :
               (elt_x == na)   ? elt_x :
                                 elt_out;
  }
  return v_out;
}

And this gets us some very different code:

.LBB2_7:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm6, ymmword ptr [rax + 4*rsi]
        vmovdqu ymm4, ymmword ptr [rax + 4*rsi + 32]
        vmovdqu ymm3, ymmword ptr [rax + 4*rsi + 64]
        vmovdqu ymm2, ymmword ptr [rax + 4*rsi + 96]
        vmovdqu ymm7, ymmword ptr [rdx + 4*rsi]
        vmovdqu ymm8, ymmword ptr [rdx + 4*rsi + 32]
        vmovdqu ymm9, ymmword ptr [rdx + 4*rsi + 64]
        vmovdqu ymm5, ymmword ptr [rdx + 4*rsi + 96]
        vpcmpeqd        ymm10, ymm6, ymm0
        vpcmpeqd        ymm11, ymm4, ymm0
        vpcmpeqd        ymm12, ymm3, ymm0
        vpcmpeqd        ymm13, ymm2, ymm0
        vpcmpeqd        ymm14, ymm7, ymm0
        vpor    ymm10, ymm10, ymm14
        vpcmpeqd        ymm14, ymm8, ymm0
        vpor    ymm11, ymm11, ymm14
        vpcmpeqd        ymm14, ymm9, ymm0
        vpor    ymm12, ymm12, ymm14
        vpcmpeqd        ymm14, ymm5, ymm0
        vpcmpeqd        ymm7, ymm7, ymm1
        vblendvps       ymm7, ymm6, ymm1, ymm7
        vpor    ymm13, ymm13, ymm14
        vpcmpeqd        ymm6, ymm6, ymm1
        vpandn  ymm6, ymm10, ymm6
        vpandn  ymm7, ymm10, ymm7
        vpcmpeqd        ymm8, ymm8, ymm1
        vblendvps       ymm8, ymm4, ymm1, ymm8
        vpcmpeqd        ymm4, ymm4, ymm1
        vpcmpeqd        ymm9, ymm9, ymm1
        vblendvps       ymm9, ymm3, ymm1, ymm9
        vpandn  ymm4, ymm11, ymm4
        vpandn  ymm8, ymm11, ymm8
        vpcmpeqd        ymm3, ymm3, ymm1
        vpandn  ymm3, ymm12, ymm3
        vpandn  ymm9, ymm12, ymm9
        vpcmpeqd        ymm5, ymm5, ymm1
        vblendvps       ymm5, ymm2, ymm1, ymm5
        vpcmpeqd        ymm2, ymm2, ymm1
        vpandn  ymm2, ymm13, ymm2
        vpandn  ymm5, ymm13, ymm5
        vblendvps       ymm6, ymm7, ymm1, ymm6
        vblendvps       ymm4, ymm8, ymm1, ymm4
        vblendvps       ymm3, ymm9, ymm1, ymm3
        vblendvps       ymm2, ymm5, ymm1, ymm2
        vmovups ymmword ptr [rax + 4*rsi], ymm6
        vmovups ymmword ptr [rax + 4*rsi + 32], ymm4
        vmovups ymmword ptr [rax + 4*rsi + 64], ymm3
        vmovups ymmword ptr [rax + 4*rsi + 96], ymm2
        add     rsi, 32
        cmp     r11, rsi
        jne     .LBB2_7

Although this looks longer, this is updating four vectors on each iteration, and is in fact blending the v_out and v_x vectors with a bitmask. The GCC 12.2 version of this loop follows similar logic with one update per iteration, so is more concise:

.L172:
        vmovdqu ymm3, YMMWORD PTR [rcx+rax]
        vpcmpeqd        ymm0, ymm2, YMMWORD PTR [rsi+rax]
        vpcmpeqd        ymm1, ymm3, ymm2
        vpcmpeqd        ymm6, ymm3, ymm4
        vpcmpeqd        ymm0, ymm0, ymm2
        vpcmpeqd        ymm1, ymm1, ymm2
        vpand   ymm0, ymm0, ymm1
        vpcmpeqd        ymm1, ymm4, YMMWORD PTR [rsi+rax]
        vpor    ymm1, ymm1, ymm6
        vpand   ymm6, ymm0, ymm1
        vpandn  ymm1, ymm1, ymm0
        vpxor   ymm0, ymm0, ymm5
        vpblendvb       ymm0, ymm3, ymm2, ymm0
        vpblendvb       ymm0, ymm0, ymm3, ymm1
        vpblendvb       ymm0, ymm0, ymm4, ymm6
        vmovdqu YMMWORD PTR [rcx+rax], ymm0
        add     rax, 32
        cmp     rdx, rax
        jne     .L172

This, as you see, is roughly as tight as a rolled-up version of 1 and 3 that did one update per iteration, but some optimizers seem to have less trouble with it. A similar version, whose code differs mainly in register allocations, would be:

int* filter4( const size_t size,
              int v_out[size],
              const int v_x[size]
            )
{
  for ( size_t i = 0; i < size; ++i) {
    const int elt_out = v_out[i];
    const int elt_x = v_x[i];

    v_out[i] = (elt_out == 0)  ? 0 :
               (elt_x == 0)    ? 0 :
               (elt_out == na) ? na :
               (elt_x == na)   ? na :
                                 elt_out;
  }
  return v_out;
}

The Takeaway

What seems to have happened is that your compiler was able to vectorize your version 1, but not your version 2, on the settings you were using. If it can vectorize both, they perform similarly.

In 2022, a compiler with aggressive optimization settings can turn any of these loops into vectorized branchless code, at least if you enable AVX2. If you do, the second version is capable, as you thought, of loading from v_x conditionally. (This does lead to a big observable difference when you initialize v_out to all-zeroes.) Compilers in 2022 seem to do better with the single assignment statements of version 3 and 4 than the if blocks of 1 and 2. They vectorize on some targets and settings on which 1 and 2 do not, and even when all four do, Clang 15.0.0 unrolls 3 and 4 more aggressively than 1 and 2.

With AVX-512 instructions enabled, the compiler can optimize all four versions to similar branchless code, and there isn't any significant difference in performance. With other targets (in particular, -O3 -march=x86-64-v2 and -O3 -march=x86-64-v3), Clang 15.0.0 does significantly better with versions 3 and 4 than 1 and 2.

However, if you are willing to change the behavior of the function for some inputs, you can remove the comparisons and conditional moves for further speedup, as in Peter Cordes’ and Karl Knechtel's answers. Here, I wanted to compare like to like.

In my testing, which version was faster depended heavily on what the input values were initialized to. With the same random seed you used, filter1 was slightly faster than the other three, but with truly randomized data, any of the four could be faster.

Davislor
  • `-march=x86-64-v3` may be why you aren't seeing the performance difference, as that enables AVX, allowing clang to use `vpmaskmovd`, which lets it vectorize the second function at all. – user1937198 Sep 24 '22 at 03:50
  • @user1937198 That seems to be the case. With `-O3 -march=x86-64-v2` or lower, `filter2` does not vectorize. – Davislor Sep 26 '22 at 18:44
  • `-march=x86-64-v3` is pretty much `-march=haswell`, I think. IDK if the implied tuning is more generic; that would be nice, to have a tuning appropriate for CPUs in general that have AVX2+FMA, ignoring pitfalls of earlier CPUs (like slow unaligned 32-byte loads on Sandy/Ivy Bridge). `x86-64-v2` doesn't even have AVX, so no vectorization with 128-bit SIMD-integer instructions and AVX1 `vmaskmovps` for masked load/store. (v2 is SSE4.2, like Nehalem, and modern x86-64 with AVX disabled like Skylake Pentium/Celeron, or without it like Silvermont-family before Gracemont.) – Peter Cordes Sep 29 '22 at 04:56
    To figure out clang's auto-vectorization strategy from looking at the asm, it can be helpful to use `-O3 -fno-unroll-loops`. Then you get to see just one iteration of the SIMD loop body. (Clang's unrolling choices normally seem pretty reasonable for performance, although maybe a bit more aggressive than necessary in some loops that aren't going to bottleneck on front-end throughput, and would only be a *bit* more hyperthreading-friendly with less loop overhead. But unrolling tiny loops by 4 seems very good.) – Peter Cordes Sep 29 '22 at 04:59
    @PeterCordes Thanks, good advice. My takeaway is that an aggressive modern optimizer can already turn all of these loops into branchless code, but compilers in 2022 seem to do better with the single assignments in 3 and 4. than `if` blocks. 3 and 4 worked for more targets, and unrolled more optimally. I didn’t test this against Karl Knechtel’s version, in part because that has different behavior, but his probably is faster. The big gain seems to come from optimizing to vectorized branchless instructions, and the gains from fine-tuning past that point are very limited. – Davislor Sep 29 '22 at 06:46
  • If different constants are allowed, my answer shows a choice that allows implementation as `a1[i] &= a2[i]`, which of course vectorizes as a single load + `vpand`. To get better vectorization with the existing constants, you'd want to write C like `a1[i] > 0` to let it use a `pcmpgtd` against zero, instead of needing a `set1(1)` constant, and not requiring exact equality with `na` either. So perhaps `tmp = x > 0 ? -1 : 0;` `x |= tmp;` (-1 or 0 or na) `x &= y;` (1 or 0 or na). I didn't get around to updating my answer with that idea, but thought of it during a 3-day power outage (storm Fiona) – Peter Cordes Sep 29 '22 at 07:02
  • @PeterCordes I think this comes down to whether full binary compatibility is a requirement. As you say, we can speed up the loop if we are willing to change the behavior for certain inputs. I wanted to compare like to like. – Davislor Sep 29 '22 at 07:07
  • According to the OP, it has to work with `T=1, F=0, na=1u<<31` because that's what R uses, and I guess this is code to be called from R, on R data structures. But we *don't* have to deal with any `int` values with *other* bit-patterns. The C is written to check for exact equality for 3 different values, but the point of the 2nd half of my last comment was that behaviour for impossible bit-patterns doesn't matter, so we *can* use different logic that only requires 3 cheap SIMD operations per vector. The C compiler doesn't know that only 3 `int` values are possible. – Peter Cordes Sep 29 '22 at 07:12
  • @PeterCordes Fair enough. I think you covered that optimization so well in your answer, I have nothing to add to it. I learned something from testing these variations that I hope others found informative as well. – Davislor Sep 29 '22 at 07:14
    My answer doesn't yet have a version usable with R that can auto-vectorize to `pcmpgtd` / `por` / `pand`, that's only in comments here and I haven't tested it on Godbolt. My answer just shows a non-compatible version using different bit-patterns to avoid the pcmp/por. (Because that's what I thought of first; coming up with something efficient for R's bit-patterns was harder.) But yes, my answer does cover why being branchless is good and makes things easy for the optimizer. – Peter Cordes Sep 29 '22 at 07:17
  • Aren't there supposed to be two ***aligned*** columns in the assembly listing? Eight space TAB in the original? – Peter Mortensen Jan 12 '23 at 18:29
  • @PeterMortensen I copied and pasted the output without reformatting it. Would it be much easier to read? – Davislor Jan 12 '23 at 23:21