void reset_if_true(void*& ptr, bool cond)
{
    if (cond)
        ptr = nullptr;
}
The naive solution will undoubtedly be the fastest in the majority of cases. Although it has a branch, which can be slow on modern pipelined processors, it is only slow if the branch is mispredicted. Since branch predictors are very good nowadays, unless the value of cond is extremely unpredictable, it's likely that a simple conditional branch is the fastest way to write the code.
And if it is not, a good compiler should know that and be able to optimize the code into something better for the target architecture. This goes to gnasher729's point: just write the code the simple way and leave optimization in the hands of the optimizer.
While this is good advice in general, sometimes it is taken too far. If you actually care about the speed of this code, you need to check and see what the compiler is actually doing with it. Check the object code that it is generating, and make sure that it is sensible and that the function's code is getting inlined.
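One low-effort way to do that check (the file name here is just an illustration; pasting the function into Compiler Explorer at https://godbolt.org works just as well):

g++ -O2 -S -masm=intel reset_if_true.cpp -o reset_if_true.s
cl /O2 /FA reset_if_true.cpp

The first dumps Intel-syntax assembly from GCC (and recent Clang); the second makes MSVC write an .asm listing. If inlining is what you care about, look at the assembly of the translation unit that actually calls the function, not just the function compiled in isolation.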
Such an examination can be quite revealing. For example, let's consider x86-64, where branches can be quite expensive in cases where branch prediction is foiled (which is really the only time when this is an interesting question, so let's assume that cond is completely unpredictable). Almost all compilers are going to generate the following for the naive implementation:
reset_if_true(void*&, bool):
    test    sil, sil                ; test 'cond'
    je      CondIsFalse
    mov     QWORD PTR [rdi], 0      ; set 'ptr' to nullptr, and fall through
CondIsFalse:
    ret
This is about as tight as code gets. But if you put the branch predictor in a pathological case, it might end up being slower than using a conditional move:
reset_if_true(void*&, bool):
    xor     eax, eax                ; pre-zero the register RAX
    test    sil, sil                ; test 'cond'
    cmove   rax, QWORD PTR [rdi]    ; if 'cond' is false, set the register RAX to 'ptr'
    mov     QWORD PTR [rdi], rax    ; set 'ptr' to the value in the register RAX
    ret                             ;   (which is either 'ptr' or 0)
Conditional moves have a relatively high latency, so they are considerably slower than a well-predicted branch, but they might be faster than a completely unpredictable branch. You would expect a compiler to know this when targeting the x86 architecture, but it doesn't (at least in this simple example) have any knowledge about how predictable cond is. It assumes the simple case, that branch prediction will be on your side, and generates the branching code above (call it code A) instead of the conditional-move code (code B).
If you decide that you want to encourage the compiler to generate branchless code because of an unpredictable condition, you might try the following:
void reset_if_true_alt(void*& ptr, bool cond)
{
    ptr = (cond) ? nullptr : ptr;
}
This succeeds in persuading modern versions of Clang to generate branchless code B, but is a complete pessimization in GCC and MSVC. If you hadn't checked the generated assembly, you wouldn't have known that. If you want to force GCC and MSVC to generate branchless code, you will have to work harder. For example, you might use the variation posted in the question:
void reset_if_true(void*& ptr, bool cond)
{
    void* p[] = { ptr, nullptr };
    ptr = p[cond];    // 'cond' converts to 0 or 1, selecting 'ptr' or nullptr
}
When targeting x86, all compilers generate branchless code for that, but it is not especially pretty code. In fact, none of them generate conditional moves. Instead, you get multiple accesses to memory in order to build the array:
reset_if_true_alt(void*&, bool):
    mov     rax, QWORD PTR [rdi]
    movzx   esi, sil
    mov     QWORD PTR [rsp-16], 0
    mov     QWORD PTR [rsp-24], rax
    mov     rax, QWORD PTR [rsp-24+rsi*8]
    mov     QWORD PTR [rdi], rax
    ret
Ugly and probably very inefficient. I'd predict that the conditional-jump version gives it a run for its money even in the case where the branch is mispredicted. You'd have to benchmark it to be sure, of course, but it is probably not a good choice.
If you were still desperate to eliminate the branch on MSVC or GCC, you'd have to do something uglier involving reinterpreting the pointer bits and twiddling them. Something like:
#include <cstdint>

void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    p &= -(!cond);    // mask is all-ones when 'cond' is false, zero when it is true
    ptr = reinterpret_cast<void*>(p);
}
That will give you the following:
reset_if_true_alt(void*&, bool):
    xor     eax, eax
    test    sil, sil
    sete    al                      ; EAX = !cond
    neg     eax                     ; EAX = 0 or -1 (all ones)
    cdqe                            ; sign-extend into RAX
    and     QWORD PTR [rdi], rax    ; mask 'ptr' in place
    ret
Again, here we've got more instructions than a simple branch, but at least they're relatively low-latency instructions. A benchmark on realistic data will tell you whether the tradeoff is worth it, and it will give you the justification you need to put in a comment if you are actually going to check in code like this.
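If you want numbers rather than predictions, a micro-benchmark along the following lines is one way to get them. This is a minimal sketch, not part of the original question: the function names, the data size, the 50/50 random distribution, and the timing harness are all my own assumptions, and you'd want to substitute data that looks like your real workload.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

void reset_branchy(void*& ptr, bool cond)        // the naive version
{
    if (cond)
        ptr = nullptr;
}

void reset_masked(void*& ptr, bool cond)         // the bit-twiddled version from above
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    p &= -(!cond);
    ptr = reinterpret_cast<void*>(p);
}

// Time one candidate over a large array of unpredictable conditions.
template <typename F>
long long time_us(F f, std::vector<void*>& ptrs, const std::vector<char>& conds)
{
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < ptrs.size(); ++i)
        f(ptrs[i], conds[i] != 0);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    const std::size_t n = 1 << 24;
    int dummy = 0;
    std::vector<void*> ptrs(n, &dummy);
    std::vector<char> conds(n);
    std::mt19937 rng(42);
    for (auto& c : conds)
        c = static_cast<char>(rng() & 1);         // 50/50, completely unpredictable

    std::printf("branchy: %lld us\n", time_us(reset_branchy, ptrs, conds));
    std::fill(ptrs.begin(), ptrs.end(), &dummy);  // restore the pointers we just cleared
    std::printf("masked:  %lld us\n", time_us(reset_masked, ptrs, conds));
}

In practice you'd also want to verify in the generated assembly that the calls haven't been inlined and vectorized into something unrecognizable, and you may need your compiler's optimization barriers to keep the loops honest.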
Once I went down the bit-twiddling rabbit hole, I was able to force MSVC and GCC to use conditional move instructions. Apparently they weren't doing this optimization because we were dealing with a pointer:
void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    ptr = reinterpret_cast<void*>(cond ? 0 : p);
}
reset_if_true_alt(void*&, bool):
    mov     rax, QWORD PTR [rdi]    ; load 'ptr'
    xor     edx, edx                ; RDX = 0
    test    sil, sil                ; test 'cond'
    cmovne  rax, rdx                ; if 'cond' is true, select 0 instead
    mov     QWORD PTR [rdi], rax    ; store the result back to 'ptr'
    ret
Given the latency of CMOVNE and the similar number of instructions, I'm not sure whether this would actually be faster than the previous version. Only a benchmark would tell you.
Similarly, if we bit-twiddle the condition, we save ourselves one memory access:
void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t c = (cond ? 0 : -1);    // all-ones mask when 'cond' is false
    reinterpret_cast<std::uintptr_t&>(ptr) &= c;
}
reset_if_true_alt(void*&, bool):
    xor     esi, 1                  ; flip the low bit: ESI = !cond
    movzx   esi, sil
    neg     rsi                     ; RSI = 0 or all-ones
    and     QWORD PTR [rdi], rsi    ; mask 'ptr' in place
    ret
(That's GCC. MSVC does something slightly different, preferring its characteristic sequence of neg, sbb, neg, and dec instructions, but the two are morally equivalent. Clang transforms it into the same conditional move that we saw it generate above.) This may be the best code yet if we need to avoid branches, considering that it generates sane output on all tested compilers while preserving (some degree of) readability in the source code.