
I have a pointer, ptr, and a condition, cond. I need the fastest possible way to reset ptr if cond is true, or keep ptr unchanged if cond is false. The current implementation is, trivially:

void reset_if_true(void*& ptr, bool cond)
{
    if (cond)
        ptr = nullptr;
}

I'm aware that the above code's performance is already good, and I cannot expect a major boost from optimizing it. However, this code is called several million times per second, and every nanosecond saved is relevant.

I was thinking about something that gets rid of the branch, e.g.:

void* p[] = { ptr, nullptr };
ptr = p[cond];

but I'm not sure it's the best way to proceed.

Peter Mortensen
shrike
  • Maybe inlining the function could yield a little boost. Apart from that, no idea. It is probably best to benchmark on your architecture, get docs on your architecture, and tailor it as well as you can. – Ely Jun 21 '16 at 13:23
  • Have you tried timing the different approaches and seeing what you're getting? If you're looking to maximize the heck out of it, you'll definitely want to look at the assembly instructions your code compiles down to. Also, you could consider some branch-prediction style guessing as to whether `cond` is ever likelier to be truth-y or false-y. – Chris Sprague Jun 21 '16 at 13:27
  • 15
    I am not sure, but maybe you can save time by reducing the number of times you have to check the condition at all. Who is responsible for changing the value of the condition? In the best case you dont need to poll the condition but simply whoever is responsible for altering the condition also sets the pointer... – 463035818_is_not_an_ai Jun 21 '16 at 13:27
  • 1
    Run it through your compiler with different optimization modes and look at the machine code it generates. These optimizations may be instruction-specific so assembler might be the way to go here. Also, the generated machine/assembler code might give you some insight in possible optimizations the compiler did for you! – bzeaman Jun 21 '16 at 13:28
  • @Chris: no, I haven't timed the different approaches yet; I'd like to know what the different approaches are first, which is what the question is for. The assembly may show something, but benchmarking will be required anyway. `cond` is likely to be random, that's why I am thinking about a branchless solution. – shrike Jun 21 '16 at 13:36
  • 3
    Here's another idea: instead of passing `bool`, send in a pointer type (same size as `void *ptr`) that is set to all 1s (i.e. `0xffffffff`) in place of "true" and `0` in place of "false" wherever `cond` is set. Then you can do `ptr &= cond`. This _might_ be faster than conditional branch/assignment. – Chris Sprague Jun 21 '16 at 13:38
  • 2
    Getting rid of branch prediction would only be a real time saver if the 'cond' variable is unpredictable. You could try your second approach. You could also try something like the following: (ptr *= !cond), multiplying ptr by 0 if cond is true (making it null) and 1 if cond if false (leaving it unchanged). You might also try changing your cond to an int (I know, horrible...) to avoid a cast. But as has been said before, benchmark to be sure which is fastest. – Altainia Jun 21 '16 at 13:38
  • @Chris: Then, I'll have to make a pointer from `cond`, which either need a branch or a branchless solution just like what I am looking for :) – shrike Jun 21 '16 at 13:48
  • @Gibet: will this work if `cond` is not `constexpr`? – shrike Jun 21 '16 at 13:49
  • Performance improvements will likely come from outside the function. Mostly around the order of calls, reducing branch prediction failures and cache misses. – OrangeDog Jun 21 '16 at 13:52
  • 1
    @shrike Well, [turns out you can't perform bitwise operations on pointers without a cast](http://stackoverflow.com/questions/15868313/why-cant-you-do-bitwise-operations-on-pointer-in-c-and-is-there-a-way-around-t), so that wouldn't have worked anyway... – Chris Sprague Jun 21 '16 at 13:52
  • 1
    @Gibet, But how do you plan on turning the runtime condition into a compile time template argument? – chris Jun 21 '16 at 14:09
  • 1
    @Gibet: Sorry, but your example does not work at all, bool in a template parameter requires the condition to be known at compile time. And this is not partial specialization, which is not allowed on template functions. – shrike Jun 21 '16 at 14:16
  • @ChrisSprague: Thanks for pointing this out! I didn't know bitwise operations were not allowed on pointers. Actually, the `void*` in the code snippet is not related to memory at all, it's merely an identifier I could easily change to an `uintptr_t` in the code. So bitwise operations would be ok. – shrike Jun 21 '16 at 14:18
  • For future reference, I got that. It should be @ChrisSprague without the space. – chris Jun 21 '16 at 14:21
  • 1
    The fastest way to update a variable on a condition? Don't! I would be surprised if any of the ideas discussed in this thread would make a difference in the grand scheme of your application. If there was anything that would noticeably speedup that line of code, it would mean that it is a significant part of a very intensive loop... and the conditional update of the variable is likely to prevent other much more effective loop optimizations, particularly vectorization. So, I'd recommend stepping back and looking into the optimization of the enclosing loop instead. – Come Raczy Jun 21 '16 at 17:01
  • Just out of curiosity, what is your condition? Is it easy to write/express/explain? – Charles Jun 21 '16 at 17:19
  • @Charlie: the condition is simply `ptr == some_other_ptr` – shrike Jun 21 '16 at 17:47

5 Answers

87
void reset_if_true(void*& ptr, bool cond)
{
    if (cond)
        ptr = nullptr;
}

The naive solution will undoubtedly be the fastest in the majority of cases. Although it has a branch, which can be slow on modern pipelined processors, it is only slow if the branch is mispredicted. Since branch predictors are very good nowadays, unless the value of cond is extremely unpredictable, it's likely that a simple conditional branch is the fastest way to write the code.

And if it is not, a good compiler should know that and be able to optimize the code to something better, considering the target architecture. Which goes to gnasher729's point: just write the code the simple way and leave optimization in the hands of the optimizer.

While this is good advice in general, sometimes it is taken too far. If you actually care about the speed of this code, you need to check and see what the compiler is actually doing with it. Check the object code that it is generating, and make sure that it is sensible and that the function's code is getting inlined.

Such an examination can be quite revealing. For example, let's consider x86-64, where branches can be quite expensive in cases where branch prediction is foiled (which is really the only time when this is an interesting question, so let's assume that cond is completely unpredictable). Almost all compilers are going to generate the following for the naive implementation:

reset_if_true(void*&, bool):
    test   sil, sil              ; test 'cond'
    je     CondIsFalse
    mov    QWORD PTR [rdi], 0    ; set 'ptr' to nullptr, and fall through
  CondIsFalse:
    ret

This is about as tight a piece of code as you could imagine. But if you put the branch predictor in a pathological case, it might end up being slower than using a conditional move:

reset_if_true(void*&, bool):
    xor    eax, eax              ; pre-zero the register RAX
    test   sil, sil              ; test 'cond'
    cmove  rax, QWORD PTR [rdi]  ; if 'cond' is false, set the register RAX to 'ptr'
    mov    QWORD PTR [rdi], rax  ; set 'ptr' to the value in the register RAX
    ret                          ;  (which is either 'ptr' or 0)

Conditional moves have a relatively high latency, so they are considerably slower than a well-predicted branch, but they might be faster than a completely unpredictable branch. You would expect a compiler to know this when targeting the x86 architecture, but it doesn't (at least in this simple example) have any knowledge about cond's predictability. It assumes the simple case, that branch prediction will be on your side, and generates code A instead of code B.

If you decide that you want to encourage the compiler to generate branchless code because of an unpredictable condition, you might try the following:

void reset_if_true_alt(void*& ptr, bool cond)
{
    ptr = (cond) ? nullptr : ptr;
}

This succeeds in persuading modern versions of Clang to generate branchless code B, but is a complete pessimization in GCC and MSVC. If you hadn't checked the generated assembly, you wouldn't have known that. If you want to force GCC and MSVC to generate branchless code, you will have to work harder. For example, you might use the variation posted in the question:

void reset_if_true(void*& ptr, bool cond)
{
    void* p[] = { ptr, nullptr };
    ptr = p[cond];
}

When targeting x86, all compilers generate branchless code for that, but it is not especially pretty code. In fact, none of them generate conditional moves. Instead, you get multiple accesses to memory in order to build the array:

reset_if_true_alt(void*&, bool):
    mov     rax, QWORD PTR [rdi]
    movzx   esi, sil
    mov     QWORD PTR [rsp-16], 0
    mov     QWORD PTR [rsp-24], rax
    mov     rax, QWORD PTR [rsp-24+rsi*8]
    mov     QWORD PTR [rdi], rax
    ret

Ugly and probably very inefficient. I'd predict that it gives the conditional jump version a run for its money even in the case where the branch is mispredicted. You'd have to benchmark it to be sure, of course, but it is probably not a good choice.

If you were still desperate to eliminate the branch on MSVC or GCC, you'd have to do something uglier involving reinterpreting the pointer bits and twiddling them. Something like:

void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    p &= -(!cond);
    ptr = reinterpret_cast<void*>(p);
}

That will give you the following:

reset_if_true_alt(void*&, bool):
    xor   eax, eax
    test  sil, sil
    sete  al
    neg   eax
    cdqe
    and   QWORD PTR [rdi], rax
    ret

Again, here we've got more instructions than a simple branch, but at least they're relatively low-latency instructions. A benchmark on realistic data will tell you if the tradeoff is worth it, and give you the justification you need to put in a comment if you're actually going to check in code like this.

Once I went down the bit-twiddling rabbit hole, I was able to force MSVC and GCC to use conditional move instructions. Apparently they weren't doing this optimization because we were dealing with a pointer:

void reset_if_true_alt(void*& ptr, bool cond)
{
    std::uintptr_t p = reinterpret_cast<std::uintptr_t&>(ptr);
    ptr = reinterpret_cast<void*>(cond ? 0 : p);
}
reset_if_true_alt(void*&, bool):
    mov    rax, QWORD PTR [rdi]
    xor    edx, edx
    test   sil, sil
    cmovne rax, rdx
    mov    QWORD PTR [rdi], rax
    ret

Given the latency of CMOVNE and the similar number of instructions, I'm not sure if this would actually be faster than the previous version. The benchmark you ran would tell you if it was.

Similarly, if we bit-twiddle the condition, we save ourselves one memory access:

void reset_if_true_alt(void*& ptr, bool cond)
{
   std::uintptr_t c = (cond ? 0 : -1);
   reinterpret_cast<std::uintptr_t&>(ptr) &= c;
}
reset_if_true_alt(void*&, bool):
     xor    esi, 1
     movzx  esi, sil
     neg    rsi
     and    QWORD PTR [rdi], rsi
     ret

(That's GCC. MSVC does something slightly different, preferring its characteristic sequence of neg, sbb, neg, and dec instructions, but the two are morally equivalent. Clang transforms it into the same conditional move that we saw it generate above.) This may be the best code yet if we need to avoid branches, considering that it generates sane output on all tested compilers while preserving (some degree of) readability in the source code.

Cody Gray - on strike
  • In your `reset_if_true` with `cmove`, you zero out `eax`, but then read from `rax`. Won't this leave the upper 32 bits dependent on the old value of `rax`? – joelw Jun 21 '16 at 14:54
  • 1
    @joel Nope. That's a common little optimization based on the fact that, in the AMD64 architecture, [any operations on the lower 32 bits of a register automatically zero the upper 32 bits](http://stackoverflow.com/questions/11177137/why-do-most-x64-instructions-zero-the-upper-part-of-a-32-bit-register). So `xor eax, eax` is equivalent to `xor rax, rax`, but the former is shorter, which translates to faster. – Cody Gray - on strike Jun 21 '16 at 14:57
  • 4
    @CodyGray: I could not expect a better answer. I will benchmark all these solutions and edit my question to report the results. Thank you very much. – shrike Jun 21 '16 at 15:24
  • I hadn't thought of the local-array solution, nice! If the function can be nonreentrant (e.g. because the program is single-threaded), then you can save 1 `mov`, and possibly some instruction encoding bytes, by making the array `static` (or global) and just overwriting its 2nd element. – j_random_hacker Jun 21 '16 at 15:29
  • @j_random I can't take credit for that one, shrike proposed the local-array approach in the question. Your proposed optimization is a good thought, but it looks like the only thing you save is the need to `mov` 0 into the first element, which saves you 4 measly bytes. – Cody Gray - on strike Jun 21 '16 at 15:36
  • @CodyGray Interesting. Thanks. – joelw Jun 21 '16 at 15:37
  • 4
    Important to note that the `reset_if_true_alt` code variants are all no longer portable, and the last one even invokes undefined behavior. – Ben Voigt Jun 21 '16 at 15:44
  • @BenVoigt: I thought this would be well-defined behavior though... which paragraph of the standard states this is UB? Thanks. – shrike Jun 21 '16 at 17:17
  • 9
    @shrike: For the portability issue, a null pointer value is indicated by a compile-time constant integral `0`. The compiler will convert this to the actual bit pattern your platform uses for null pointers, which on some platforms is not all-zero. A computed zero, like the one you get using bitwise-AND, will result in an all-zero bit pattern, not the correct platform-specific pattern. For the undefined behavior, the last code snippet using `reinterpret_cast` is a violation of the strict aliasing rule. It's not even guaranteed that `uintptr_t` and `void*` are the same size. – Ben Voigt Jun 21 '16 at 17:22
  • @BenVoigt: I did not catch this issue with integral 0, thanks for pointing this out. However, could uintptr_t really not be the same size as void*!? I've always thought that uintptr_t was guaranteed to be the same size as a pointer type... http://stackoverflow.com/questions/1845482/what-is-uintptr-t-data-type – shrike Jun 21 '16 at 17:45
  • 1
    @shrike "typically" does not mean "guaranteed". – Ben Voigt Jun 21 '16 at 17:57
  • 1
    @BenVoigt: It's portable to all "normal" CPUs, like x86, MIPS, ARM, etc. C leaves the door open for implementations where there's a real difference between pointers and integers, but like with many things that C leaves undefined, it's well-defined on most target architectures & OSes that `NULL` is represented by an all-zero bit pattern. TL;DR: Important to point this out (and to comment the code accordingly), but not a reason to avoid this code for practical use, IMO. – Peter Cordes Jun 21 '16 at 18:30
  • 2
    On AMD, and Intel Broadwell/Skylake, `cmov` has 1c latency (and 1 uop, like 3-input FMA). On previous Intel uarches, it's 2c. So `test/cmov` is 3c total at worst. But remember this function will be inlined (because that's a key optimization for a ~2 instruction function), so **the condition will already be in flags, not in an actual `bool` integer register**. So we're down to at worst 2c latency. `cmov` does turn the control dependency into a data dependency. (With perfect prediction, the branch is off the critical path from bool input to pointer output thanks to speculative execution). – Peter Cordes Jun 21 '16 at 20:03
  • 1
    One thing to keep in mind about the branching variant is the difficulty of speculating stores. OTOH, the branchless code consequently has somewhat different semantics than the branching version in a multithreaded program. – EOF Jun 21 '16 at 20:33
  • Do you know of any real world architectures or compilers where `uintptr_t` is *smaller* than a pointer, @Ben? I take your point about portability, but I think that's what shrike is asking, whether this is merely a theoretical language-lawyer matter or one of practical concern. And while you're right about the last code violating the strict aliasing rule, I imagine there's a way to write it permissibly using a union. Then again, the whole subscript to this answer is that you are basically writing C++ code in order to get specific assembly output, which means portability is not a goal. – Cody Gray - on strike Jun 22 '16 at 01:04
  • 3
    @CodyGray: It can't be smaller, since all possible pointer values have to roundtrip. Well ok, it *can* be smaller if some pointer bits are stuck-at-0, and I know lots of architectures like that, but none that take advantage of it when defining `intptr_t`. But you should also be concerned about when `intptr_t` is *larger* than a pointer (perhaps the designer chose to make it big enough to hold any function pointer also), because in that case your `reinterpret_cast(ptr) &= c;` trashes the memory adjacent to `ptr`. – Ben Voigt Jun 22 '16 at 04:06
  • I confess if I was *that* serious about what I want the assembly code to be, I would use inline assembly. –  Jun 22 '16 at 05:25
  • 4
    @Hurkyl That's a fair point, but you should also keep in mind that using inline assembly has the nearly inevitable consequence of disabling the compiler's optimizer. It just dumps out your inline assembly exactly as written. Which is fine, presumably you've written your inline assembly as optimally as possible. But the problem is that local, contextual optimizations are also disabled. The assembly instructions aren't interleaved within the surrounding code, and often aren't even inlined at all. For code like this that's called repeatedly, inline assembly almost certainly be a pessimization. – Cody Gray - on strike Jun 22 '16 at 06:15
  • @Peter I'm not completely sure what point you are trying to make with your last comment about `cmov`. I actually didn't know that Broadwell/Skylake had reduced the latency to only 1 cycle, so thanks for pointing that out. But even with that in mind, I still think it's accurate when I say that `cmov` will still be considerably slower than a well-predicted branch, since the branch would effectively be free. As such, there's no reason to use `cmov` unless you know that the nature of your condition is such that it will defeat the branch predictor. Would you not agree with that? – Cody Gray - on strike Jun 22 '16 at 06:20
  • I agree that a jcc is probably good. It's free as far as critical path latency, and it has less impact on total uop throughput required for the common case. (Free if it can micro-fuse with the instruction that set flags). My biggest point was that the bit-manipulation based on the `bool` is a bit silly, because you're optimizing it as a stand-alone function with that calling convention, not what will happen when its inlined. A cmov or a jcc are probably the best options, assuming that the condition will be in flags already, even if the pointer isn't. – Peter Cordes Jun 22 '16 at 06:34
  • @CodyGray How do you generate/display the assembly code fragment for one function only? (x86 and LLVM targets) – Sohail Si Jun 22 '16 at 12:01
  • 1
    @sohail With Clang/LLVM, you pass the `-S` switch to get an assembly code listing. I also like to pass the `-masm=intel` switch to get MASM/Intel syntax for the assembly code, rather than the unsightly GAS/AT&T syntax. That will give you an assembly code listing for all of the functions in your compiland. If you only have one function, you'll only see the assembly output for that one function. Another amazingly fun tool is http://gcc.godbolt.org, which shows the assembly output for snippets of code interactively. I've written a desktop app that does something similar for MSVC. – Cody Gray - on strike Jun 22 '16 at 12:07
  • There have been some beautiful profiles of this. Roughly, if the random branch happens 33%-66% of the time, cmov is faster. If the random branch happens 0-25% or 75%+ of the time, branching is faster. Possibly if there are patterns (like true/false/true/false) the branch predictor will be even better. The graph was this lovely curve with a straight line (for cmov). – Yakk - Adam Nevraumont Jul 26 '16 at 15:12
16

The lowest-hanging fruit here isn't what you think it is. As discussed in several other answers, reset_if_true is going to be compiled to machine code that is as fast as you can reasonably expect to get for what it does. If that's not fast enough, you need to start thinking about changing what it does. I see two options there, one easy, one not so easy:

  1. Change the calling convention:

    template <class T>
    inline T* reset_if_true(T* ptr, bool condition)
    {
        return condition ? nullptr : ptr;
    }
    

    and then change the caller(s) to read something like

    ptr_var = reset_if_true(ptr_var, expression);
    

    What this does is make it more likely that ptr_var will get to live in a register during the critical innermost loop that's calling reset_if_true millions of times a second, and there won't be any memory accesses associated with it. ptr_var getting forced out to memory is the most expensive thing in your code the way it is right now; even more expensive than potentially mispredicted branches. (A sufficiently good compiler may make this transformation for you provided reset_if_true is inlinable, but it's not always possible for it to do so.)

  2. Change the surrounding algorithm, so that reset_if_true does not get called millions of times a second anymore.

    Since you didn't tell us what the surrounding algorithm is, I can't help you with that. I can, however, tell you that doing something involving checking a condition millions of times a second, probably indicates an algorithm with quadratic time complexity or worse, and that always means you should at least think about finding a better one. (There may not be a better one, alas.)

zwol
  • Very informative answer; no other answer/comment mentioned that explicitly, thanks. For (2), condition `cond` is tested no more than 4 to 5 times consecutively, then a new condition is tested; this happens millions of times per second because it is part of a highly conditional algorithm. – shrike Jun 21 '16 at 17:37
  • 1
    The change in calling convention won't matter if it's inlined. Compilers these days optimize away the store/reload round trip when inlining a function that takes some args by reference. If it's not inlined, that's your first problem, since the function is tiny. However, your suggested calling convention looks better to me. Functions that modify their args through non-constant reference only "feel" right if the arg is not a simple scalar (like a pointer or a float), or the return value needs to be something else. – Peter Cordes Jun 21 '16 at 18:36
  • @Peter: I do agree and usually do not use non-const references in function arguments. I used it here because I erroneously thought it would help compiler optimization. – shrike Jun 21 '16 at 19:41
  • You're saying that passing *through* (in param, out as return) will generate better code than using an in-out parameter? Ah, because it doesn't force it to have an address. – JDługosz Jun 21 '16 at 21:22
  • 2
    @JDługosz Yes, exactly. It _is_ a flaw in the language that in-out params force the referent to have an address (if not successfully inlined and optimized out). – zwol Jun 21 '16 at 23:29
  • 1
    @shrike The observation about the algorithm was meant to be more general than you seem to have understood it. You should be looking into ways to avoid having to test *some* condition millions of times a second, not necessarily *this one* condition. – zwol Jun 21 '16 at 23:30
  • @PeterCordes Unfortunately, depending on a bunch of other factors which are not visible in the code the OP posted (we would have to see the callers, in full) the compiler may not be able to optimize out the variable's address having been taken, even if this one function is successfully inlined. – zwol Jun 21 '16 at 23:32
  • I don't understand what your #1 is saying. Are you trying to claim that making it a template function will encourage the compiler to inline it when it wouldn't otherwise do so with a normal function? I have never seen a case where this is true. *"A sufficiently good compiler may make this transformation for you provided reset_if_true is inlinable, but it's not always possible for it to do so."* And how will a templated function change that possibility? The optimizer doesn't even *see* templated functions. It only sees *instantiated* templates, which are indistinguishable from regular fxns. – Cody Gray - on strike Jun 22 '16 at 06:22
  • 2
    @CodyGray: #1 is saying to pass by value and return by value, never taking the address (even as a reference). This would be a significant optimization for your answer, where you consider optimizing it as a non-inline function. The template stuff is just to use the same type for return value and arg. Zwol added in comments that some optimizers may fail to optimize away the need for the variable to have an address even after inlining, in more complicated circumstances. – Peter Cordes Jun 22 '16 at 12:17
11

As long as sizeof(size_t) == sizeof(void*), nullptr is represented in binary as all zeros, and size_t uses all of its bits (or you have std::uintptr_t), you can do this:

// typedef std::uintptr_t ptrint_t; // uncomment if you have it
typedef size_t ptrint_t; // comment out if you have std::uintptr_t

void reset_if_true(void*& ptr, bool cond)
{
    ((ptrint_t&)ptr) &= -ptrint_t( !cond );
}

Note, however, that the time taken by the cast from bool to size_t is very much implementation-dependent and might involve a branch itself.

lorro
  • 3
    If you want an integral type, go for `std::[u]intptr_t`. – chris Jun 21 '16 at 14:23
  • @lorro: pretty good competitor; don't know why it did not get upvotes... – shrike Jun 21 '16 at 15:27
  • 2
    This also assumes that the null pointer has the same bit pattern as integer zero, which is not always true. (it *also* assumes that `size_t` uses all of its bits, which is more likely to be true, but still not guaranteed) –  Jun 22 '16 at 05:27
  • @chris: std::[u]intptr_t is optional, thus might not exist. If it does, naturally, use it instead. – lorro Jun 22 '16 at 15:00
  • @Hurkyl: true and true, but that's the closest we could get. Updating post with assumptions. – lorro Jun 22 '16 at 15:00
5

The code is absolutely straightforward.

You can certainly make things a lot faster by inlining the function (if the compiler didn't inline it on its own). For example, inlining could mean that the pointer variable you are setting to null could stay in a register.

Other than that, this code is so straightforward, if there are any tricks that could be used to make it faster, the compiler would use them.

gnasher729
  • 1
    First thing is see what it was compiled to - if it's already been inlined to a single conditional store operation, then there's no point wasting effort changing the source code. – OrangeDog Jun 21 '16 at 13:54
  • @gnasher: I assume the function `reset_if_true()` is inlined when compiling with full optimizations; do I really need to specify the `inline` qualifier anyway? – shrike Jun 21 '16 at 13:54
  • @shrike You should be sure to hint to your compiler that you want it to be inline'd regardless ;) – Chris Sprague Jun 21 '16 at 13:55
  • 2
    The inline version will still have the branching. – Sohail Si Jun 21 '16 at 14:04
  • @Sohail Si: Not necessarily. As OrangeDog said, this is probably compiled into a conditional move/store. – AVH Jun 21 '16 at 14:53
  • @OrangeDog Does a compiler actually do such a thing? – Sohail Si Jun 21 '16 at 14:55
  • @Darhuuk Nope. I tried four different compilers, and none of them emit a conditional move. Inlining *might* change that, if the surrounding code changes the heuristics, but there is no guarantee. The *important* thing that OrangeDog said was to check and see. If you guess, you run the risk of being wrong. :-) – Cody Gray - on strike Jun 21 '16 at 14:55
  • @Cody Gray Checking is indeed the way to go. Great answer! – AVH Jun 21 '16 at 14:58
  • @SohailSi Latest GCC certainly has that optimisation (called `if-conversion2`) but I don't know under what precise circumstances it will use it. – OrangeDog Jun 21 '16 at 15:01
  • @OrangeDog Interesting. – Sohail Si Jun 21 '16 at 15:03
  • 2
    @SohailSi in fact, Cody's answer has some concrete examples of when various compilers will make that optimisation. – OrangeDog Jun 21 '16 at 15:34
  • @shrike: Yes, you should use `static inline` so the compiler doesn't waste space emitting a non-inline definition. That also allows inlining even when building a library for Unix/Linux, which support symbol interposition. Even calls within the library to global (non `static`) functions have to go through the PLT and can't be inlined. – Peter Cordes Jun 21 '16 at 18:43
3

Update: I reimplemented my answer.

The idea in the following code is to convert the pointer into an integer and multiply it by a number (cond). Note the use of inline. Multiplication may pipeline well on an architecture that uses pipelining.

#include <cstdint>

template <typename T>
inline T* reset_if_true(T* p, bool cond) {
  void* ptr = (void*)p; // The optimising compiler (-O3) will get rid of unnecessary variables.
  intptr_t ptrint;
  // This is an unrecommended practice.
  ptrint = (intptr_t)ptr;
  ptrint = ptrint * cond;  // Multiply the integer
  void* ptr2 = (void*)ptrint;
  T* ptrv = (T*)ptr2;
  return ptrv;
}

Example usage:

#include <iostream>
#include <vector>

void test1(){
    //double d = 3.141592;
    //typedef std::vector<double> mytype;
    std::vector<double> data = {3,1,4};
    auto ptr = &data;
    std::cout << (void*)ptr << std::endl;
    auto ptr2 = reset_if_true(ptr, 1);
    //auto ptr2 = (mytype*)reset_if_true(ptr, 1);
    std::cout << reset_if_true(ptr, 1) << " -> " << (*(reset_if_true(ptr, 1))).size() << std::endl;
    std::cout << reset_if_true(ptr, 2) << " -> "<< (*(reset_if_true(ptr, 2))).size() << std::endl;
    std::cout << reset_if_true(ptr, 0) <<
        " is null? " << (reset_if_true(ptr, 0) == NULL) <<  // Dont dereference a null.
        std::endl;
}


void test2(){
    double data = 3.141500123;
    auto ptr = &data;
    std::cout << (void*)ptr << std::endl;
    auto ptr2 = reset_if_true(ptr, 1);
    //auto ptr2 = (mytype*)reset_if_true(ptr, 1);
    std::cout << reset_if_true(ptr, 1) << " -> " << (*(reset_if_true(ptr, 1))) << std::endl;
    std::cout << reset_if_true(ptr, 2) << " -> "<< (*(reset_if_true(ptr, 2))) << std::endl;
    std::cout << reset_if_true(ptr, 0) <<
        " is null? " << (reset_if_true(ptr, 0) == NULL) <<  // Dont dereference a null.
        std::endl;

}

int main(){ test1(); test2(); }

Compile using these flags: -O3 -std=c++14. The output is:

0x5690
0x5690 -> 3
0x5690 -> 3
0 is null? 1
0x5690
0x5690 -> 3.1415
0x5690 -> 3.1415
0 is null? 1

It might have memory-alignment problems when options such as `-s FORCE_ALIGNED_MEMORY=1` are used on the compiler command line. Also see reinterpret_cast. Don't forget to use -O3.

The cond can be any non-zero value. There is some room for performance improvement here if we know it is never anything other than 0 or 1; in that case, you can use int or another integer type for cond.

PS. This is an updated answer. The previous answer, as I already clearly mentioned in my answer, had issues. The solution is using intptr_t, and of course inline.

Compiler options used:

 em++ reset_if_true.cpp -O3 -std=c++14 -o reset_if_true.js
 node reset_if_true.js
Sohail Si
  • 1
    How could this be faster than just an `if` branch? At the very least you're throwing 4 more things onto the stack for each call... – Chris Sprague Jun 21 '16 at 13:47
  • 4
    Totally broken on any compiler with 32 bit long and 64 bit pointers. – gnasher729 Jun 21 '16 at 13:52
  • An answer with "never use this solution" and "this is unrecommended" gets -1. – jonspaceharper Jun 21 '16 at 13:55
  • 1
    Fair enough. I know, but another solution using the same idea may exist. It may help finding the correct answer. PS, using Null pointers is not recommended anyway. – Sohail Si Jun 21 '16 at 13:56
  • 1
    @ChrisSprague It can be. An optimising compiler removes unnecessary variables. Try pasting it here: `http://mbebenita.github.io/WasmExplorer/` – Sohail Si Jun 21 '16 at 13:57
  • It also can have the alignment problem. – Sohail Si Jun 21 '16 at 14:06
  • @gnasher729 Yes it is broken. But I said it. It is also limited to a flat memory model. It was not the case in old XT/AT PCs – Sohail Si Jun 21 '16 at 14:09
  • I fixed the problem using `intptr_t`. You should add `#include <cstdint>`. I have not tested it yet though. – Sohail Si Jun 21 '16 at 14:10
  • @gnasher729 now the answer is improved. Please remove your downvote if you agree with the new answer. (I still would not recommend this type of performance improvement, but it was asked if there is a way without the branch, and this is it). – Sohail Si Jun 21 '16 at 14:47
  • @JonHarper The answer is changed. – Sohail Si Jun 21 '16 at 15:08
  • I was the first person who suggested `reinterpret_cast`, `intptr_t` and `inline`, and I ended up getting 4 downvotes. – Sohail Si Jun 21 '16 at 15:09
  • 2
    Multiply is actually not bad as a way to conditionally zero based on a boolean. It's not pipelining that's needed for this to be good (although all CPUs are pipelined anyway), it's a low-latency multiplier (assuming the pointer result is on the critical path). Modern Intel CPUs have 3 cycle `imul r64, r64`, so this is potentially very good if the `bool` actually exists as a 0 or 1 integer, rather than just a flag condition after inlining. If the bool is produced at the site of inlining, it's going to be better to use a `cmov` (at worst 2 cycle latency, 1c on Broadwell and later, and AMD) – Peter Cordes Jun 21 '16 at 18:49
  • 1
    perf numbers from [Agner Fog's tables](http://agner.org/optimize/), and see also other links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info). BTW, you might want to use `uintptr_t`. unsigned integers are the normal way to make sure nothing weird happens, although signed multiply by 0 or 1 is fine. – Peter Cordes Jun 21 '16 at 18:51
  • 1
    I considered multiply as a conditional-zeroing method [in an answer a while ago](http://stackoverflow.com/questions/34407437/what-is-the-efficient-way-to-count-set-bits-at-a-position-or-lower/34410357#34410357). In that case, it was ok when I couldn't get gcc to emit anything better, but the OP and I eventually came up with a way to hand-hold gcc into making ideal asm. (And IIRC, clang optimized the multiply to the same good asm as a ternary operator.) – Peter Cordes Jun 21 '16 at 18:54
  • 1
    Good, I'm glad someone found them useful :) – Peter Cordes Jun 22 '16 at 12:13