3

Can a single x86 instruction toggle a boolean value between '0' and '1'?

I thought of following ways but all result in two instruction with -O3 flag of gcc.

status =! status;

status = 1 - status;

status  = status == 0 ? 1: 0;

int flip[2] = {1, 0};
status = flip[status];

Is there a faster way to do this?

This is what I tried: https://godbolt.org/g/A3qNUw


What I need is a function which toggles the input and returns, written in a way that compiles to one instruction. Something similar to this function:

int addOne(int n) { return n+1; }

compiles on Godbolt to this:

  lea eax, [rdi+1]    # return n+1 in a single instruction
  ret
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 5
    `status ^=1;` This is one way. – user2736738 Mar 01 '18 at 18:13
  • 3
    What two instructions? What CPU? – Eugene Sh. Mar 01 '18 at 18:15
  • 1
    The bit flipping itself is surely single instruction, i.e. https://godbolt.org/g/XznrRA (the second instruction is resolving calling convention). So your premise in question is wrong, and your question doesn't make sense. Add context, where you want single instruction, what you tried, and how it failed. – Ped7g Mar 01 '18 at 18:21
  • `xor` exists in every architecture. If you don't know this instruction, pick up a good book to learn some basic assembly. – llllllllll Mar 01 '18 at 18:26
  • In your godbolt example you can do `return status^1;` .. it will look as two instructions on x86_64 target, because again one is resolving calling convention, but the important one is `xor eax,1` and that one will be inlined when used in real code. If you are using `uint8` data type, then `data ^ ` is single `xor` instruction (when it is in stand-alone function and not inlined, it will have additional `mov` and `ret`, i.e. 3 instructions, but gcc will inline it unless you prevent that). – Ped7g Mar 01 '18 at 18:26
  • Edited the question. I mean two instruction vs one to complete the job not just the flipping part. "status -1" for example is just one 'lea' instruction. I hope I am making sense here. – Bhupendra dubey Mar 01 '18 at 18:27
  • Perhaps `xor value,8` for bit 3. But why do you need a function for such a simple operation? – Weather Vane Mar 01 '18 at 18:41
  • 1
    About edit2: you can consider your example with `+1` as sort of bit-flip too, but with damaging upper bits (you can flip any bit by doing `+that_bit_value`, when you don't care about bits above it being modified. But I don't think there's single instruction SYSTEM V ABI function flipping only the desired bit, because the argument comes in `rdi/edi/...` and result is returned in `rax/eax/...` (and `ret`!). But that's weird requirement, because that's not something you would need in real world production, so whatever you are trying to achieve is purely artificial and irrelevant to real world SW. – Ped7g Mar 01 '18 at 19:24
  • @Ped7g: exactly right. After inlining, you'll just get an `xor` without a `mov`, if the compiler doesn't also need / want to keep the old unflipped value. If the calling convention passed the first arg in `eax`, you'd be all set. (e.g. ARM's standard calling convention does this: first arg in `r0`, return in `r0`. So even in Thumb mode (where 3-operand `eor r3, r2, #1` isn't available), `eor r0, #1` / `bx lr` is all you need, no copying from arg-passing register to return-value register. x86 doesn't have a copy-and-xor instruction, only copy-and-add, and a few BMI instructions. – Peter Cordes Mar 01 '18 at 19:44

3 Answers3

12

To flip a bit in an integer, use xor like this: foo ^= 1.

gcc knows this optimization already for bool, so you can return !status; like a normal person without losing any efficiency. gcc does compile status ^= 1 to an xor instruction as well. In fact, all your ideas except the table lookup compile to a single xor instruction with bool input / return value.

Check it out on the Godbolt compiler explorer with gcc -O3, with asm output panes for bool and int.

MYTYPE func4(MYTYPE status) {
    status ^=1;
    return status;
}

  # same code for bool or int
  mov eax, edi
  xor eax, 1
  ret

vs.

MYTYPE func1(MYTYPE status) {
    status = !status;
    return status;
}

  # with -DMYTYPE=bool
  mov eax, edi
  xor eax, 1
  ret

  # with int
  xor eax, eax
  test edi, edi
  sete al
  ret

Semi-related: XOR is add-without-carry. So if you only care about the low bit, you can copy-and-flip-low with lea eax, [rdi+1]. See Check if a number is even where that's useful in combination with and eax, 1 to get it done in 2 instructions.


Why is bool different from int?

The x86-64 System V ABI requires that callers passing a bool pass a 0 or 1 value, not just any non-zero integer. Thus, the compiler can assume that about the input.

But with int foo, the C expression !foo requires "booleanizing" the value. !foo has type _Bool / (aka bool if you #include <stdbool.h>), and converting that back to an integer must produce a value of 0 or 1. If the compiler doesn't know that foo must be 0 or 1, it can't optimize !foo to foo^=1, and can't realize that foo ^= 1 flips a value between truthy / falsy. (In the sense that if(foo) means if(foo != 0) in C).

This is why you get test/setcc (zero-extended into a 32-bit int by xor-zeroing a register before the test).

Related: Boolean values as 8 bit in compilers. Are operations on them inefficient?. Stuff like (bool1 && bool2) ? x : y isn't always compiled as efficiently as you might hope. Compilers are pretty good, but do have missed-optimization bugs.


What about that extra mov instruction?

It will go away when inlining, if the compiler doesn't need / want to keep the old un-flipped value around for later. But in a stand-alone function, the first arg is in edi, and the return value needs to be in eax (in the x86-64 System V calling convention).

Tiny functions like this are a close approximation to what you might get as part of a large function (if this flip couldn't be optimized into something else), but needing the result in a different register is a confounding factor.


x86 doesn't have a copy-and-xor integer instruction, so for a stand-alone function it will take at least a mov to copy from the arg-passing register to eax.

lea is special: it's one of the few integer ALU instructions that can write the result to a different register instead of destroying its input. lea is a copy-and-shift/add instruction, but there's no copy-and-xor instruction in x86. Many RISC instruction sets have 3-operand instructions, for example MIPS could do xor $t1, $t2, $t3.

AVX introduced non-destructive versions of vector instructions (saving a lot of movdqa / movups register-copying in a lot of code), but for integer there are only a few new instructions that do different things. rorx eax, ecx, 16 for example does eax = rotate_right(ecx, 16), and uses the same VEX encoding that non-destructive AVX instructions use.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
4

From this code run of Godbolt (This code basically contains few of the options I tried) it seems XORing gives a one statement which can do it:-(As you said toggling is what you are looking for)

status ^= 1;

boils down to just single instruction of (this was with -O0)

xor DWORD PTR [rbp-4], 1

With -O3 you can see all the methods you mentioned use xor anf this inparticular comes to mov eax, edi/xor eax, 1.

And this ensures the state being toggled to and fro from 0 to 1 and vice versa. (Because there is xor statement - which is there in most architectures and useful in many cases).

I have let the other option of memory access fall through - because pointer arithmetic and dereferecing the address would not be faster than these ones (have possible memory access).

I have suggested one way of doing based on the small messing around in godbolt. What you can do from here is - compare different ways of doing it and then get a result of time you are getting. Supposedly, the result you will get XOR-ing won't be that bad on your machine's architecture.

Interestingly as Peter Cordes in the example showed that this would hold true for booleans also.

With this example it is clear that compiler optimizes to the unoptimized code's xoring with 1 version. This is one way supports the fact that xoring would yield better result in case of normal int operation. With booleans when compiled using -O3 all the ones shown above trickles to mov eax, edi/xor eax, 1.

user2736738
  • 30,591
  • 5
  • 42
  • 56
  • Yes, `x ^= 1` is the right answer, but compiling without optimization is not a good way to demonstrate. Interestingly, gcc optimizes this even for `bool`, so a version that takes an arg and returns a value compiles to `mov eax,edi` / `xor eax,1`. https://godbolt.org/g/7uM1XW. Note that I also fixed your `func1` to flip a global variable, so it can optimize without optimizing away. – Peter Cordes Mar 01 '18 at 19:31
  • @PeterCordes.: Thanks for the comment - would you let me know the meaning of last comment? *flip a global variable, so it can optimize without optimizing away* ? – user2736738 Mar 01 '18 at 19:35
  • Look at my Godbolt link. I made `status` a global instead of a local with automatic storage, so `func1` has a side effect that can't optimize away with `-O3`. If you're ok with forcing the compiler to update memory instead of a register, global variables are convenient. – Peter Cordes Mar 01 '18 at 19:38
  • @PeterCordes.: I did look at it - I get it - you meant if it was not made global then gcc wouldn' even do that thing - as it has no side effect - now it is bound to do that as it is a global - so -O3 cant touch it. – user2736738 Mar 01 '18 at 19:41
  • Yes, try adding `-O3` to your version so it doesn't compile to garbage code that does each C statement separately. `gcc -O0` spills / reloads between statements so you can change `status` with a debugger. – Peter Cordes Mar 01 '18 at 19:47
  • @PeterCordes.: I have done [this](https://godbolt.org/g/wLUDaL) to make it more clear. Please check once and let me know if I am missing something. – user2736738 Mar 01 '18 at 19:52
  • Yes, breaking it into separate functions is much better. But that allows you to do even better: It would make more sense to take function args and return a value, like the OP showed in the last edit. That's the normal way to see how gcc optimizes a tiny snippet without any surrounding code. – Peter Cordes Mar 01 '18 at 19:54
  • @PeterCordes.: That's a good suggestion. I edited. And yes all of the functions when optimizing follow the xor-ing. It's interesting. [This](https://godbolt.org/g/fWXWRd) what I did. – user2736738 Mar 01 '18 at 19:58
  • In most of your functions, you're forcing the compiler to booleanize an `int`, so it can't optimize with `xor reg, 1`. Instead it's `xor`-zeroing `eax` before a test/`setcc al`. Because the way you wrote it, C semantics demand a 0 / 1 output even if the intput `int` was `5` or `-100` or whatever. It would be more interesting to see how well gcc does with `bool` inputs, IMO; to see whether or not it comes up with the `xor eax,1` "on its own" if you write `return !status` like a normal person. – Peter Cordes Mar 01 '18 at 20:05
  • Update: yup, gcc knows this optimization for `!status` with `bool status`. https://godbolt.org/g/tsJgWw – Peter Cordes Mar 01 '18 at 20:08
  • That's not important (when you compile with optimization enabled). The important part is looking at the difference between making `status` a `bool` vs. an `int`. See my last Godbolt link. – Peter Cordes Mar 01 '18 at 20:10
  • @PeterCordes.: So that means when it is sure tha it will be `0/1` or of type bool it is optimizing all to that one.(It's same). So this proves in another way that doing the xor is better as the optimized bool version boils down to unoptimized code's `x^=1` part? – user2736738 Mar 01 '18 at 20:10
  • @PeterCordes.: I have edited my answer to incorporate the finding that you mentioned. If possible let me know. – user2736738 Mar 01 '18 at 20:18
3

If you are down to trying to micro-optimize boolean operations, you are either prematurely optimizing or you are doing a lot of operations on a lot of boolean data. For the former - The answer is don't; for the latter, you may be asking the wrong question. If the real question is how do I optimize (many) operations on (much) boolean data, the answer is to use an alternative representation based on "flags" (a.k.a. use a better algorithm). This will allow you to portably and readably fit more data into cache and perform multiple operations and tests simultaneously.

Why/How is this better?

Cache

Consider a system where the cache line size is 64 bytes. 64 _Bool will fit into the data cache line whereas 8 times that amount will fit. You'll likely have smaller instruction code as well - ranging from 1 additional instruction to 32x fewer. This can make a large difference in tight loops.

Operations

Most operations involve one or two (usually very fast) operations and a single test regardless of how many flags you are testing. Since this can incorporate multiple values simultaneously, each operation can do (typically 32 or 64 times) more work.

Branching

Since multiple operations and tests can be completed simultaneously, what would have been up to 32 (or 64) possible branches can be reduced to one. This can reduce branch mispredictions.

Readability

By using a well named mask constant, a complex nested if-else-if-else block can be reduced to a single readable line.

Portability

_Bool was not available in early versions of C and C++ uses different mechanisms for boolean; however, flags will work in older versions of C and is compatible with C++

Here is a practical example of how to set a mask with flags:

int isconsonant(int c){
    const unsigned consonant_mask = (1<<('b'-'a'))|
    (1<<('c'-'a'))|(1<<('d'-'a'))|(1<<('f'-'a'))|(1<<('g'-'a'))|
    (1<<('h'-'a'))|(1<<('j'-'a'))|(1<<('k'-'a'))|(1<<('l'-'a'))|
    (1<<('m'-'a'))|(1<<('n'-'a'))|(1<<('p'-'a'))|(1<<('q'-'a'))|
    (1<<('r'-'a'))|(1<<('s'-'a'))|(1<<('t'-'a'))|(1<<('v'-'a'))|
    (1<<('w'-'a'))|(1<<('x'-'a'))|(1<<('y'-'a'))|(1<<('z'-'a'));
    unsigned x = (c|32)-'a'; // ~ tolower
    /* if 1<<x is in range of int32 set mask to position relative to `a`
     * as in the mask above otherwise it is set to 0 */
    int ret = (x<32)<<(x&31);
    return ret & consonant_mask;
}
//compiles to 7 operations to check for 52 different values
isconsonant:
  or edi, 32 # tmp95,
  xor eax, eax # tmp97
  lea ecx, [rdi-97] # x,
  cmp ecx, 31 # x,
  setbe al #, tmp97
  sal eax, cl # ret, x
  and eax, 66043630 # tmp96,
  ret

This concept can be used to simultaneously operate on a simulated array of boolean values using something like:

//inline these if your compiler doesn't automatically
_Bool isSpecificMaskSet(uint32_t x, uint32_t m){
    return x==m; //returns 1 if all bits in m are exactly the same as x
}

_Bool isLimitedMaskSet(uint32_t x, uint32_t m, uint32_t v){
    return (x&m) == v;
    //returns 1 if all bits set in v are set in x
    //bits not set in m are ignored
}

_Bool isNoMaskBitSet(uint32_t x, uint32_t m){
    return (x&m) == 0; //returns 1 if no bits set in m are set in x
}

_Bool areAllMaskBitsSet(uint32_t x, uint32_t m){
    return (x&m) == m; //returns 1 if all bits set in m are set in x
}

uint32_t setMaskBits(uint32_t x, uint32_t m){
    return x|m; //returns x with mask bits set in m
}

uint32_t toggleMaskBits(uint32_t x, uint32_t m){
    return x^m; //returns x with the bits in m toggled
}

uint32_t clearMaskBits(uint32_t x, uint32_t m){
    return x&~m; //returns x with all bits set in m cleared
}

uint32_t getMaskBits(uint32_t x, uint32_t m){
    return x&m; //returns mask bits set in x
}

uint32_t getMaskBitsNotSet(uint32_t x, uint32_t m){
    return (x&m)^m; //returns mask bits not set in x
}
technosaurus
  • 7,676
  • 1
  • 30
  • 52
  • 2
    Fun fact: gcc knows how to optimize a `switch` statement into an immediate bitmap, for exactly the same kind of test as your consonant bitmap. https://stackoverflow.com/questions/97987/advantage-of-switch-over-if-else-statement#comment52559281_129515. Related: I used a consonant bitmap [in a code-golf answer](https://codegolf.stackexchange.com/questions/123194/user-appreciation-challenge-1-dennis/123458#123458), with a different method to reject non-letter ASCII codes. The `x<32` is interesting, but out-of-range shift counts are UB in C. gcc's asm does what we want in this case, though. – Peter Cordes Mar 02 '18 at 00:51
  • Nice, `(x<32) << (x&31)` should reliably compile to no extra instructions, on targets like x86 where the shift instructions mask the count (and gcc knows that). The "single readable line" is a bit of a stretch now, though. Maybe a comment like `// 1 << x in the alphabetic range`. Using a boolean->integer conversion as the left hand side of a shift is not an idiom I'd seen before (in C or asm). – Peter Cordes Mar 02 '18 at 06:48
  • Thanks @PeterCordes the `&31` does get optimized away on x86. FWIW I did have `(x<=('z'-'a'))` instead of `(x<32)` but used 32 to indicate that it was in range of `uint32_t` (plus it helped one of the compilers I tested it on with optimization) ... the additonal `&31` now just reiterates that. Using bool->int as the left side of a shift is one of the idiosyncrasies of my coding style that I picked up while optimizing some ctype.h functions to avoid branches and lookup tables. – technosaurus Mar 02 '18 at 08:02
  • Another thought: without BMI2 for single-uop variable-count shifts, on Broadwell and later CMOV is single uop, so the best asm probably uses `bt` to test a bit. i.e. `xor-zero` / `cmp` / `cmov` (x<32 ? consonant_mask : 0) / bt / jcc (or cmov or adc, or do whatever with the `is_consonant` result in CF). – Peter Cordes Mar 02 '18 at 08:35
  • For your method, `(x<32) & (consonant_bitmap >> (x&31))` would create more instruction-level parallelism, because the shift doesn't have to wait for the compare/setcc. As a bonus, it creates a 0 / 1 that works as a `bool`. – Peter Cordes Mar 02 '18 at 08:36