
I'm trying to understand what this testb instruction (x86-64) will do.

testb $1, %al

What is the value of $1 here? Is it all ones (0xFF) or a single 1 (0x1)?

The assembly is produced by clang for the following program:

#include <atomic>
std::atomic<bool> flag_atomic{false};

extern void f1();
extern void f2();

void foo() {
    bool b = flag_atomic.load(std::memory_order_relaxed);
    if (b == false) {
        f1();
    } else {
        f2();
    }   
}

The relevant assembly (from clang++ -S test.cpp -O3) is the following:

Lcfi2:
  .cfi_def_cfa_register %rbp
  movb  _flag_atomic(%rip), %al 
  testb $1, %al ; <<<<------------
  jne LBB0_2
A. K.
  • `b` operand-size is *byte*, not *bit*, so `$1` is a byte with only the low bit set. Weird that clang uses a separate load instead of `test` with a memory operand; the function doesn't use the value again so no need to have it in a register. Or actually, an [immediate + RIP-relative wouldn't micro-fuse](https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes) on Intel. It might also fail to macro-fuse, so you could end up with 3 fused-domain uops on Intel from `test $1, _flag_atomic(%rip) / jnz`. – Peter Cordes May 02 '18 at 13:56
  • Note that `movb` is already a micro-fused load + merge into the low byte of RAX, though, [on recent Intel](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to). `movzbl` would avoid the false dependency. So clang already has 3 unfused-domain uops (2 of them from the `movb`), but only 2 total fused-domain uops this way. – Peter Cordes May 02 '18 at 13:57
  • Yes, for some reason, this only happens with atomic_load, for a scalar load clang generates `cmpb` without using `%al`. – A. K. May 02 '18 at 14:16
  • Ah, yeah `atomic` gets treated specially inside the compiler, and (like `volatile`) sometimes ends up gimping the optimizer. Current compilers don't even optimize away repeated atomic loads, although the current standard does allow that: [Can and does the compiler optimize out two atomic loads?](//stackoverflow.com/q/41820539) and especially my answer on [Why don't compilers merge redundant std::atomic writes?](//stackoverflow.com/q/45960387) – Peter Cordes May 02 '18 at 14:22
  • FYI, http://godbolt.org/ is really nice for playing with compiler asm output (e.g. your code: https://godbolt.org/g/oNarvk) with different options / compilers / versions. – Peter Cordes May 02 '18 at 14:24
  • Thanks for sharing the resources. Very useful! I regularly use godbolt btw. PS: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85610 – A. K. May 02 '18 at 14:32

2 Answers


In AT&T syntax, $ is the prefix for immediate values; $1 is a plain 1, so your instruction sets the flags according to the least significant bit of al.

is it all ones (0xFF) or a single 1 (0x1)?

All ones would be

testb $-1, %al

or (exact same machine code, just disassembly preference)

testb $0xff, %al

which incidentally would have the exact same semantics as

testb %al, %al

(a mask of 0xff over an 8-bit register doesn't mask anything). This form is also valid for your code: for a boolean there should be no need to mask anything out to check whether it's true, and indeed gcc prefers this last version for your code.
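To make the three masks concrete, here is a minimal C++ sketch of what `testb` does to the zero flag. The helper `zf_after_testb` is a hypothetical name invented for illustration; it models only ZF and ignores the other flags that `test` also sets.

```cpp
#include <cstdint>

// Hypothetical helper (not from the question): models only the zero flag (ZF)
// after `testb $imm, %al`. test ANDs its operands, discards the result, and
// sets ZF when that result is zero.
constexpr bool zf_after_testb(std::uint8_t imm, std::uint8_t al) {
    return (al & imm) == 0;
}

// $1 examines only bit 0 of %al.
static_assert(zf_after_testb(1, 0x00), "bit 0 clear: ZF set, jne not taken");
static_assert(!zf_after_testb(1, 0x01), "bit 0 set: ZF clear, jne taken");
static_assert(zf_after_testb(1, 0x02), "bits other than bit 0 are ignored");

// $-1 and $0xff encode the same byte.
static_assert(static_cast<std::uint8_t>(-1) == 0xff, "-1 as a byte is 0xff");

// With mask 0xff every bit counts, which matches testb %al, %al (al & al).
static_assert(!zf_after_testb(0xff, 0x02), "0xff mask sees every bit");
static_assert(zf_after_testb(0xff, 0x02) == zf_after_testb(0x02, 0x02),
              "same ZF as testing al against itself");
```

The difference between the masks only shows up for bytes other than 0 and 1, which a well-formed bool should never hold.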


movb  _flag_atomic(%rip), %al 
testb $1, %al
jne LBB0_2

In Intel syntax (no prefixes, no suffixes, destination-then-source operand order, explicit memory addressing syntax) this is

mov al, [rip+_flag_atomic]
test al, 1
jne LBB0_2

And, in pseudo-C:

%al = _flag_atomic;
if((%al & 1) != 0) goto LBB0_2;

(jne is an alias of jnz, which is probably clearer in this case).
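The same three instructions can be sketched as compilable C++. This is only a model: `Path` and `foo_model` are hypothetical names, and it assumes that `LBB0_2` is the block that calls `f2()`, which the quoted assembly does not show.

```cpp
#include <cstdint>

// Hypothetical stand-in for "which function gets called"; not from the question.
enum class Path { F1, F2 };

constexpr Path foo_model(std::uint8_t flag_byte) {
    // movb _flag_atomic(%rip), %al  -> flag_byte plays the role of %al
    // testb $1, %al                 -> compute flag_byte & 1, keep only the flags
    // jne LBB0_2                    -> jump when that result is nonzero
    // Assumption: the jump target LBB0_2 is the f2() call, fall-through is f1().
    return (flag_byte & 1) != 0 ? Path::F2 : Path::F1;
}

static_assert(foo_model(0) == Path::F1, "false: fall through to f1()");
static_assert(foo_model(1) == Path::F2, "true: jump to LBB0_2");
```

Note the parentheses around `flag_byte & 1`: in C and C++, `!=` binds tighter than `&`.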

Matteo Italia

0x is the prefix for a hexadecimal number and 0 is the prefix for an octal number; if you do not give any prefix, the number is decimal. In your case this is 1 in the decimal number system.
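These prefix rules are the same ones C and C++ use, so they can be tried out with `std::stoi` and base 0, which auto-detects the prefix. The sketch below is only an illustration: `parse_immediate` is a hypothetical helper, not how the assembler itself parses operands.

```cpp
#include <string>

// Hypothetical helper (not part of any assembler): strips an optional
// leading '$' and parses the rest with C-style prefix detection.
inline int parse_immediate(const std::string& text) {
    const std::string digits =
        (!text.empty() && text[0] == '$') ? text.substr(1) : text;
    // Base 0 means: "0x" prefix is hex, leading "0" is octal, otherwise decimal.
    return std::stoi(digits, nullptr, 0);
}
```

With this helper, `parse_immediate("$1")` gives 1, `parse_immediate("$0xff")` gives 255, and `parse_immediate("$010")` gives 8.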

Sandeep