Consider this C code:
int f(void) {
    int ret;
    char carry;
    __asm__(
        "nop # do something that sets eax and CF"
        : "=a"(ret), "=@ccc"(carry)
    );
    return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
    nop # do something that sets eax and CF
    setc %cl
    movl %eax, %edx
    negl %edx
    testb %cl, %cl
    cmovne %edx, %eax
    ret
If I change char carry to int carry, I instead get this:
f:
    nop # do something that sets eax and CF
    setc %cl
    movl %eax, %edx
    movzbl %cl, %ecx
    negl %edx
    testl %ecx, %ecx
    cmovne %edx, %eax
    ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The two programs are equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
    nop # do something that sets eax and CF
    jnc .L1
    negl %eax
.L1:
    ret
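For anyone who wants to reproduce the comparison, here is a minimal sketch with the nop placeholder swapped for a concrete instruction. The addl, the two parameters, the names f_char and f_int, and the ADD_WITH_CARRY macro are all assumptions made for illustration; they are not the real code. The plain-C fallback is only there so the file still builds on compilers or targets without x86 flag output operands.

```c
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
/* Hypothetical stand-in for the real asm: eax = a + b, CF = carry-out. */
#define ADD_WITH_CARRY(ret, carry, a, b) \
    __asm__("addl %3, %0" : "=a"(ret), "=@ccc"(carry) : "0"(a), "r"(b))
#else
/* Fallback in plain C (assumption: only needed where flag outputs are unavailable). */
#define ADD_WITH_CARRY(ret, carry, a, b) \
    do { unsigned s_ = (a) + (b); (carry) = s_ < (a); (ret) = (int)s_; } while (0)
#endif

/* Variant with a char flag output, as in the first listing. */
int f_char(unsigned a, unsigned b) {
    int ret;
    char carry;
    ADD_WITH_CARRY(ret, carry, a, b);
    return carry ? -ret : ret;
}

/* Variant with an int flag output, as in the second listing. */
int f_int(unsigned a, unsigned b) {
    int ret;
    int carry;
    ADD_WITH_CARRY(ret, carry, a, b);
    return carry ? -ret : ret;
}
```

Compiling this with gcc -O3 -S and again with gcc -Os -S should make it possible to compare the code generated for the two variants side by side, although the exact output may differ between GCC versions.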
It seems like one of two things must be true, but I'm not sure which:
- A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
- A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that the movzbl is preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.