The x86-64 ABI requires that calls to varargs functions (like printf) set %al = the count of floating-point args passed in xmm registers. In this case, you're passing one double, so the ABI requires %al = 1. (Fun fact: C's promotion rules make it impossible to pass a float to a vararg function. This is why there are no printf conversion specifiers for float, only double.)
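As a concrete sketch (hand-written, not gcc's exact output; LC0 is assumed to label the format string, as in the listing further down), the call sequence has to look roughly like this:

# sketch: calling printf with one double that is already in %xmm0
leaq LC0(%rip), %rdi   # 1st arg: pointer to the format string
movl $1, %eax          # one FP arg is passed in an XMM register, so %al = 1
call _printf           # varargs call: the callee reads %al to know how many vector regs hold args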
mov $1, %eax avoids false dependencies on the rest of eax (compared to mov $1, %al), so gcc prefers spending the extra instruction bytes on that, even though it's tuning for Core2 (which renames partial registers).
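For reference, the two encodings side by side (a sketch; the byte counts are for these immediate forms):

movb $1, %al     # 2 bytes, but only writes the low byte: depends on (or later merges with) the old value of %eax
movl $1, %eax    # 5 bytes, writes the whole register: no dependency on the old value at all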
Previous answer, before it was clarified that the question was why the mov is done at all, not about its ordering.
IIRC, gcc doesn't do much instruction scheduling for x86, because it's assuming out-of-order execution. I tried to google that, but didn't find the quote from a gcc developer that I seem to remember reading (maybe in a gcc bug report comment).
Anyway, it looks ok to me, unless you're tuning for in-order Atom or P5. If you are, use gcc -O3 -march=atom (which implies -mtune=atom). But you're clearly not doing that, because you used -march=native on a C2Duo, which is a 4-wide out-of-order design with a fairly large scheduler.
> To me, with cpu reordering, and different execution context, this interleaved instruction is useless.
I have no idea what you think the problem is, or what ordering you think would be better, so I'll just explain why it looks good.
I didn't take the time to edit this down to a short answer, so you might prefer to just read Agner Fog's microarch pdf for details of the Core2 pipeline, and skim this answer. See also other links from the x86 tag wiki.
...
call _atof
# xmm0 is probably still not ready when the following instructions issue
pxor %xmm1, %xmm1 # no inputs, so can run any time after being issued.
gcc uses pxor because cvtsi2sd is badly designed, giving it a false dependency on the previous value of the vector register: note how the upper half of the vector register keeps its old value (the result is merged in, not a full-register write). Intel probably designed it this way because the original SSE cvtsi2ss was first implemented on Pentium III, where 128b vectors were handled as two halves. Zeroing the rest of the register (including the upper half) instead of merging probably would have taken an extra uop on PIII.
This short-sighted design choice saddled the architecture with a choice between an extra dependency-breaking instruction and a false dependency. A false dep might not matter at all, or might be a big slowdown if the register used by one function happened to be used for a very long FP dependency chain in another function (maybe including a cache miss).
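To make the false dependency concrete, compare (a sketch; %xmm1 and %ebx just match the registers used below):

cvtsi2sd %ebx, %xmm1   # without dep-breaking: merges into %xmm1, so it can't start until the old %xmm1 value is ready

pxor     %xmm1, %xmm1  # what gcc does: zero %xmm1 first, breaking the dependency
cvtsi2sd %ebx, %xmm1   # now only waits for %ebx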
On Intel SnB-family CPUs, xor-zeroing is handled at register-rename time, so the uop never needs to execute on an execution port; it's already completed as soon as it issues into the ROB. This is true for integer and vector registers.
On other CPUs, the pxor will need an execution port, but it has no input dependencies so it can execute any time there's a free ALU port, after it issues.
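The same zeroing idioms exist for both register files (sketch):

xorl %eax, %eax     # integer xor-zeroing idiom: handled at rename on SnB-family, no execution port needed
pxor %xmm0, %xmm0   # vector xor-zeroing idiom: likewise recognized as independent of the old value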
movl $1, %eax # no input dependencies, can execute any time.
This instruction could be placed anywhere after call atof and before call printf.
cvtsi2sd %ebx, %xmm1 # no false dependency thanks to pxor.
This is a 2-uop instruction on Core2 (Merom and Penryn), according to Agner Fog's tables. That's weird, because cvtsi2ss is 1 uop. (They're both 2 uops on SnB; presumably one uop to move data between the integer and vector register files, and another for the conversion.)
Putting this insn earlier would be good, potentially letting it issue a cycle earlier, since it's part of the longest dependency chain here. (The integer stuff is all simple and trivial.) However, printf has to parse the format string before it will decide to look at xmm0, so the FP instructions aren't actually on the critical path.
It can't go ahead of pxor, and call / pxor / cvtsi2sd would mean pxor decodes by itself that cycle. Decoding will start with the instruction after the call, after the ret in the called function has been decoded (and the return-address predictor predicts the jump back to the insn after the call). Multi-uop instructions have to be the first instruction in a decode block, so the 2-uop cvtsi2sd starts the next group either way; having pxor and mov imm32 decode together in that cycle means less of a decode bottleneck.
leaq LC0(%rip), %rdi # 1 uop
addsd %xmm1, %xmm0 # 1 uop
call _printf # 3 uop insn
cvtsi2sd/lea/addsd can all decode in the same cycle, which is optimal. If the mov imm32 were after the cvt, it could decode in the same cycle as well (since pre-SnB decoders can handle patterns up to 4-1-1-1), but it couldn't have issued as soon.
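Annotating that tail with the decode grouping described here (an assumed grouping following the 4-1-1-1 rule, not a measurement):

cvtsi2sd %ebx, %xmm1      # 2 uops: needs the complex (first) decoder, so it starts a decode group
leaq     LC0(%rip), %rdi  # 1 uop: simple decoder
addsd    %xmm1, %xmm0     # 1 uop: simple decoder
call     _printf          # 3 uops: needs the complex decoder again, so it starts the next group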
If decoding were only barely keeping up with issue, that would mean pxor issues by itself (because no other instructions were decoded yet). Then cvtsi2sd/mov imm/lea (4 uops), then addsd/call (4 uops). (addsd decoded with the previous issue group; Core2 has a short queue between decode and issue to help absorb decode bubbles like this, which is what makes it useful to be able to decode up to 7 uops in a cycle.)
That's not appreciably different from the current issue pattern in a decode-bottleneck situation: (pxor / mov imm) / (cvtsi2sd / lea / addsd) / (call printf).
If decode isn't the bottleneck, I'm not sure whether Core2 can issue a ret or jmp in the same cycle as uops that follow the jump. In SnB-family CPUs, an unconditional jump always ends an issue group: e.g. a 3-uop loop issues as ABC, ABC, ABC, not ABCA, BCAB, CABC.
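For instance, a hypothetical 3-uop loop body (assuming decl/jnz macro-fuse into a single uop, as they do on SnB-family):

.Lloop:                   # hypothetical label
    addsd %xmm1, %xmm0    # uop A
    addl  $1, %ecx        # uop B
    decl  %edx            # decl + jnz macro-fuse into uop C
    jnz   .Lloop          # the jump back ends the issue group, so iterations issue as ABC, ABC, ABC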
Assuming the instructions after the ret issue with a group not including the ret, we'd have
(pxor / mov imm / cvtsi2sd) / (lea / addsd / 2 of call's 3 uops) / (last call uop)
So the cvtsi2sd still issues in the first cycle after returning from atof, which means it can get started executing right away. Even on Core2, where pxor takes an execution unit, the first of the 2 uops from cvtsi2sd can probably execute in the same cycle as pxor; it's probably only the 2nd uop that has an input dependency on the dst register.
(mov imm / pxor / cvtsi2sd) would be equivalent, and so would the slower-to-decode (pxor / cvtsi2sd / mov imm), or getting the lea executed before mov imm.