x86 - instruction interleaving to avoid cpu stall

Question

GCC 6 on an Intel Core 2 Duo. Compilation flags: "-march=native -O3" (-S)

I was compiling a simple program and asked for the assembly output:

Code

movq    8(%rsi), %rdi
call    _atoi
movq    16(%rbp), %rdi
movl    %eax, %ebx
call    _atof
pxor    %xmm1, %xmm1
movl    $1, %eax              # <- this instruction is my problem
cvtsi2sd    %ebx, %xmm1
leaq    LC0(%rip), %rdi
addsd   %xmm1, %xmm0
call    _printf
addq    $8, %rsp

Execution

The program reads and converts an integer argument, then reads and converts a double argument, adds the two, and prints the result.
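For reference, here is a minimal sketch of C source that could plausibly produce a listing like the one above. It is a reconstruction, not the original program: the function names `add_args` and `print_sum` and the `"%f\n"` format string are assumptions made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical reconstruction: convert one string with atoi and one with
   atof, then add them. The int is converted to double before the add,
   which is where cvtsi2sd and addsd come from in the listing. */
double add_args(const char *int_text, const char *double_text)
{
    int i = atoi(int_text);        /* first call in the listing */
    double d = atof(double_text);  /* second call; result is returned in xmm0 */
    return d + i;
}

/* Prints the sum through a variadic call to printf and returns printf's
   result (the number of characters written). The format string is assumed. */
int print_sum(const char *int_text, const char *double_text)
{
    return printf("%f\n", add_args(int_text, double_text));
}
```

Calling `print_sum(argv[1], argv[2])` from `main` would give the same call sequence as the listing: `atoi`, `atof`, then `printf` with one double argument.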

The problem

I perfectly understand that one (the compiler more so) has to avoid cpu stalls as much as possible.

I've shown the offending instruction in the code section above. To me, with cpu reordering, and different execution context, this interleaved instruction is useless.

My rationale is: the chances that we stall are very high anyway, and the cpu will wait for pxor xmm1 to complete before being able to reuse the register in the next instruction. Adding an instruction will just fill the cpu decoder for nothing. The cpu HAS to wait anyway. So why not leave it alone for one instruction?

Moving the pxor before the call to atof does not seem possible, since atof may use xmm1.

Question

Is this a bug, legacy junk (from when cpus were not able to reorder), or something else?

Thanks

EDIT:

I admit my question was not clear: can this instruction be safely removed without performance consequences?

`pxor xmm1, xmm1` is essentially a 0-latency instruction though — harold, Jun 19 '16 at 10:52
You are right, that is even worse! (even if not a big deal all in all) — Larry, Jun 19 '16 at 10:56
Well, the reordering does look fairly useless. But not harmful afaict — harold, Jun 19 '16 at 10:57
If you care so much about micro-optimization, why are you using `printf`? — , Jun 19 '16 at 12:56
@Larry: just write a function that takes args and returns a value. Remember, you don't want to run it, just look at the code to see what the compiler does. e.g. like this on [the Godbolt compiler explorer](https://godbolt.org/g/wwpzM0). BTW, did you remember to `#include `? Because `atoi` compiles to a call to `strtol` with glibc on Linux, unless I leave out the header so it's implicitly declared. — Peter Cordes, Jun 20 '16 at 01:03
@PeterCordes: yes i did add the header file. Godbolt is indeed a very good resource. Thanks Peter. Thanks a lot. — Larry, Jun 21 '16 at 09:29

score 1 · Accepted Answer · edited May 23 '17 at 12:16

The x86-64 System V ABI requires that calls to varargs functions (like printf) set %al to the number of floating-point args passed in xmm registers (strictly, an upper bound on the number of vector registers used). In this case, you're passing one double, so the ABI requires %al = 1. (Fun fact: C's default argument promotions make it impossible to pass a float to a vararg function, because it is always converted to double first. This is why there are no printf conversion specifiers for float, only double.)
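A small sketch of the promotion rule, using a hypothetical variadic helper `sum_doubles` (not part of the original question). The callee can only read the arguments back as double, even when the caller passes a float. On an x86-64 System V target, a non-inlined call like the one in `demo_promotion` would be expected to set %al to 2, since two arguments travel in xmm registers.

```c
#include <stdarg.h>

/* Hypothetical helper: sums `count` variadic arguments, each read back as
   double. Reading them as float would be undefined behaviour, because
   float arguments are promoted to double at the call site. */
double sum_doubles(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

/* Passes one float and one double; both arrive in the callee as double. */
double demo_promotion(void)
{
    float f = 1.5f;
    return sum_doubles(2, f, 2.0);
}
```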

mov $1, %eax avoids a false dependency on the rest of eax (compared to mov $1, %al), so gcc prefers spending extra instruction bytes on that, even though it's tuning for Core2 (which renames partial registers).

Previous answer, before it was clarified that the question was why the `mov` is done at all, not about its ordering.

IIRC, gcc doesn't do much instruction scheduling for x86, because it's assuming out-of-order execution. I tried to google that, but didn't find the quote from a gcc developer that I seem to remember reading (maybe in a gcc bug report comment).

Anyway, it looks ok to me, unless you're tuning for in-order Atom or P5. If you are, use gcc -O3 -march=atom (which implies -mtune=atom). But anyway, you're clearly not doing that, because you used -march=native on a C2Duo, which is a 4-wide out-of-order design with a fairly large scheduler.

> To me, with cpu reordering, and different execution context, this interleaved instruction is useless.

I have no idea what you think the problem is, or what ordering you think would be better, so I'll just explain why it looks good.

I didn't take the time to edit this down to a short answer, so you might prefer to just read Agner Fog's microarch pdf for details of the Core2 pipeline, and skim this answer. See also other links from the x86 tag wiki.

...
call    _atof
   # xmm0 is probably still not ready when the following instructions issue
pxor    %xmm1, %xmm1          # no inputs, so can run any time after being issued.

gcc uses pxor because cvtsi2sd is badly designed, giving it a false dependency on the previous value of the vector register. Note how the upper half of the vector register keeps its old value. Intel probably designed it this way because the original SSE cvtsi2ss was first implemented on Pentium III, where 128b vectors were handled as two halves. Zeroing the rest of the register (including the upper half) instead of merging probably would have taken an extra uop on PIII.

This short-sighted design choice saddled the architecture with the choice between an extra dependency-breaking instruction, or a false dependency. A false dep might not matter at all, or might be a big slowdown if the register used by one function happened to be used for a very long FP dependency chain in another function (maybe including a cache miss).
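The merging behaviour can be observed from C through the SSE2 intrinsic `_mm_cvtsi32_sd`, which has cvtsi2sd semantics. This is a sketch with hypothetical helper names, and it assumes an x86 target with SSE2. It shows only the architectural result (the upper half is preserved), not the timing effect of the false dependency.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 only */

/* Converts `value` into the low lane of a vector that previously held
   {old_low, old_high}. The high lane keeps its old value, which is the
   merge that creates the dependency on the destination register.
   out[0] receives the low lane, out[1] the high lane. */
void convert_merging(int value, double old_low, double old_high, double out[2])
{
    __m128d old = _mm_set_pd(old_high, old_low);  /* arguments are (high, low) */
    __m128d merged = _mm_cvtsi32_sd(old, value);
    _mm_storeu_pd(out, merged);
}

/* Same conversion, but into a freshly zeroed register, which mirrors the
   pxor + cvtsi2sd pair in the compiler output. */
void convert_zeroed(int value, double out[2])
{
    __m128d zero = _mm_setzero_pd();
    _mm_storeu_pd(out, _mm_cvtsi32_sd(zero, value));
}
```

With `convert_merging(7, 1.0, 42.0, out)` the high lane still holds 42.0 after the conversion; with `convert_zeroed(7, out)` it holds 0.0.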

On Intel SnB-family CPUs, xor-zeroing is handled at register-rename time, so the uop never needs to execute on an execution port; it's already completed as soon as it issues into the ROB. This is true for integer and vector registers.

On other CPUs, the pxor will need an execution port, but has no input dependencies so it can execute any time there's a free ALU port, after it issues.

movl    $1, %eax             # no input dependencies, can execute any time.

This instruction could be placed anywhere after call atof and before call printf.

cvtsi2sd    %ebx, %xmm1       # no false dependency thanks to pxor.

This is a 2 uop instruction on Core2 (Merom and Penryn), according to Agner Fog's tables. That's weird because cvtsi2ss is 1 uop. (They're both 2 uops in SnB; presumably one uop to move data between integer and vector, and another for the conversion).

Putting this insn earlier would be good, potentially letting it issue a cycle earlier, since it's part of the longest dependency chain here. (The integer stuff is all simple and trivial). However, printf has to parse the format string before it will decide to look at xmm0, so the FP instructions aren't actually on the critical path.

It can't go ahead of pxor, and call / pxor / cvtsi2sd would mean pxor would decode by itself that cycle. Decoding will start with the instruction after the call, after the ret in the called function has been decoded (and the return-address predictor predicts the jump back to the insn after the call). Multi-uop instructions have to be the first instruction in a block, so having pxor and mov imm32 decode that cycle means less of a decode bottleneck.

leaq    LC0(%rip), %rdi        # 1 uop
addsd   %xmm1, %xmm0           # 1 uop
call    _printf                # 3 uop insn

cvtsi2sd/lea/addsd can all decode in the same cycle, which is optimal. If the mov imm32 was after the cvt, it could decode in the same cycle as well (since pre-SnB decoders can handle up to 4-1-1-1), but it couldn't have issued as soon.

If decoding was only barely keeping up with issue, that would mean pxor would issue by itself (because no other instructions were decoded yet). Then cvtsi2sd/mov imm/lea (4 uops), then addsd / call (4 uops). (addsd decoded with the previous issue group; core2 has a short queue between decode and issue to help absorb decode bubbles like this, and make it useful to be able to decode up to 7 uops in a cycle.)

That's not appreciably different from the current issue pattern in a decode-bottleneck situation: (pxor / mov imm) / (cvtsi2sd/lea/addsd) / (call printf)
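The decode grouping can be sketched with a toy model of a 4-1-1-1 decoder. This is a deliberate simplification, not a Core2 simulator: it assumes every instruction is at most 4 uops, that only the first slot can take a multi-uop instruction, and it ignores fetch limits, macro-fusion and the queue between decode and issue. The function name and the uop counts passed to it are illustrative.

```c
#include <stddef.h>

/* Toy 4-1-1-1 decode model: each cycle, slot 0 takes the next instruction
   whatever its uop count (assumed <= 4), then up to three more instructions
   are taken only if each is a single uop. Returns the number of decode
   cycles needed for `n` instructions with the given uop counts. */
size_t decode_cycles(const int *uops, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    while (i < n) {
        i++;       /* slot 0: complex decoder */
        cycles++;
        /* slots 1..3: simple decoders, single-uop instructions only */
        for (int slot = 1; slot < 4 && i < n && uops[i] == 1; slot++)
            i++;
    }
    return cycles;
}
```

Under this model, gcc's order (pxor, mov, cvtsi2sd, lea, addsd, call with uop counts 1, 1, 2, 1, 1, 3) and the alternative with mov after cvtsi2sd (1, 2, 1, 1, 1, 3) both take three decode cycles, which matches the grouping described above.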

If decode isn't the bottleneck, I'm not sure if Core2 can issue a ret or jmp in the same cycle as uops that follow the jump. In SnB-family CPUs, an unconditional jump always ends an issue group. e.g. a 3-uop loop issues ABC, ABC, ABC, not ABCA, BCAB, CABC.

Assuming the instructions after the ret issue with a group not including the ret, we'd have

(pxor/mov imm/cvtsi2sd), (lea / addsd / 2 of call's 3 uops) / (last call uop)

So the cvtsi2sd still issues in the first cycle after returning from atof, which means it can get started executing right away. Even on Core2, where pxor takes an execution unit, the first of the 2 uops from cvtsi2sd can probably execute in the same cycle as pxor. It's probably only the 2nd uop that has an input dependency on the dst register.

(mov imm / pxor / cvtsi2sd) would be equivalent, and so would the slower-to-decode (pxor / cvtsi2sd / mov imm), or getting the lea executed before mov imm.

Thanks Peter, I read those resources already. I also have agner's bible right close to me:) My issue was the existence of the instruction "movl $1, %eax", at all. It seems useless and even if it doesn't cost much, it is still some useless work. Isn't it? — Larry, Jun 19 '16 at 12:39
@Larry: sigh, I wish your question was more clear that you thought the instruction was not needed at all, rather than stuff about reordering. Updated with why it's needed. Especially your conversation with harold in comments made me think you were just asking about the ordering. Now that I know what you're asking, it's clear, but before that it sounded like you thought the reordering was useless. — Peter Cordes, Jun 19 '16 at 12:54
Also, if you're familiar with Agner Fog's microarch PDF, why do you think a Core2 would stall running this code? Like I pointed out, `printf` won't use its `xmm0` right away, so it's not on the critical path. There's definitely nothing that will stall the whole pipeline. (The whole point of out-of-order execution is to keep the execution units busy and hide the latency of short parallel dependency chains.) Having one instruction retire later than others is *not* a stall. — Peter Cordes, Jun 19 '16 at 13:00
You are right, my expression was bad. Actually, it won't stall since it is decoded in a different port; but this instruction is useless, right? — Larry, Jun 19 '16 at 16:29
@Larry: gcc doesn't emit useless instructions, other than NOP for padding for alignment. See my edit that added a new first couple of paragraphs. — Peter Cordes, Jun 19 '16 at 19:38
Years later: [Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?](https://stackoverflow.com/q/60688348) is an example of the false dependency causing a problem, and my answer there is (hopefully) a better version of the explanation here of the problems caused by that short-sighted SSE design decision for PIII. — Peter Cordes, Apr 30 '21 at 12:07
