
While fiddling with simple C code, I noticed something strange. Why does ICC produce incl %eax in the assembly generated for an increment instead of addl $1, %eax? GCC behaves as expected though, using add.

Example code (-O3 used on both GCC and ICC)

int A, B, C, D, E;

void foo()
{
    A = B + 1;
    B = 0;
    C++;
    D++;
    D++;
    E += 2;
}

Result on ICC

L__routine_start_foo_0:
foo:
    movl      B(%rip), %eax                                 #5.13
    movl      D(%rip), %edx                                 #8.9
    incl      %eax                                          #5.17
    movl      E(%rip), %ecx                                 #10.9
    addl      $2, %edx                                      #9.9
    addl      $2, %ecx                                      #10.9
    movl      %eax, A(%rip)                                 #5.9
    movl      $0, B(%rip)                                   #6.9
    incl      C(%rip)                                       #7.9
    movl      %edx, D(%rip)                                 #9.9
    movl      %ecx, E(%rip)                                 #10.9
    ret   

For example, see here.

As such, I'm wondering - is this an intended feature, a bug or some quirk resulting from some specific setting? If add is (supposedly) better due to flags update or efficiency (which is the conclusion based on the links below) - why does ICC use inc?

Related:

Relative performance of x86 inc vs. add instruction

Is ADD 1 really faster than INC ? x86

GCC doesn't make use of inc

Note:

I'm asking this question explicitly because none of the questions I found or was directed to on SO explains this behaviour. My previous question concerning this matter got closed because, supposedly, it's trivial and has been answered. I don't find it trivial, and I didn't find an answer in any of the links and answers given. It's not another "how to plug my mouse into my PC" problem. All of those questions explain why add is/could be better on new x86 processors or why GCC uses it, but none concerns ICC.

Any insight on ICC design choices would be also very welcome.

PS I don't consider "it does it because it does" a valid answer.

Community
  • 1
    It would help if you included the C source and assembly listing in your question. Why would `inc` be incorrect? C source code specifies *behavior*; as long as the program behaves correctly, it doesn't matter (as far as the C standard is concerned) what instructions are used to achieve that behavior. But yes, there might be reasons to prefer `addl` over `incl`; it would also be helpful if you'd cite some sources that explain why – Keith Thompson Sep 15 '14 at 17:24
  • 1
    Don't keep the compiler options you selected a secret. You already know that optimizing for size makes INC likely to be used. – Hans Passant Sep 15 '14 at 17:34
  • The flags issue is only an issue if you follow it up with a branch that reads the carry flag, which is unlikely. – harold Sep 15 '14 at 17:36
  • 1
    As I understand it, icc is closed source, which might make it very difficult to get any insight about the design choices that went into it. – Keith Thompson Sep 15 '14 at 17:44
  • BTW, along with `inc`, `dec` is preferred over `sub`. Does anyone know whether `icc` must apply that operation as reported in the document? – edmz Sep 15 '14 at 18:13
  • 1
    Sandy Bridge was the first processor that did something about the partial flag stall that INC suffers from. It still isn't clear what micro-architecture is being targeted. – Hans Passant Sep 15 '14 at 18:42
  • 3
    [`inc/dec` was slow on P4, but not on anything else](http://stackoverflow.com/a/36510865/224132). `inc` itself doesn't cause a partial flag stall, only reading `CF` after an `inc`. – Peter Cordes Apr 09 '16 at 00:05
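
The flag point raised in the comments above can be illustrated with a minimal sketch. It models only the carry and zero flags for a 32-bit register; the names `flags_t`, `model_add1` and `model_inc` are hypothetical and exist only for this illustration. It shows the architectural difference (`add` writes CF, `inc` leaves CF as it was) and says nothing about timing on any particular microarchitecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of just the two flags relevant to this discussion. */
typedef struct {
    bool cf; /* carry flag */
    bool zf; /* zero flag  */
} flags_t;

/* Models `addl $1, reg`: the result wraps modulo 2^32, CF and ZF are both written. */
uint32_t model_add1(uint32_t v, flags_t *f)
{
    uint32_t r = v + 1u;
    f->cf = (r < v);   /* carry out of bit 31 happens only on wraparound */
    f->zf = (r == 0);
    return r;
}

/* Models `incl reg`: same arithmetic result, ZF is written, CF is left untouched. */
uint32_t model_inc(uint32_t v, flags_t *f)
{
    uint32_t r = v + 1u;
    f->zf = (r == 0);
    return r;
}
```

Because `model_inc` preserves whatever CF held before, code that reads CF afterwards depends on an older instruction as well as on the `inc`; that dependency is what the comments refer to when they mention partial flag stalls.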

1 Answer


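It is not unreasonable to assume at this point that incl was selected because its encoding is shorter: in 64-bit code, incl %eax takes two bytes (0xff 0xc0) instead of three for addl $1, %eax (0x83 0xc0 0x01). The one-byte form (0x40) exists only in 32-bit code.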

Sparky
  • 1
    Perhaps some Intel processors are fast on `incl` (while AMD could be slower?)? – Basile Starynkevitch Sep 15 '14 at 17:44
  • There was a time when lea eax,[1+eax] was faster, but that may have only been true on 16 bit processor so only lea ax,[1+ax]. – rcgldr Sep 16 '14 at 01:45
  • The one-byte inc/dec has been deprecated in x86_64 (which is the case of the OP), because those opcodes were used for [REX prefix](http://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix) – phuclv Mar 06 '15 at 18:21
  • `inc eax` must be encoded as `FF C0` in x86_64, which is not 1 byte, but still shorter than `add eax, 1` – phuclv Mar 06 '15 at 18:33
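
To make the size argument from the answer and the last two comments concrete, here is a small sketch that records the 64-bit encodings as byte arrays and compares their lengths. The identifiers (`INC_EAX_64`, `ADD_EAX_1`, `is_rex_prefix`, `bytes_saved_by_inc`) are made up for this illustration, and the `add` bytes assume the assembler picks the short sign-extended 8-bit immediate form.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Machine-code bytes for the two ways of adding 1 to %eax in 64-bit mode. */
const uint8_t INC_EAX_64[] = { 0xFF, 0xC0 };        /* incl %eax     */
const uint8_t ADD_EAX_1[]  = { 0x83, 0xC0, 0x01 };  /* addl $1, %eax */

/* In 64-bit mode the bytes 0x40..0x4F are REX prefixes, so the old
 * one-byte inc/dec opcodes are not available there. */
bool is_rex_prefix(uint8_t byte)
{
    return (byte & 0xF0) == 0x40;
}

/* Number of code bytes saved by choosing inc over add for one increment. */
size_t bytes_saved_by_inc(void)
{
    return sizeof ADD_EAX_1 - sizeof INC_EAX_64;
}
```

Under these assumptions the saving is one byte per register increment, which supports the code-size explanation but does not by itself show that this is why ICC chooses `inc`.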