
While examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to emit a NEG+ADD pair where other compilers would use a single SUB instruction.

As a simple example, consider the following C code:

#include <stdint.h>

uint64_t Mod3(uint64_t value)
{
    return (value % 3);
}

ICC translates this to the following machine code (regardless of optimization level):

mov       rcx, 0xaaaaaaaaaaaaaaab
mov       rax, rdi
mul       rcx
shr       rdx, 1
lea       rsi, QWORD PTR [rdx+rdx*2]
neg       rsi                            ; \  equivalent to:
add       rdi, rsi                       ; /    sub  rdi, rsi
mov       rax, rdi
ret         
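
For reference, what this sequence computes is the standard "multiply by the reciprocal" trick for division by a constant: 0xaaaaaaaaaaaaaaab is ceil(2^65 / 3), so the MUL and SHR produce value / 3, the LEA forms 3 * (value / 3), and the NEG+ADD subtract that back from the original value. Here is a rough C equivalent of what the generated code computes (just an illustrative sketch assuming a compiler with the unsigned __int128 extension; Mod3_sketch is a made-up name):

#include <stdint.h>

uint64_t Mod3_sketch(uint64_t value)
{
    // high 64 bits of the 128-bit product value * 0xaaaaaaaaaaaaaaab   (the MUL)
    uint64_t hi = (uint64_t)(((unsigned __int128)value * 0xaaaaaaaaaaaaaaabULL) >> 64);
    uint64_t quotient = hi >> 1;        // shr rdx, 1       ->  value / 3
    return value - (quotient * 3);      // lea + neg + add  ->  value % 3
}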

Other compilers (including MSVC, GCC, and Clang) all generate essentially equivalent code, except that the NEG+ADD sequence is replaced by a single SUB instruction.

Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly of arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler, and it is developed by folks who have insider information about Intel's own microprocessors.

Could there be something Intel knows about the implementation of the SUB instruction on their processors that makes it better to decompose it into NEG+ADD instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB is broken down internally into separate NEG and ADD µops, and that it is actually more efficient for the front-end decoder to be fed these "simpler" instructions? Modern CPUs are complicated, so anything is possible.

Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization: SUB is just as efficient as ADD on all processors, so the extra NEG instruction only serves to slow things down.

I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:

SUB
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.2 |           |           |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.3 |           |           |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 7

NEG+ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.9 |           |           |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0       |     |           |           |     |     |     |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.1 |           |           |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 8

So, as far as I can tell, NEG+ADD increases the code size, increases the number of µops, increases the pressure on the execution ports, and increases the number of cycles, resulting in a net decrease in throughput compared to SUB. So why is Intel's compiler doing this?

Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?

Cody Gray - on strike
  • this may end up being primarily opinion based. it could be as simple as there was a core with a bug one time...or a core with a performance feature/quirk...maybe it helps the pipeline. and I would likely expect whatever IT is to be specific to one or at least not all cores. helps one/some hurts on others. – old_timer Jun 02 '17 at 14:39
  • I wonder. What if you replaced the `add rdi, rsi; mov rax, rdi` with `lea rax, [rdi + rsi]`? – EOF Jun 02 '17 at 16:08
  • What makes this even more interesting is that multiple generations of the Intel compiler seem to be doing it (Cody Gray seems to be using the 2016 version, while I use the 2013 version), and it seems to happen independent of the specified architecture target, making it unlikely that it is used to work around a bug in an older architecture. Even in an OOO machine, it usually is not a good idea to lengthen dependency chains, as happens here. Might be a good idea to report this to Intel as a performance bug, and see what they come back with. – njuffa Jun 02 '17 at 16:11
  • @EOF That's a slightly different question, and the answer would probably depend on which microarchitecture you are optimizing for. On Haswell and later, where you have move-elision thanks to register renaming, it's very likely that `ADD`+`MOV` is "cheaper" than the `LEA` because `ADD` can be issued on more execution ports than `LEA`, and the `MOV` doesn't even require an execution port. On older microarchitectures, it would be a bit harder for me to say without a benchmark, since `ADD` can still run on more ports than `LEA`, but the `MOV` requires bandwidth, too. – Cody Gray - on strike Jun 03 '17 at 11:34
  • Anyway, that isn't what Intel's compiler is doing, so that's somewhat unrelated to this question. If you'd like to see more of a discussion about it, you could post a new question. . . . As for old_timer's comment, yes, I fully expect the answer to be somewhat obscure, as I've clearly already done my basic due diligence before asking. But that certainly doesn't make the question "opinion based". And like njuffa said, I'm seeing this on ICC for *all* architecture targets, so it does seem quite unlikely that this is working around a bug in or quirk of one particular generation. – Cody Gray - on strike Jun 03 '17 at 11:36
  • Perhaps it is to harm AMD. Yes this code is slower than optimal on Intel, but it hurts AMD more, thus making Intel processors look good in comparison. The ICC is full of code that actively disables optimizations if it detects a non-Intel processor. – Johan Jun 03 '17 at 12:54
  • Do you have examples where the compiler *does* use `sub`? If you write in C a long chain of dependent subtractions (making the code obviously subtract-throughput-limited), does it still use `neg`? – Iwillnotexist Idonotexist Jun 05 '17 at 18:08
  • At any rate, obligatory Godbolt link: https://godbolt.org/g/O17x7X – Iwillnotexist Idonotexist Jun 05 '17 at 18:25
  • @iwill Sure, there are cases where it uses `sub`, and I've seen 'em before. Test code doesn't immediately come to mind. See, the issue is that ICC will use `sub` for memory ops, like you see [here](https://godbolt.org/g/ESGlWv) (also unrolled; use -O1 to turn off), which makes sense, since a single `sub` with a memory operand is faster than a load-neg-add-store chain. The trick is going to be writing a long chain of dependent subtractions where everything is enregistered. That's the case where you might see it use neg+add. Like I said, one doesn't immediately come to mind; I'll think about it. – Cody Gray - on strike Jun 06 '17 at 05:42
  • It turned out to be a lot easier than I was thinking… ICC uses `sub` in obvious cases like subtracting two parameters. I can't remember any other good examples of complex code where I've seen it generate `neg`+`add` instead of a simple `sub`, but [here are some cases to consider](https://godbolt.org/g/hzCMMi). – Cody Gray - on strike Jun 06 '17 at 09:56
  • [See these cases](https://godbolt.org/g/WdVWcJ). ICC 17 refuses to elide the subtracts in `K: c + (-c - a*b)`, `M: (-c - a*b) + c` and `N: -c - a*b + c` but does optimize out the subtracts in `L: c + -c - a*b`, `O: (- a*b -c ) + c` and `P: - a*b -c + c`. Out of this I conclude that ICC fixates on the C order of operations: It won't commute integer subtracts (M,N), and if the addition of parentheses changes the order, an optimization can be disallowed (K,L). Particularly fascinating to me is (M,O): M has `-c` first in the parens and doesn't optimize, while O has `-a*b` first and does optimize. – Iwillnotexist Idonotexist Jun 06 '17 at 16:24
  • ICC is very good at auto-vectorizing, especially hard-to-vectorize stuff that gcc and clang don't manage to vectorize at all. But when looking at small functions like you're doing here, ICC often makes worse scalar integer code than gcc/clang do. I haven't benchmarked ICC's code on real programs, so maybe it makes up for often-sub-optimal code-gen with good IPO or something, as well as good auto-vectorization. – Peter Cordes Jun 30 '17 at 08:44
  • Anyway, I think this is just a missed-optimization. ICC doesn't magically produce perfect asm for every input. I'm pretty sure most of the time when ICC output looks sub-optimal according to Agner Fog's manuals and Intel's optimization guide, it really is sub-optimal. – Peter Cordes Jun 30 '17 at 08:51
  • Yup, for what it's worth, looking at a lot of (mostly integer) snippets over the last couple of years the general order of code quality seems to be `clang` > `gcc` >> `icc`. Sometimes `gcc` makes better code than `clang` and often they are close to tied. In many simple cases all three are tied, but with almost anything big and complex, `icc` often produces code that just doesn't compete, sometimes twice as slow. Yet you still hear about `icc` being the fastest compiler around and people buying and recompiling with `icc` just to get that magic. – BeeOnRope Jul 23 '17 at 06:58
  • ... and that's not necessarily wrong: perhaps `icc` really does have some magic that overcomes all the various issues you see in codegen, such as better behavior on larger projects, better vectorization, more aggressive use of instruction set extensions, more finely tuned runtimes, etc. For example, a win on typical `memcpy` and `str*` function, and inlined copy generation might go a long way to canceling out a whole ton of scattered bad decisions elsewhere. – BeeOnRope Jul 23 '17 at 07:00
  • I am not familiar enough with this generation processor to give a definitive answer, but I used to write video games in assembler and there were many reasons to do optimizations that were not intuitive. Avoiding failed branch predictions, setting up for delayed branch instructions, aligning code in a cache etc. There are a lot of tricks you can do that spend a cycle or two to save a lot more. Or it could be a bug. – MEHM- Nov 29 '17 at 23:52
  • From an EE perspective, neg + add is sometimes how the bits are actually subtracted using gates. It may be that (due to the processor being CISC) the two calls are actually speed-identical in the Intel ALU, and there is no reason to change working code. Especially if that working code runs slower on a competitive processor. Remember that Intel knows a lot about multi threading, which means separate clocks in separate cores and a branch using a block like this may even combine these two instructions into one tick. If they've patented that functionality, all the more reason to use it. – Dylan Brams Dec 20 '17 at 22:15
  • also have a look at https://web.itu.edu.tr/kesgin/mul06/intel/instr/sub.html – nullqube Mar 23 '18 at 06:37

1 Answer


Strangely, I have a simple answer: because ICC isn't optimal.

When you write your own compiler, you start with a very basic set of operation codes: NOP, MOV, ADD... maybe ten opcodes in all. You don't use SUB for a while, because it can easily be replaced by ADDing the NEGated operand. NEG isn't basic either, since it can be replaced by XOR FFFF... followed by ADD 1.
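
For instance, here is the two's-complement identity behind that replacement, as a quick sanity check in C (purely illustrative, not anything a compiler actually emits):

#include <stdint.h>
#include <assert.h>

int main(void)
{
    // two's complement: -x == (x XOR all-ones) + 1, i.e. NOT followed by ADD 1
    uint64_t x = 0x123456789abcdef0ULL;
    assert((uint64_t)0 - x == (x ^ 0xffffffffffffffffULL) + 1);
    return 0;
}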

So you implement the rather complex bit-level encoding of operand types and sizes once, for a single machine instruction (e.g. ADD), and plan to reuse it for most other instructions. But by this time your co-worker has finished implementing an optimal calculation of the remainder that doesn't use SUB! Imagine: it's already called "Optimal_Mod", so you miss the suboptimal sequence inside it, not because you're a bad guy who hates AMD, but simply because you see it's already called optimal, optimized.

The Intel compiler is pretty good in general, but it has a long version history, so it can behave strangely in some rare cases. I suggest you report this issue to Intel and see what happens.

kay27
  • Your first paragraph is right, but ICC isn't so simplistic that your 2nd paragraph is plausible. (Also, `xor` with -1 is ones-complement negation (`not`), not 2's complement (`neg`). `x ^ 0xFFFFFFFF = -x - 1`; so `not`+`inc` emulates `neg`). Your 3rd paragraph is more plausible: that this might be part of a "canned" sequence for modulo by a constant and doesn't get optimized the normal way. – Peter Cordes Apr 24 '18 at 22:35
  • @PeterCordes Thank you, I edited my answer and added the missing `ADD 1` for `NEG`. I agree, ICC isn't simplistic - it is highly complicated. But it was also started from scratch once upon a time. – kay27 Apr 25 '18 at 05:41
  • It was almost certainly designed from the ground up to be an optimizing compiler, and didn't necessarily ever have a version that didn't transform through some internal representations of the program flow. Also, I'd guess it'd be easier to emit single instructions like `sub` for single operations like subtraction. They already need a built-in assembler that supports `sub`, so there's no reason to avoid emitting it in favour of some canned neg/add sequence; especially because that destroys the source operand. – Peter Cordes Apr 25 '18 at 06:02
  • A canned sequence of instructions that's expanded too late for the normal optimizer is the only sane explanation, but that pre-made sequence might be in some internal representation, not directly as x86 instructions. I can't imagine they wouldn't have fixed this by now if there's a literal `neg`/`add` somewhere in the compiler's recipe for doing this operation, unless nobody's ever reported it. – Peter Cordes Apr 25 '18 at 06:07
  • @PeterCordes I wonder, did you ever use % in time-critical loops? I did, but I didn't use ICC to build it :) I suppose it's a rarity and not such a meaningful loss in most cases. But it is still a loss, definitely. – kay27 Apr 25 '18 at 06:24
  • Most libc implementations convert integers to strings with `digit = total % 10;` and `total /= 10;` This is a pretty critical operation, especially with more and more complex software serializing data as text. [This is what glibc's `_itoa_word` does](https://code.woboq.org/userspace/glibc/stdio-common/_itoa.c.html#_itoa_word), for internal use by `printf` and other functions. Note the special-casing of base 16, 10, and 8 so the compiler will make special loops for those bases as compile-time constants, even though callers pass the base as an integer. – Peter Cordes Apr 25 '18 at 06:38
  • I even used a SIMD version of that in something I wrote for fun, to turn an AVX2 vector of xorshift128+ results into ASCII decimal digits: [What's the fastest way to generate a 1 GB text file containing random digits?](//unix.stackexchange.com/a/324520). (And yes, I used an actual non-overloaded `%` operator in the C99 source, with GNU C native vectors to get gcc to work out the magic constants, but Intel intrinsics for the other SIMD stuff :) – Peter Cordes Apr 25 '18 at 06:40
  • @PeterCordes Data serialization commonly means that the data comes from somewhere and, after being serialized, goes out to some device. Even if the CPU doesn't do the input/output directly, the I/O takes significantly more time than one suboptimal `NEG` operation... Nevertheless, there's no reason for Intel not to fix the issue. – kay27 Apr 25 '18 at 07:57
  • Modern bloated software often serializes between different processes on the same machine, passing data through memory. Or maybe even within the same process, no I/O involved at all, if it's cobbled together out of high-level library APIs. I think it's not uncommon for integers to get converted to strings. That was just one major example; indexing a non-power-of-2 hash table would be another case. (Err, but that's only if it's fixed-size. A naive implementation would likely end up with the compiler only seeing a runtime variable). Depends how tight a loop you want as an example :P – Peter Cordes Apr 25 '18 at 08:03