Is ADD 1 really faster than INC ? x86

Question

I have read various optimization guides that claim ADD 1 is faster than using INC in x86. Is this really true?

@A.Webb because it depends on the microarchitecture and the context. He'd have to do complicated tests on a lot of different cpu's. Why do that if you can just ask? — harold, Nov 14 '12 at 17:03
@harold: If it's between him testing it and us testing it to write an answer for him on Stack Overflow, I choose him doing it. — Lightness Races in Orbit, Nov 14 '12 at 17:29
@LightnessRacesinOrbit this isn't an "either the OP benchmarks this or we do" situation. This information already exists, so we don't have to test anything. Also, the OP probably couldn't test this himself. — harold, Nov 14 '12 at 17:36
@harold: to be fair, everyone can test this themselves. The only required materials are an x86 machine, an assembler and a stopwatch. Crafting an instruction stream to exhibit the difference requires a little creativity, but it's not rocket science (for that matter, *rocket science* isn't rocket science). — Stephen Canon, Nov 14 '12 at 17:43
@StephenCanon that's only easy if you *already* know why there is/could be a difference. — harold, Nov 14 '12 at 17:49
Life is full of challenges. I prefer to see people tackling those challenges... or at least taking a crack at it. — Lightness Races in Orbit, Nov 14 '12 at 18:59
It's not easy to test this. There are a lot of situational conditions that can affect the results. I was hoping for someone with a lot of experience with different microarchitectures to explain their practical knowledge about the subject. — Tyler Durden, Nov 14 '12 at 19:07
@TylerDurden: if it's still unclear after reading my short answer, I would encourage you to download Intel's Optimization Manual and read the relevant sections; it would take a lot of work to answer the question any more clearly than the manual does. — Stephen Canon, Nov 14 '12 at 20:27
Really guys, this is a hard one. If it was "add vs and" or something like that then sure, anyone could figure it out. But this is altogether different. Most people are just going to throw an `inc` and an `add` in a loop and they would conclude there is no difference. And there would be no indication that the answer was inaccurate. — harold, Nov 14 '12 at 22:59
@harold: no doubt; it took me a good 3 or 4 hours to figure out what was going on when I first encountered this stall (writing a bignum addition routine). — Stephen Canon, Nov 15 '12 at 14:48
Closing this question because somebody posted a similar question 4 years later is pretty bogus. My question was the first on the subject and states the problem clearly. The answers to my question are more or less conclusive. If anything the OTHER question should be closed, not mine. — Tyler Durden, Mar 22 '21 at 23:43

score 34 · Accepted Answer · answered Nov 14 '12 at 16:58

On some micro-architectures, with some instruction streams, INC will incur a "partial flags update stall" (because it updates some of the flags while preserving the others). ADD sets the value of all of the flags, and so does not risk such a stall.

ADD is not always faster than INC, but it is almost always at least as fast (there are a few corner cases on certain older micro-architectures, but they are exceedingly rare), and sometimes significantly faster.

For more details, consult Intel's Optimization Reference Manual or Agner Fog's micro-architecture notes.
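
The flag behavior described above can be modeled in a few lines of C. This is only a minimal sketch for illustration: the `Flags` struct and the `model_add1` / `model_inc` helpers are hypothetical names, not part of any library, and the model tracks just CF and ZF. It shows the architectural difference (ADD writes CF, INC preserves it) and says nothing about timing on any particular CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified flag state: only CF and ZF are modeled (hypothetical type). */
typedef struct {
    bool cf; /* carry flag */
    bool zf; /* zero flag */
} Flags;

/* Models ADD r32, 1: every arithmetic flag is rewritten, including CF. */
static uint32_t model_add1(uint32_t x, Flags *f) {
    uint32_t r = x + 1u;
    f->cf = (r < x);  /* unsigned wraparound means a carry out */
    f->zf = (r == 0);
    return r;
}

/* Models INC r32: ZF is rewritten, but CF keeps its previous value. */
static uint32_t model_inc(uint32_t x, Flags *f) {
    uint32_t r = x + 1u;
    f->zf = (r == 0); /* f->cf is deliberately left alone */
    return r;
}

/* Small self-check: a carry set earlier survives INC but not ADD. */
static void flags_model_demo(void) {
    Flags f = { .cf = true, .zf = false };
    model_inc(5u, &f);
    assert(f.cf);   /* old CF preserved */
    model_add1(5u, &f);
    assert(!f.cf);  /* CF overwritten by ADD */
}
```

Because INC leaves CF holding an older result, a later instruction that reads the flags may depend on two different producers. On the affected micro-architectures, that is the situation the stall refers to.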

Stephen Canon

Today anyways, was quite different back when I started programming. Back then INC was faster. :-) – Brian Knoblauch Nov 14 '12 at 20:24
When P4 was current, `add` was preferred. Now that P4 is more or less dead and buried, `inc` is preferred in most cases because it's shorter, and runs at the same speed as `add`. If you want to avoid modifying the carry flag, use `lea reg, [reg+1]` to not modify *any* flags, avoiding the dreaded partial-flag stall. Or if possible, avoid doing the increment between the flag producer and flag consumer. AMD K8 through Steamroller, and Intel P6 / Sandybridge families all track flag dependencies separately for different flag bits. e.g. CF is tracked by itself, to avoid false deps like with `inc` – Peter Cordes Oct 28 '15 at 07:17
Update: Intel since Skylake (maybe also Broadwell) never merges FLAGS, CF and the other flags (SPAZO) are simply read as 2 separate inputs by instructions like `cmovbe` that need both. Most cmov instructions are 1 uop, but those that need both parts of EFLAGS are still 2 uops on modern Intel. (See @BeeOnRope's answer on [What is a Partial Flag Stall?](https://stackoverflow.com/q/49867597)). But this means that `inc`/`dec` are fully efficient even in ADC loops; no flag-merging so no advantage to `lea reg, [reg+1]`. – Peter Cordes Nov 16 '20 at 17:04
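
For readers unfamiliar with the ADC loops mentioned in these comments: they are typically multi-word ("bignum") additions, where the carry out of one word feeds the next. The C sketch below is an assumption-laden illustration, not code from the thread; `bignum_add` and its parameters are hypothetical names. In C the carry is an ordinary variable, but in hand-written assembly it lives in CF between `adc` instructions, so the loop counter update sits between a flag producer and a flag consumer. `add reg, 1` there would overwrite CF, whereas `inc` or `lea` leaves it intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * dst = a + b over n 64-bit limbs, least significant limb first.
 * Returns the carry out of the top limb (0 or 1).
 */
static unsigned bignum_add(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, size_t n) {
    unsigned carry = 0;                 /* plays the role of CF */
    for (size_t i = 0; i < n; i++) {    /* i++ is the counter update */
        uint64_t s = a[i] + carry;
        unsigned c1 = (s < a[i]);       /* wrapped when adding the carry */
        uint64_t t = s + b[i];
        unsigned c2 = (t < s);          /* wrapped when adding b[i] */
        dst[i] = t;
        carry = c1 | c2;                /* at most one of them can be set */
    }
    return carry;
}

/* Small self-check: a carry ripples from limb 0 into limb 1. */
static void bignum_add_demo(void) {
    const uint64_t a[2] = { UINT64_MAX, 0 };
    const uint64_t b[2] = { 1, 0 };
    uint64_t sum[2];
    unsigned carry_out = bignum_add(sum, a, b, 2);
    assert(sum[0] == 0 && sum[1] == 1 && carry_out == 0);
}
```

The value of `carry` must survive from one iteration to the next, which is why the choice of increment instruction matters in this kind of loop and not in a plain counting loop.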

score 6 · Answer 2 · answered May 08 '16 at 02:17

This isn't a definitive answer, but here is a simple experiment. Write this C file:

=== inc.c ===
#include <stdio.h>
int main(int argc, char *argv[])
{
    for (int n = 0; n < 1000; n++) {
        printf("%d\n", n);
    }
    return 0;
}

Then run:

clang -march=native -masm=intel -O3 -S -o inc.clang.s inc.c
gcc -march=native -masm=intel -O3 -S -o inc.gcc.s inc.c

Note the generated assembly code. Relevant clang output:

mov     esi, ebx
call    printf
inc     ebx
cmp     ebx, 1000
jne     .LBB0_1

Relevant gcc output:

mov     edi, 1
inc     ebx
call    __printf_chk
cmp     ebx, 1000
jne     .L2

This suggests that the authors of both clang and gcc consider INC a better choice than ADD reg, 1 on modern architectures.

What would that mean for your question? Well, I would trust their judgement over the guides you have read and conclude that INC is just as fast as ADD and that the one byte saved due to the shorter register encoding makes it preferable. Compiler authors are just people so they can be wrong, but it is unlikely. :)
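
The one-byte saving can be checked by writing out the machine code. The sketch below lists the 64-bit mode encodings as byte arrays; the array and function names are hypothetical, and the bytes were written by hand from the instruction encoding rules, so treat them as an assumption to confirm with your own assembler or disassembler.

```c
#include <assert.h>
#include <stddef.h>

/* 64-bit mode encodings, written out by hand (verify with your assembler). */
static const unsigned char INC_EBX[]   = { 0xFF, 0xC3 };       /* inc ebx          */
static const unsigned char ADD_EBX_1[] = { 0x83, 0xC3, 0x01 }; /* add ebx, 1       */
static const unsigned char LEA_EBX_1[] = { 0x8D, 0x5B, 0x01 }; /* lea ebx, [rbx+1] */

/* Number of code bytes saved by choosing inc over add with an 8-bit immediate. */
static size_t bytes_saved_by_inc(void) {
    return sizeof ADD_EBX_1 - sizeof INC_EBX;
}

/* Small self-check of the sizes listed above. */
static void encoding_size_demo(void) {
    assert(bytes_saved_by_inc() == 1);
    assert(sizeof LEA_EBX_1 == sizeof ADD_EBX_1); /* lea saves nothing */
}
```

In 32-bit mode `inc ebx` also has a one-byte form (`0x43`), which would save two bytes; in 64-bit mode that byte is a REX prefix, so only the two-byte form is available.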

Some more experimentation shows that if you don't use the -march=native option, gcc uses add ebx, 1 instead. Clang, on the other hand, always prefers inc. My conclusion is that when you asked the question in 2012, ADD was sometimes preferable, but now in 2016 you should always go with INC.

Yup, looking at what compilers choose is often a good strategy. (Even in 2012, `inc` was totally fine, though. P4 was already irrelevant at that point.) I've noticed that gcc's instruction cost estimates seem to focus more on latency than throughput. Maybe that's a good strategy in general. e.g. it will use two `lea` instructions to replace a multiply by a constant, even when tuning for Haswell. clang does prefer code-size / insn count / throughput by using `imul r32, r32, imm` for multiplying by small constants, unless it can do it with a single LEA (like `lea eax, [rcx+rcx*4]`). — Peter Cordes, May 08 '16 at 03:27
This is actually no longer true: https://godbolt.org/z/Nup-2I. GCC uses `add ebx, 1` for `-O0` to `-O3` and `inc ebx` for `-Os`. — r00ster, Oct 27 '19 at 06:13
You have to add `-march=native` to get gcc to use `inc` instead of `add`. — Björn Lindqvist, Oct 27 '19 at 19:10
