11

Quick question, assuming beforehand

mov eax, 0

which is more efficient?

inc eax
inc eax

or

add eax, 2

Also, in case the two incs are faster, do compilers (say, GCC) commonly (i.e. without aggressive optimization flags) optimize var += 2 to it?

PS: Don't bother to answer with a variation of "don't prematurely optimize", this is merely academic interest.

  • The answer will probably be processor-specific and in most cases there will most likely be no measurable difference. If you're *really* interested in knowing the answer for a specific CPU then benchmark it. – Paul R May 13 '11 at 14:32
  • Possible duplicate of [Is ADD 1 really faster than INC ? x86](http://stackoverflow.com/questions/13383407/is-add-1-really-faster-than-inc-x86) – phuclv May 21 '17 at 10:26

4 Answers

21

Two inc instructions on the same register (or, more generally, two read-modify-write instructions) always form a dependency chain of at least two cycles. This assumes a one-cycle latency for inc, which has been the case since the 486. That means if the surrounding instructions can't be interleaved with the two incs to hide that latency, the code will execute more slowly.
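As a rough sketch (the surrounding add instructions and the values in ebx/ecx/edx/esi are hypothetical, purely to illustrate the point), independent work can overlap with the two-cycle chain, while the second inc always has to wait for the first:

inc eax          ; cycle 1 of the eax dependency chain
add ebx, ecx     ; independent of eax, can execute in the same cycle
inc eax          ; cycle 2, must wait for the result of the first inc
add edx, esi     ; independent again, also overlaps with the chain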

But no compiler will emit the instruction sequence you propose anyway; mov eax,0 will be replaced by xor eax,eax (see What is the purpose of XORing a register with itself?). The sequence

mov eax,0
inc eax
inc eax

will be optimized to

mov eax,2
Gunther Piez
  • Do note that `xor eax, eax; inc eax` is favoured over `mov eax, 1` by most compilers, though. May be due to the fact that it's 3 bytes rather than 5. – Polynomial Aug 15 '13 at 22:27
  • @LưuVĩnhPhúc `mov eax, 1` is 5 bytes: `b8 01 00 00 00`. It's 10 bytes for 64-bit, due to the 8-byte literal and QWORD prefix: `48 b8 01 00 00 00 00 00 00 00`. Comparatively, `xor rax, rax; inc eax` is only 5 bytes: `48 31 c0 ff c0` – Polynomial May 26 '14 at 22:27
  • @Polynomial: All modern mainstream compilers will use `mov eax,1`, unless you specifically tell them to optimize for size instead of speed (https://godbolt.org/z/Kn7jE5 - clang or ICC `-Os -m32` or MSVC `-O1` will use `xor`/`inc` in 32-bit mode. `gcc -Os -m32` still uses mov). When optimizing for speed, saving 2 bytes of code size isn't worth an extra uop for the back-end (or an extra instruction for the front-end to decode). clang `-Oz` to optimize for size *without* caring about speed will use `push 1` / `pop rax` in 64-bit mode. All those compilers use `mov` with normal options. – Peter Cordes Oct 17 '20 at 08:08
  • @phuclv: you should probably clean up your earlier comments; several errors in the early ones (but also in Polynomial's 2nd comment, with inflated byte-counts for 64-bit). The options are `mov eax, 1` (5 bytes) to set RAX=1 via implicit zero extension, or `push 1` / `pop rax` (3 bytes), or `xor eax,eax` / `inc eax` (4 bytes). But compilers just use `mov` unless optimizing for size over speed. [Tips for golfing in x86/x64 machine code](https://codegolf.stackexchange.com/a/132985) points out that a 3-byte `lea eax, [rdx+1]` from another register of known value can be useful – Peter Cordes Oct 17 '20 at 08:13
14

If you ever want to know the raw performance stats of x86 instructions, see Dr. Agner Fog's listings (volume 4, to be exact). As for the part about compilers, that's dependent on the compiler's code generator, and not something you should rely on too much.

On a side note: I find it funny/ironic that in a question about performance, you used MOV EAX,0 to zero a register instead of XOR EAX,EAX :P (and if MOV EAX,0 was done beforehand, the fastest variant would be to remove the incs and adds and just use MOV EAX,2).

Necrolis
3

From the Intel manual that you can find here, it looks like the ADD/SUB instructions are half a cycle cheaper on one particular architecture. But remember that Intel uses an out-of-order execution model for its (recent) processors. This primarily means that performance bottlenecks show up wherever the processor has to wait for data to come in (e.g. it ran out of things to do during an L1/L2/L3/RAM data fetch). So if your profiler tells you INC might be the problem, look at it from a data-throughput point of view instead of looking at raw cycle counts.

Instruction     Latency             Throughput          Execution Unit
CPUID           0F_3H     0F_2H     0F_3H     0F_2H     0F_2H
ADD/SUB         1         0.5       0.5       0.5       ALU
[...]
DEC/INC         1         1         0.5       0.5       ALU
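
To make the data-throughput point concrete, here is a hedged sketch (a hypothetical loop, not taken from the manual): when a loop is limited by its memory accesses, the choice between inc and add disappears behind the load latency:

increment_loop:
    mov  eax, [esi]     ; the load is what the processor actually waits for
    inc  eax            ; the one-cycle ALU op hides behind the memory latency
    mov  [esi], eax     ; store the incremented value back
    add  esi, 4         ; advance to the next 32-bit element
    cmp  esi, edi       ; esi/edi are assumed to delimit the array
    jb   increment_loop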
Jasper Bekkers
  • IIRC 0f_2h is the P4 Prescott, may he rest in peace. Those half-clock latencies result from an internally double-clocked pipeline. It turned out to be a very bad idea for Intel. – Gunther Piez May 13 '11 at 15:06
2

For all practical purposes, it probably doesn't matter. But take into account that inc uses fewer bytes.
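For example, the encodings in 32-bit mode are (in 64-bit mode inc eax needs the two-byte ff c0 form instead, because the single-byte 40-4f opcodes were reused as REX prefixes):

inc eax          ; 40        (1 byte)
inc eax          ; 40        (1 byte)
add eax, 2       ; 83 c0 02  (3 bytes, imm8 form)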

Consider the following code:

int x = 0;
x += 2;

Without using any optimization flags, GCC compiles this code into:

80483ed:       c7 44 24 1c 00 00 00    movl   $0x0,0x1c(%esp)
80483f4:       00 
80483f5:       83 44 24 1c 02          addl   $0x2,0x1c(%esp)

Using -O1 and -O2, it becomes:

c7 44 24 08 02 00 00    movl   $0x2,0x8(%esp)

Funny, isn't it?

karlphillip