
I've read many times over the years that you should use `xor ax, ax` because it is faster... or that when programming in C you should use `counter++` or `counter += 1` because they compile to an `INC` or an `ADD`... or that on the NetBurst Pentium 4, `INC` was slower than `ADD 1`, so the compiler had to be told your target was NetBurst so it would translate every `var++` into an `ADD 1`...

My question is: why do INC and ADD have different performance? Why, for example, was INC claimed to be slower on NetBurst, while being faster than ADD on other processors?

speeder
  • I think this question is relevant only to x86 architectures. – Ira Baxter Aug 28 '12 at 16:44
  • I don't know of any micro-architectures where `inc` is faster in and of itself. The only advantage that I can see it might ever have is a smaller size. By the way, `x++` and `x+=1` do not necessarily translate to `inc` and `add` respectively except in ultra lame compilers. – harold Aug 28 '12 at 16:47
  • On x86 architectures, with variable-length instruction encoding, there could be occasions where either one is preferable over the other. If the shorter one would fit in a cache line or decode block where the larger wouldn't, then it will come out ahead. If the shorter one would leave half of the next instruction in the window, and the remaining half in the next window, the larger one might be better by aligning its successor nicely. – Phil Miller Aug 28 '12 at 17:00
  • @Lưu Vĩnh Phúc look at the question date, my question is older than the one you linked, it can't be a duplicate (unless you believe I can time travel) – speeder Apr 21 '16 at 12:44
  • @speeder ["The general rule is to keep the question with the best collection of answers, and close the other one as a duplicate"](https://meta.stackexchange.com/a/10844/230282), time isn't relevant here. Tons of 2010 questions were closed by better 2016 questions – phuclv May 21 '17 at 10:27

2 Answers


For the x86 architecture, INC updates only a subset of the condition codes (it leaves the carry flag untouched), whereas ADD updates the entire set of condition codes. (Other architectures have different rules, so this discussion may or may not apply to them.)

So an INC instruction must wait for any previous instructions that update the condition-code bits to finish before it can merge its partial update into that previous value and produce its final condition-code result.

ADD can produce final condition code bits without regard to previous values of the condition codes, so it doesn't need to wait for previous instructions to finish computing their value of the condition codes.

Consequence: ADD can execute in parallel with lots of other instructions, while INC can overlap with fewer. Thus ADD appears to be faster in practice.

(I believe there is a similar issue with 8-bit registers (e.g., AL) in the context of full-width registers (e.g., EAX): an update to AL requires that previous updates to EAX complete first.)

I don't use INC or DEC in my high performance assembly code anymore. If you aren't ultrasensitive to execution times, then INC or DEC is just fine and can reduce the size of your instruction stream.

Ira Baxter
  • This sounds interesting. Do you *know* the EFLAGS issue causes micro-architectural delays for INC or is this a speculation? I wonder if there is some speculative way to treat EFLAGS to eliminate most delays for INC/DEC. – srking Aug 28 '12 at 16:51
  • @srking many micro-archs split EFLAGS into separate parts to avoid that false dependency. Stalls still happen when the parts have to be recombined (such as jumps that use the carry flag together with some other flag). – harold Aug 28 '12 at 16:59
  • I don't know how smart the processor(s) are, but those engineers have a lot of transistors to play with. An interesting observation: if an INC instruction is followed by an ADD, the condition codes from the INC are no longer interesting, and the dependency on previous CC values is technically not needed by the INC. So, following an INC with something that trashes the entire set of CC bits may speed it up :-} Pretty weird to put your optimization code *after* the computation! – Ira Baxter Aug 29 '12 at 03:18
  • I'm pretty sure the partial eflags stall is documented in the Intel optimization guide. Or maybe in Agner Fog's whitepapers. – Andy Ross Aug 29 '12 at 21:27
  • the AL/AX/EAX partial-register problem is the reason [why most x64 instructions zero the upper part of a 32 bit register](http://stackoverflow.com/q/11177137/995714) – phuclv Apr 05 '16 at 15:28
  • See also [a followup question](https://stackoverflow.com/questions/36510095/inc-instruction-vs-add-1-does-it-matter) that goes more into the CPU internals of why there's a difference, and what's special about partial-flag updates. In my answer there, I explained why avoiding INC/DEC is only useful in rare cases on CPUs other than P4. I marked this question as a duplicate of that, since it looks to me like a more-detailed re-asking of the same question as this. – Peter Cordes May 30 '17 at 02:32
  • Seems like a pretty good update. I'm glad my answer is now officially stale and incorrect. Thanks, Peter. – Ira Baxter May 30 '17 at 03:48

The XOR ax, ax bit is, I gather, a few years out of date, and assigning zero now beats it (so I'm told).

The C bit about counter++ rather than counter+=1 is a couple of decades out of date. Definitely.

The simple reason for the first one, with assembly, is that every instruction is translated into some sort of operation inside the CPU, and while the designers always try to make everything as fast as possible, they do a better job with some instructions than with others. It's not hard to imagine how an INC could be faster, since it only has to deal with one register, though that's grossly over-simplified (but I don't know much about these things, so over-simplifying is all I can do on that part).

The C one, though, is long-outdated nonsense. If we have a particular CPU where INC beats ADD, why on earth would the compiler writer not emit INC instead of ADD for both counter++ and counter+=1? Compilers do a lot of optimisation, and that sort of change is far from the most complicated.

Jon Hanna
  • Actually the `xor` trick is even better now. On Sandy Bridge it is handled by the register renamer - it doesn't even go to any execution unit anymore. – harold Aug 28 '12 at 16:49
  • @harold sweet. Yet another reason to be glad I'm benefiting from the compiler people knowing more about this than I ever will ;) – Jon Hanna Aug 28 '12 at 16:54
  • @harold Sandy Bridge even special-cases `sub X,X` as well. Not sure why it'd need to, since all sane compilers already use `xor X,X` anyway. – Mysticial Aug 28 '12 at 16:56
  • @Mysticial that's a bit backwards, surely? If they only concentrated on what the compilers currently used, progress in the state of the art would be stymied, no? – Jon Hanna Aug 28 '12 at 17:01
  • @Mysticial, it doesn't particularly matter what sane compilers do; Intel spends a LOT of time trying to get fortran/cobol/assembly code that was compiled 20 years ago with crappy compilers to run faster. In many cases the source code is nowhere to be found, or the toolchains that produced the executable don't exist anymore. – Danny Aug 29 '12 at 02:13