
I usually think that preincrement is more efficient than postincrement in C++. But when I read the book Game Engine Architecture (2nd ed.) recently, I found a section that says postincrement is preferred over preincrement in a for loop, because, as I quote, "preincrement introduces a data dependency into your code -- the CPU must wait for the increment operation to be completed before its value can be used in the expression." Is this true? (It really subverted my understanding of this problem.)

Here is the quote from the section in case you are interested:

5.3.2.1 Preincrement versus Postincrement

Notice in the above example that we are using C++’s postincrement operator, p++, rather than the preincrement operator, ++p. This is a subtle but sometimes important optimization. The preincrement operator increments the contents of the variable before its (now modified) value is used in the expression. The postincrement operator increments the contents of the variable after it has been used. This means that writing ++p introduces a data dependency into your code -- the CPU must wait for the increment operation to be completed before its value can be used in the expression. On a deeply pipelined CPU, this introduces a stall. On the other hand, with p++ there is no data dependency. The value of the variable can be used immediately, and the increment operation can happen later or in parallel with its use. Either way, no stall is introduced into the pipeline.

Of course, within the “update” expression of a for loop (for(init_expr; test_expr; update_expr) { ... }), there should be no difference between pre- and postincrement. This is because any good compiler will recognize that the value of the variable isn’t used in update_expr. But in cases where the value is used, postincrement is superior because it doesn’t introduce a stall in the CPU’s pipeline. Therefore, it’s good to get in the habit of always using postincrement, unless you absolutely need the semantics of preincrement.

Edit: added "the above example".

void processArray(int container[], int numElements)
{
    int* pBegin = &container[0];
    int* pEnd = &container[numElements];
    for (int* p = pBegin; p != pEnd; p++)
    {
        int element = *p;
        // process element...
    }
}

void processList(std::list<int>& container)
{
    std::list<int>::iterator pBegin = container.begin();
    std::list<int>::iterator pEnd = container.end();
    std::list<int>::iterator p;
    for (p = pBegin; p != pEnd; p++)
    {
        int element = *p;
        // process element...
    }
}
Jaege
  • What is the "above example"? – Ry- Aug 15 '16 at 00:20
  • 3
    I don't agree with the assertion from the quoted text, but: early CPUs had "baked-in" support for post-increment and pre-decrement addressing modes. See [the Motorola 68000 instruction set details](https://en.wikipedia.org/wiki/Motorola_68000#Instruction_set_details) for example. Implementing post-increment or pre-decrement addressing modes required fewer CPU instructions than pre-increment and post-decrement. – Sam Varshavchik Aug 15 '16 at 00:23
  • 1
    @SamVarshavchik Exactly. This comes from the PDP-11 if not before. Note that only two of the four possible instructions were provided in hardware. They were primarily to facilitate stack operations. – user207421 Aug 15 '16 at 01:10
  • Thanks for including the "above example". It appears, however, that the book chose an unfortunate example to demonstrate the use of post-increment. They don't use the result of the expression, so it makes no difference to the efficiency whatsoever - as stated in the second paragraph that you quoted. – eerorika Aug 15 '16 at 01:10
  • I agree, I think they've let the air out of their own claim by failing to justify it. Because they've used `p++` in a case where it makes no difference, and because they've said this is a "subtle but sometimes important optimization", they're basically advocating cargo-cult programming in the opposite direction to the cargo-cult they deprecate. It's *not* good to get into the habit of using post- unless you absolutely need the semantics of pre-, because of the inefficiencies you'll introduce in other cases if you go around the place not thinking about your code. – Steve Jessop Aug 15 '16 at 01:17
  • For iterators pre-increment is generally considered the more optimal because post increment requires making a copy. – Galik Aug 15 '16 at 01:18
  • I doubt this is true for objects that implement the prefix/postfix ++ operators. For basic data types then maybe. For objects there's more performance gains to be achieved using the prefix operator due to not having to make a copy of the object, and returning it. – sashang Aug 15 '16 at 01:21

2 Answers


"pre-increment introduces a data dependency into your code -- the CPU must wait for the increment operation to be completed before its value can be used in the expression."

Is this true?

It is mostly true - although perhaps overly strict. Pre-increment doesn't necessarily introduce a data dependency - but it can.

A trivial example for exposition:

a = b++ * 2;

Here, the increment can be executed in parallel with the multiplication. The operands of both the increment and the multiplication are immediately available and do not depend on the result of either operation.

Another example:

a = ++b * 2;

Here, the multiplication must be executed after the increment, because one of the operands of the multiplication depends on the result of the increment.

Of course, these statements do slightly different things, so the compiler might not always be able to transform the program from one form to the other while keeping the semantics the same - which is why using the post-increment might make a slight difference in performance.

A practical example, using a loop:

for(int i= 0; arr[i++];)
    count++;

for(int i=-1; arr[++i];) // more typically: for(int i=0; arr[i]; ++i)
    count++;

One might think that the latter is necessarily faster, reasoning that "post-increment makes a copy" - which would indeed be true for non-fundamental types. However, due to the data dependency (and because int is a fundamental type with no overloaded increment operator), the former can theoretically be more efficient. Whether it actually is depends on the CPU architecture and the ability of the optimizer.

For what it's worth - in a trivial program, on the x86 architecture, using the g++ compiler with optimization enabled, the above loops had identical assembly output, so they are perfectly equivalent in that case.


Rules of thumb:

If the counter is a fundamental type and the result of increment is not used, then it makes no difference whether you use post/pre-increment.

If the counter is not a fundamental type and the result of the increment is not used and optimizations are disabled, then pre-increment may be more efficient. With optimizations enabled, there is usually no difference.

If the counter is a fundamental type and the result of increment is used, then post-increment can theoretically be marginally more efficient - in some CPU architecture - in some context - using some compiler.

If the counter is not a fundamental type and the result of the increment is used, then pre-increment is typically faster than post-increment. Also, see R Sahu's answer regarding this case.

eerorika
  • Doesn't this depend where `b` is stored? If it's in memory, then even with `a = ++b * 2`, the part of the increment that actually takes the time, writing it back to memory, still can be executed in parallel with the multiplication, can't it? Whether it helps probably depends what arithmetic operations are available on the CPU, but for example `a = ++b * 2` -> `a = (b+1) * 2; b++` is a completely valid transformation if the optimizer wants to make it. There's no data dependency on modifying `b`, although there is one on the addition. – Steve Jessop Aug 15 '16 at 00:35
  • @SteveJessop sure, the multiplication wouldn't need to wait for writing the result into memory. We are considering a delay of ~1 cycle here. I'm not claiming that my trivial example couldn't be optimized by the compiler, but neither can I prove that all pre increments can be transformed into an optimal form. – eerorika Aug 15 '16 at 00:38
  • Well, either they've got benchmarks or they're blagging :-) – Steve Jessop Aug 15 '16 at 00:39
  • Don't those expressions have different semantics? – Severin Pappadeux Aug 15 '16 at 00:39
  • @SeverinPappadeux: sure: you *also* have to be able to somehow rewrite your loop to deal with that. So if you would have written `*++p = *++q` to copy some data, you have to figure out how to change the start and end conditions of the loop if you want to change that to `*p++ = *q++`. – Steve Jessop Aug 15 '16 at 00:40
  • Do you mean that `a = (1 + b++) * 2` is faster than `a = ++b * 2` ? (to compare same behaviour). – Jarod42 Aug 15 '16 at 00:43
  • @SteveJessop well, it is kind of pointless to compare speed/instructions of expression with different semantics. – Severin Pappadeux Aug 15 '16 at 00:48
  • @Jarod42 No. I mean that `a = b++ * 2` is faster (at least in theory, and only marginally) than `a = ++b * 2`. The trick is transforming your program so that even though individual expression has different semantics, the loop has the same (or sufficiently similar?) semantics as a whole. I'm sure that *Game Engine Architecture(2nd ed.)* has a practical example of that *in the above example* that was left out of the quote in OP. – eerorika Aug 15 '16 at 00:49
  • 1
    @SeverinPappadeux: no, it's not pointless. If you somehow know that one loop body will run faster than the other (despite giving the wrong result), then the question becomes whether or not you can find a loop end condition that fixes the behaviour without sacrificing more speed than you'd gain by using the faster version of the (rest of the) body. Maybe you can't find one, but supposing the gain would be significant if achieved then it's not at all pointless to attempt to determine whether or not you can. – Steve Jessop Aug 15 '16 at 01:02
  • I've added a practical loop example. – eerorika Aug 15 '16 at 01:04

One point of data from my experience.

Changing a post-increment to a pre-increment of a std::map::iterator in for loops resulted in noticeable savings in a core algorithm at my work.

In general, when incrementing an iterator that is a class, i.e. it is not a pointer, you should notice savings when using the pre-increment operator. The reason is that the pre-increment operator function changes the object in place, while the post-increment operator function usually involves the creation of a temporary object.

A pre-increment operator is usually implemented as:

T& T::operator++()
{
   // Change state
   ...

   // Return the object
   return *this;
}

while a post-increment operator is usually implemented as:

T T::operator++(int)
{
   // Create a temporary object that is a copy of the current object.
   T temp(*this);

   // Change state of the current object
   ...

   // Return the temporary object.
   return temp;
}
R Sahu
  • 2
    This is dead on. If you are incrementing POD then it really doesn't matter. If it is a class under the hood, the preincrement is preferred as it can do less work. – EvilTeach Aug 15 '16 at 01:22
  • These findings seem contradictory to the above answer. Don't know which one to accept. :) – SexyBeast Oct 19 '16 at 10:53
  • @SexyBeast I don't think they are contradictory. "If the counter is not a fundamental type [...], pre-increment may be more efficient/is typically faster" (from the other answer) is exactly what this answer here conveys. – bers Aug 09 '22 at 09:30