
I came up with this question from the following answer: Efficiency of postincrement v.s. preincrement in C++

There I've found this expression:

a = b++ * 2;

They said that, in the line above, b++ can run in parallel with the multiplication.

How can b++ run in parallel with the multiplication?

What I've understood about the procedure is: first we copy b's value to a temporary variable, then increment b, and finally multiply that temporary variable by 2. We're not multiplying by b but by that temporary variable, so how can it run in parallel?
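
Roughly, I picture the procedure like this (my own illustrative rewrite in plain statements, not what a compiler actually emits):

int tmp = b;   // copy b's old value to a temporary
b = b + 1;     // increment b
a = tmp * 2;   // finally, multiply the temporary (not b) by 2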

I got the above idea about the temporary variable from another answer: Is there a performance difference between i++ and ++i in C++?

Al Mamun
  • I think they meant that the incrementation can happen after the evaluation of the value of `a`, and the value of `a` doesn't have to wait for the increment to finish. – Rakete1111 Oct 03 '16 at 08:13
  • I never thought about this; the question gives me a new dimension to think about. – SimpleGuy Oct 03 '16 at 08:20
  • I cannot speak for those who wrote the answer you linked, but you should have a look at [instruction pipelining](https://en.wikipedia.org/wiki/Instruction_pipelining). – Holt Oct 03 '16 at 08:21
  • Please define "in parallel". If we're talking about the C++ theoretical memory model, the operations are sequential (and strictly ordered in relation to each other). The as-if rule *may* allow instruction re-ordering to provide a performance benefit. Instruction pipelining will also play its part, but there will never be parallelism. – Richard Hodges Oct 03 '16 at 08:30
  • @RichardHodges I meant that the two instructions are independent; it doesn't matter whether one executes first or they execute at the same time, the result will be the same. – Al Mamun Oct 03 '16 at 08:53
  • I see. In reality, the as-if rule will have more effect in real compiled code. In general, pre-increment will always be faster because it does not imply a copy. For an int or a double, the copy is irrelevant. For a more complex object, it is not. It is better practice to prefer pre-increment if possible. – Richard Hodges Oct 03 '16 at 08:55
  • How about using pointers? E.g. *b++. Many CPU architectures have load/store instructions with auto-incrementing index registers (ARM, etc.). On these CPUs we can perform two C expressions in a single instruction, 'in parallel'. – PaulHK Oct 03 '16 at 10:09
  • @PaulHK: Pointers tend to break the aliasing detection of the optimizer. Basically, `*b++` can change _any_ integer unless explicitly proven otherwise. This often forces the compiler to insert extra loads. And since memory loads are far more expensive (10x-100x) than normal arithmetic, this is generally a pessimization. In fact, the chief manual optimization that's still viable today is to hoist a pointer dereference out of a loop, when you know it doesn't alias other variables. (C99 introduced `restrict` to help the optimizer with this.) – MSalters Oct 03 '16 at 11:41
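
The hoisting described in the last comment might look like this (a hypothetical example to illustrate the idea, not code from the question):

void scale(int *out, int *p, int n) {
    // Without more information, the compiler must assume the store
    // to out[i] may alias *p, and so reloads *p on every iteration.
    for (int i = 0; i < n; ++i)
        out[i] = *p * 2;
}

void scale_hoisted(int *out, int *p, int n) {
    int v = *p;   // hoisted: load once, outside the loop
    for (int i = 0; i < n; ++i)
        out[i] = v * 2;
}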

3 Answers


What you are talking about is instruction-level parallelism. The processor in this case can execute the increment of b while also multiplying b's old value, held in the copy. This is very fine-grained parallelism, at the processor level, and in general you can expect it to give you some advantage, depending on the architecture.

In the case of pre-increment, instead, the multiplication must wait in the processor's pipeline for the result of the increment before it can execute, hence giving you a penalty. Note, however, that the two forms are not semantically equivalent: the resulting value of a will be different.
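
One way to visualize the two dependency chains (an illustrative C++ decomposition of each statement; a real processor schedules machine instructions, not C++ statements):

// a = b++ * 2: the increment and the multiplication both need only
// b's old value, so the processor may overlap them
int old = b;
b = old + 1;
a = old * 2;

// a = ++b * 2: the multiplication consumes the increment's result,
// so it must wait for it in the pipeline
b = b + 1;
a = b * 2;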


@SimpleGuy's answer is a pretty reasonable first approximation. The trouble is, it assumes a halfway point between the simple theoretical model (no parallelization) and the real world. This answer tries to look at a more realistic model, still without assuming one particular CPU.

The chief thing to realize is that real CPUs have registers and caches. These exist because memory operations are far more expensive than simple math. Parallelization of the integer increment and the integer bit shift (*2 is optimized to <<1 on real CPUs) is a minor concern; the optimizer will chiefly look at avoiding load stalls.

So let's assume that a and b aren't in a CPU register. The relevant memory operations are LOAD b, STORE a and STORE b. Everything starts with LOAD b, so the optimizer may move that up as far as possible, even ahead of a previous instruction when possible (aliasing is the chief concern here). The STORE b can start as soon as b++ has finished. The STORE a can happen after the STORE b, so it's not a big problem that it's delayed by one CPU instruction (the <<1), and there's little to be gained by parallelizing the two operations.
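
As a sketch, the lowering described above might look like this (C++ statements standing in for instructions; the ordering shown is an assumption about one plausible schedule, not a real ISA listing):

int r = b;    // LOAD b, hoisted as early as aliasing allows
b = r + 1;    // increment; STORE b can start as soon as this finishes
a = r << 1;   // *2 lowered to <<1; STORE a follows STORE b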

MSalters
  • But is it possible to run in parallel all the instructions inside `b++ * 2`? (Keeping that temporary variable in mind.) In the case of `++b * 2`, we are definitely unable to run all the instructions in parallel. – Al Mamun Oct 03 '16 at 11:39
  • @AlMamun: So there are at least 5 operations, 3 of which are memory operations. Obviously you cannot run the memory operations in parallel; you can't load a variable while you're also storing to it. But you _can_ parallelize the store of `b` with the multiplication (bitshift) that produces `a`. – MSalters Oct 03 '16 at 11:43

The reason that b++ can run in parallel is as follows.

b++ is a post-increment, which means the value of the variable is incremented after use. The line below can be broken into two parts:

a = b++ * 2

Part-1: Multiply b by 2

Part-2: Increment the value of b by 1

Since the above two parts are not dependent on each other, they can be run in parallel.

Had it been the case of pre-increment, which means incrementing before use:

a = ++b * 2

The parts would have been

Part-1: Increment the value of b by 1

Part-2: Multiply the (new) b by 2

As can be seen above, part 2 can run only after part 1 has executed, so there is a dependency and hence no parallelism.
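
A minimal sketch of both decompositions, with the parts labelled as above (illustrative C++; the temporary makes the post-increment case explicit):

// a = b++ * 2
int old = b;   // copy the old value
a = old * 2;   // Part-1: multiply b (the old value) by 2
b = old + 1;   // Part-2: increment b; independent of Part-1

// a = ++b * 2
b = b + 1;     // Part-1: increment the value of b by 1
a = b * 2;     // Part-2: multiply the (new) b by 2; must wait for Part-1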

SimpleGuy