
I have a struct which holds values that are used as the arguments of a for loop:

struct ARGS {
    int endValue;
    int step;
    int initValue;
};

ARGS * arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < arg->endValue; i+=arg->step) {
    //...
}

Since the values of endValue and step are read on each iteration, would it be faster if I copied them into local variables before the loop?

int initValue = arg->initValue;
int endValue = arg->endValue;
int step = arg->step;

for (int i = initValue; i < endValue; i += step) {
    //...
}
SIMEL
  • The optimizer would likely move them to registers anyway. Why not look at the assembler output? – Jiminion May 07 '14 at 17:00
  • @Jim what about multithreading? Ilya - why don't you profile and find out? – Luchian Grigore May 07 '14 at 17:01
  • First, everything depends on a) what is inside the loop, and b) what is outside the loop. It is easy to construct code where the `for` loop itself, its testing and incrementing, take 99.9% of the time, or 0.1% of the time, or 0.1% in each of 999 different places. Second, people say "profile", but not just any profiler will do. You need one that gives line-level, if not instruction-level, resolution, on wall-clock time. [*Here's my method.*](http://stackoverflow.com/a/378024/23771) – Mike Dunlavey May 07 '14 at 18:24
  • @MikeDunlavey: I would be surprised if interrupting the process inside a debugger would be able to pinpoint a difference between `i+=step` and `i+=arg->step`. Now what it will be useful for is *ignoring* this detail, as 99.99% of the case most time will be spent somewhere else! – David Rodríguez - dribeas May 07 '14 at 19:54
  • @DavidRodríguez-dribeas: It doesn't pinpoint differences. It pinpoints instructions taking a large fraction of time, *if* they take a large fraction of time. If they don't, it tells you what does. If `i += arg->step` is taking more than 10% of time, it will tell you, and it will tell you if the `->` is taking it. If so, you fix it and measure the difference with overall timing. What interrupting does is give you pinpoint precision, not of time, but of what is taking time. – Mike Dunlavey May 07 '14 at 20:18
  • If you want to know which of two things is faster then *write the code both ways, run it both ways, and then you'll know*. If you have two horses and you want to know which is faster, *race them*, don't *talk about racing them*. – Eric Lippert May 07 '14 at 21:30
  • @Eric: The problem is, without context, it's meaningless. I'm sure you've seen the questions on SO where somebody asks Is Loop X Faster than Loop Y, when there's a big-old `cout` in the middle of it :) – Mike Dunlavey May 08 '14 at 01:17
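
Along the lines the commenters suggest, here is a minimal timing sketch; the work() body and the numbers are hypothetical placeholders, and a serious comparison should use a proper benchmarking setup and inspect the generated assembly:

#include <chrono>
#include <cstdio>

struct ARGS {
    int endValue;
    int step;
    int initValue;
};

// Hypothetical stand-in for the real loop body.
int work(int i) { return i * 2; }

int main() {
    ARGS args = {100000000, 1, 0};
    ARGS *arg = &args;
    volatile long long sink = 0; // keeps the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = arg->initValue; i < arg->endValue; i += arg->step)
        sink = sink + work(i);
    auto t1 = std::chrono::steady_clock::now();

    int initValue = arg->initValue, endValue = arg->endValue, step = arg->step;
    for (int i = initValue; i < endValue; i += step)
        sink = sink + work(i);
    auto t2 = std::chrono::steady_clock::now();

    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::printf("indirect: %lld ms, local: %lld ms\n",
                (long long)duration_cast<milliseconds>(t1 - t0).count(),
                (long long)duration_cast<milliseconds>(t2 - t1).count());
}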

3 Answers


The clear-cut answer is that in 99.9% of the cases it does not matter, and you should not be concerned with it. There might be micro-level differences, but they won't matter to almost anyone, and the gory details depend on the architecture and the optimizer. But bear with me: "does not matter" here means that with very, very high probability there is no difference.

// case 1
ARGS * arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < endValue; i+=arg->step) {
    //...
}

// case 2
int initValue = arg->initValue;
int step = arg->step;

for (int i = initValue; i < endValue; i += step) {
    //...
}

In the case of initValue, there will not be a difference. The value will be loaded through the pointer and stored into the initValue variable, just to store it in i. Chances are that the optimizer will skip initValue and write directly to i.

The case of step is a bit more interesting, in that the compiler can prove that the local variable step is not shared by any other thread and can only change locally. If the pressure on registers is small, it can keep step in a register and never have to access memory for it. On the other hand, it cannot assume that arg->step is not changed by external means, and is required to go to memory to read the value on each iteration. Understand that memory here most probably means the L1 cache. An L1 cache hit on a Core i7 takes approximately 4 CPU cycles, which roughly means 2 * 10^-9 seconds (on a 2GHz processor). And that is under the assumption that the compiler can keep step in a register in the first place, which may not be the case. If step cannot be held in a register, you will pay for the access to memory (cache) in both cases.
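
A sketch of that aliasing constraint, assuming a hypothetical external function touch() whose definition the compiler cannot see:

struct ARGS { int endValue; int step; int initValue; };

void touch(int *p); // opaque: defined in another translation unit

void loop_indirect(ARGS *arg, int *out) {
    for (int i = arg->initValue; i < arg->endValue; i += arg->step) {
        touch(out); // touch() might modify *arg through some alias, so the
                    // compiler must reload arg->step and arg->endValue from
                    // memory (most likely L1) on every iteration
    }
}

void loop_local(ARGS *arg, int *out) {
    int endValue = arg->endValue; // snapshots whose addresses never escape,
    int step = arg->step;         // so no call can change them
    for (int i = arg->initValue; i < endValue; i += step) {
        touch(out); // endValue and step can stay in registers across this call
    }
}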

Write code that is easy to understand, then measure. If it is slow, profile and figure out where the time is really spent. Chances are that this is not the place where you are wasting CPU cycles.

David Rodríguez - dribeas
  • maybe 10^9 is missing a minus sign? – 6502 May 07 '14 at 17:27
  • You are right that it needs to ensure that the step field of the structure is not changing, but that has nothing to do with multithreading - the compiler will optimize assuming that only this thread modifies the variable. If you want correct ordering of read/write operations on memory (e.g. for multithreaded access) you need to (at the very least) declare your variables `volatile`. – j_kubik May 07 '14 at 17:29
  • @6502: is it? are you sure that it does not take 15 years to access L1? Uhm... maybe you are right. Thanks :) – David Rodríguez - dribeas May 07 '14 at 17:30
  • @j_kubik: Yes, you are right to the extent that if there is a function call inside the loop it cannot assume that the function will not modify the variable. But you are completely **wrong** on the `volatile` count. – David Rodríguez - dribeas May 07 '14 at 17:33
  • *"A L1 cache hit on a Core i7 takes approximately 4 cpu cycles, which roughly means 0.5 * 10-9 seconds (on a 2Ghz processor)"*. Oh my, how do you know/measure that figure? :| – Nawaz May 07 '14 at 17:49
  • Great explanation BTW. +1. – Nawaz May 07 '14 at 17:49
  • @Nawaz: Intel engineers measure that: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, page 8: *The local copies of the lines that are accessed in this way are kept in the 32KB L1 data cache. The access latency to this cache is 4 cycles.* – David Rodríguez - dribeas May 07 '14 at 18:07

This depends on your architecture. Whether it is a RISC or CISC processor affects how memory is accessed, and the available addressing modes matter on top of that.

On the ARM code I work with, typically the base address of a structure will be moved into a register, and then it will execute a load from that address plus an offset. To access a variable, it will move the address of the variable into the register, then execute the load without an offset. In this case it takes the same amount of time.

Here's what the example assembly code might look like on ARM for accessing the second int member of a structure, compared to directly accessing a variable.

ldr r0, =MyStruct        ; struct {int x, int y} MyStruct
ldr r0, [r0, #4]         ; load MyStruct.y into r0

ldr r1, =MyIntY          ; int MyIntX, MyIntY
ldr r1, [r1]             ; directly load MyIntY into r1.

If your architecture does not allow addressing with offsets, then it would need to move the address into a register and then perform the addition of the offset.

Additionally, since you've tagged this as C++ as well, if you overload the -> operator for the type, then this will invoke your own code which could take longer.
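
For instance, a minimal sketch with a made-up CountingPtr wrapper, just to show that an overloaded operator-> runs user code on every access:

#include <cstdio>

struct ARGS { int endValue; int step; int initValue; };

struct CountingPtr {
    ARGS *p;
    int accesses = 0;
    ARGS *operator->() { ++accesses; return p; } // runs on every arg-> below
};

int main() {
    ARGS args = {10, 2, 0};
    CountingPtr arg{&args};
    for (int i = arg->initValue; i < arg->endValue; i += arg->step) {
        // ...
    }
    std::printf("operator-> was invoked %d times\n", arg.accesses);
}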

rjp
  • currently this is for a program that will run on x86, but I'm asking about the more general issue. – SIMEL May 07 '14 at 17:04
  • It can't be generalized. It is architecture-dependent. x86 supports register+offset addressing, so for x86 it is the same. I would imagine the only time a struct would be slower would be on very simple microcontrollers (PIC, possibly 8051). – rjp May 07 '14 at 17:06
  • as 6502 noted, it will be the same provided that code under `...` can be proven not to change the structure. This is not architecture dependent. – j_kubik May 07 '14 at 17:26
  • @j_kubik Depending on the scope of the struct, if it is global, I believe any function calls would limit optimization, as 6502 points out as well. On any modern architecture, though, regardless of whether it needs to fetch it from memory or cache it in registers, it should perform the same. – rjp May 07 '14 at 17:34

The problem is that the two versions are not identical. If the code in the ... part modifies the values in arg, then the two options will behave differently (the "optimized" one will keep using the original step and end values, not the updated ones), as the sketch at the end of this answer shows.

If the optimizer can prove by looking at the code that this is not going to happen, then the performance will be the same, because moving loop-invariant loads out of loops is a common optimization today. However, if something in ... could potentially change the contents of the structure, the optimizer must be paranoid, and the generated code will reload the values from the structure at each iteration. How costly that is depends on the processor.

For example, if the arg pointer is received as a parameter and the code in ... calls any external function whose code is unknown to the compiler (including things like malloc), then the compiler must assume that the external code might know the address of the structure and might change the end or step values. In that case the optimizer is forbidden to move those loads out of the loop, because doing so would change the behavior of the code.

Even if it's obvious to you that malloc is not going to change the contents of your structure, this is not obvious at all to the compiler, for which malloc is just an external function that will be linked in at a later step.
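
A contrived sketch of that semantic difference; the doubling of step inside the loop body is purely illustrative:

struct ARGS { int endValue; int step; int initValue; };

int count_indirect(ARGS *arg) {
    int count = 0;
    for (int i = arg->initValue; i < arg->endValue; i += arg->step) {
        arg->step *= 2; // the mutation is visible to the loop header
        ++count;
    }
    return count;
}

int count_local(ARGS *arg) {
    int endValue = arg->endValue; // snapshots taken once
    int step = arg->step;
    int count = 0;
    for (int i = arg->initValue; i < endValue; i += step) {
        arg->step *= 2; // the mutation is NOT seen: the loop uses the copies
        ++count;
    }
    return count;
}

With ARGS{20, 1, 0}, count_indirect returns 4 (it runs for i = 0, 2, 6, 14) while count_local returns 20; the optimizer may only hoist the loads when it can prove the two forms behave identically.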

6502