
I'm currently developing a very fast algorithm, one part of which is an extremely fast scanner and statistics function. In this quest, I'm after any performance benefit. I'm therefore also interested in keeping the code "multi-thread" friendly.

Now for the question: I've noticed that putting some very frequently accessed variables and arrays into "global" (or "static local", which amounts to the same thing) storage yields a measurable performance benefit (in the range of +10%). I'm trying to understand why, and to find a solution, since I would prefer to avoid these storage classes. Note that I don't think the difference comes from "allocation", since allocating a few variables and a small array on the stack is almost instantaneous. I believe the difference comes from "accessing" and "modifying" the data.

In this search, I've found this old post on Stack Overflow: C++ performance of global variables

But I'm very disappointed by the answers there: very little explanation, mostly ranting about "you should not do that" (hey, that's not the question!), and some very rough statements like "it doesn't affect performance", which is obviously incorrect, since I'm measuring it with precise benchmark tools.

As said above, I'm looking for an explanation and, if one exists, a solution to this issue. So far, I have the feeling that calculating the memory address of a local (automatic) variable costs a bit more than that of a global (or local static) one. Maybe something like one extra ADD operation. But that doesn't help in finding a solution...
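To make the comparison concrete, here is a minimal sketch of the two variants being compared (the function names, buffer, and sizes are illustrative, not the real code being benchmarked):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: the same byte-histogram loop with two storage classes.
// Only the declaration of `stats` differs between the two versions.

void count_local(const unsigned char* p, std::size_t len, int out[256]) {
    int stats[256] = { 0 };                      // automatic storage (stack)
    for (std::size_t i = 0; i < len; ++i)
        stats[p[i]]++;
    for (int v = 0; v < 256; ++v) out[v] = stats[v];
}

void count_static(const unsigned char* p, std::size_t len, int out[256]) {
    static int stats[256];                       // static storage (fixed address)
    for (int v = 0; v < 256; ++v) stats[v] = 0;  // re-zero on every call
    for (std::size_t i = 0; i < len; ++i)
        stats[p[i]]++;
    for (int v = 0; v < 256; ++v) out[v] = stats[v];
}
```

Both versions produce identical results; timing each over a large buffer isolates the storage class of `stats` as the only variable.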

Cyan
  • Are you comparing the time it takes to recalculate the value each time against how long it takes to retrieve it from a global variable? Yes, that will always be faster. But there's no reason that a global variable would be faster than a local variable. I'm not sure what the *question* is here. – Cody Gray - on strike Mar 06 '11 at 13:38
  • I'm comparing the time it takes to collect statistics, using the same code, but one storing/updating results into a local structure, and another one using a global (or local static) structure. The "local static" wins, hands down, everything else being equal. – Cyan Mar 06 '11 at 13:48
  • It's hard to tell *anything* without seeing the actual code. We don't even know what mechanism you are replacing. – Alex B Mar 06 '11 at 13:58
  • Not sure if it is useful to "provide the source", but well, if an example is helpful, let's say this one: `int stats[256]; while (p<end) stats[*p++]++;` – Cyan Mar 06 '11 at 14:18
  • @Cyan: have you benchmarked that particular example you gave? Does it display the 10% performance difference? If not then it's not a useful example. Disassemble the real code, look at the part that accesses the variable, see if it could plausibly account for 10% of your runtime. If not, then there must be some subtle knock-on effect that's probably somewhat specific to your real code. If you can't produce an example that displays the issue it's going to be very hard for anyone but you to analyze its performance unless the cause is very obvious. Which it doesn't seem to be here :-) – Steve Jessop Mar 06 '11 at 14:45
  • Question: are you by any chance writing a profiler? – Mike Dunlavey Mar 06 '11 at 15:13
  • @Steve: the code I'm benchmarking is close to the one provided as an example; just add a bit of loop unrolling, some initializations and a few final tests, and there we are. Unfortunately, I'm not able to disassemble; this is beyond my current level. @Mike: no sir. I'm just building custom benchmarks for specific parts of the code. Nothing generic. – Cyan Mar 06 '11 at 20:34

4 Answers


It really depends on your compiler, platform, and other details. However, I can describe one scenario where global variables are faster.

In many cases, a global variable is at a fixed address. This allows the generated instructions to simply use that address directly. (Something along the lines of `MOV AX,[MyVar]`.)

However, if you have a variable that's relative to the current stack pointer, or a member of a class or array, some math is required to combine the base address with the offset and determine the address of the actual variable.
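A small sketch of the two storage classes side by side; the assembly in the comments is illustrative of typical x86 output, not guaranteed for any particular compiler:

```cpp
#include <cassert>

int g_counter = 0;   // global: address can be baked into the instruction,
                     // e.g. (illustrative x86):  inc DWORD PTR [g_counter]

int bump_both(int n) {
    int local = 0;   // automatic: addressed relative to the stack/frame pointer,
                     // e.g. (illustrative x86):  inc DWORD PTR [esp+4]
    for (int i = 0; i < n; ++i) {
        ++g_counter; // absolute (or RIP-relative) addressing
        ++local;     // base register + displacement addressing
    }
    return local;
}
```

Whether the displacement form costs anything extra depends on the addressing modes offered by the CPU.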

Obviously, if you need to place some sort of mutex on your global variable in order to keep it thread-safe, then you'll almost certainly more than lose any performance gain.
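One common workaround that keeps the hot path lock-free is to give each thread its own private accumulator and merge the partial results once at the end. A sketch (the function name and splitting scheme are illustrative, not from the question):

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: each worker fills its own private histogram (no locking on the
// hot path); the partial tables are summed once after the threads join.
void histogram_parallel(const unsigned char* data, std::size_t len,
                        int out[256], unsigned nthreads) {
    std::vector<std::vector<int>> parts(nthreads, std::vector<int>(256, 0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = len * t / nthreads;
            std::size_t end   = len * (t + 1) / nthreads;
            int* stats = parts[t].data();   // thread-private: no mutex needed
            for (std::size_t i = begin; i < end; ++i)
                stats[data[i]]++;
        });
    }
    for (auto& w : workers) w.join();
    for (int v = 0; v < 256; ++v) {
        int sum = 0;
        for (unsigned t = 0; t < nthreads; ++t) sum += parts[t][v];
        out[v] = sum;
    }
}
```

The merge is O(256 × nthreads), which is negligible next to scanning a large buffer.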

Jonathan Wood
  • I fully agree with your statement: a mutex just slows everything down. I even tried to allocate a few tables and select a "free" one when starting the function; it works, but it has the same performance penalty as local variables, so it is just useless complexity. – Cyan Mar 06 '11 at 13:45
  • "some math is required" - generally the difference would be that for the global or local static, the address could be a constant fixup. For an automatic variable the address would be a constant offset applied to the current stack pointer. Depending on the addressing modes offered by the CPU, this may or may not actually affect performance. – Steve Jessop Mar 06 '11 at 14:43
  • @Steve: Yes, that's why I said it depends on the compiler and platform. Note that I was thinking in terms of non-static local variables. I would assume most static local variables are stored similarly to global variables. – Jonathan Wood Mar 06 '11 at 14:48
  • @Jonathan: sorry, wasn't trying to disagree, just expand. Also, if the optimizer notices that it's doing `sp+187` a lot, and this is likely to cost time, then it can cache the value of `sp+187` in a register, or move the sp for a portion of the routine if the OS doesn't mind. These are the sorts of thing the questioner will need to check before concluding that your example applies in this case. – Steve Jessop Mar 06 '11 at 14:51
  • @Steve: Yes, for this level of optimization, a dump of the assembly language being produced by the compiler should be in order. – Jonathan Wood Mar 06 '11 at 14:53
  • There is no math required. x86 can do `mov reg, [esp +/- offset]`. It can even do indirect addressing via `mov reg, [esp + offset + 4*reg]`, etc. – Nils Pipenbrinck Mar 06 '11 at 16:38
  • @Nils: `esp + offset + 4 * reg` looks like math to me. Are you suggesting there is no case where a fixed address would be faster? – Jonathan Wood Mar 06 '11 at 16:56
  • It really **does** depend a lot on the hardware. For example, on x86_64 the global address will add 8 bytes to the instruction size; an offset from a register is smaller. I have seen cases where accessing members of a C++ class was faster than global variables, just because the address is loaded only once. – Bo Persson Mar 06 '11 at 17:05
  • OK, I'm using the MS Visual C++ compiler, targeting x86_32. I do understand that the effect mentioned may be different in another environment, such as x86_64. – Cyan Mar 06 '11 at 20:14
  • @Cyan: Do you not think this effect is the answer to your question? – Jonathan Wood Mar 06 '11 at 22:14
  • @Jonathan: yes, I do. Your answer is the best candidate so far, and Steve's comments are also very informative. However, it only provides some good insight into the "why"; so far, no work-around has been proposed... – Cyan Mar 06 '11 at 22:44
  • @Cyan: Feel free to mark whichever answer you think is best. However, I'm not sure there *is* a workaround. Some addressing modes are simply more efficient than others. If you really want to narrow this down, export an assembly listing from your compiler and see exactly what is happening; then see if you can tweak that a bit. But the relative speed of the addressing modes themselves cannot be worked around. – Jonathan Wood Mar 06 '11 at 22:54

Creating local variables can be literally free if they are POD types. You are likely overflowing a cache line with too many stack variables, or hitting some similar alignment-based cause that is very specific to your piece of code. I usually find that non-local variables significantly decrease performance.
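If alignment is the suspect, C++11's `alignas` can pin the local array to a 64-byte boundary, so the measurement is not skewed by wherever the stack frame happens to land. A sketch (the function is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch: force the local histogram onto a 64-byte (cache line) boundary,
// so results don't depend on where the stack frame happens to land.
int first_byte_count(const unsigned char* p, unsigned len) {
    if (len == 0) return 0;
    alignas(64) int stats[256] = { 0 };
    // Check that the requested alignment actually held:
    assert(reinterpret_cast<std::uintptr_t>(stats) % 64 == 0);
    for (unsigned i = 0; i < len; ++i)
        stats[p[i]]++;
    return stats[p[0]];   // count of the first byte's value, as a sample result
}
```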

Puppy
  • Your statement is correct in most "normal" circumstances, but here I'm benchmarking differences in favor of non-local variables. So it is not a matter of "if it happens", because it does. The question is "why". – Cyan Mar 06 '11 at 14:27
  • @Cyan: I believe that I may have suggested an answer, such as cache line overflowing. Please do read the *whole* answer. – Puppy Mar 06 '11 at 14:55
  • @Jens Gustedt: Plain Old Data, such as ints, etc. – George Mar 06 '11 at 16:35
  • don't feel offended! Yes, I've read your answer, and I already agree that no difference comes from allocation. Now, your comment about cache lines is not so clear to me. The amount of data created (~3K) on the stack is well below the L1 cache size (~32K), so it should fit entirely in L1. I don't know how to check whether it is cache-line aligned (i.e. at an address that is a multiple of 64), nor if that would make any difference. In fact, I've made a few tests on this (by making one of the tables a little bit bigger or smaller), and it does not seem to help. – Cyan Mar 06 '11 at 20:42
  • Vars are heavy, please don't joke like that –  Oct 30 '21 at 21:28

It's hard to beat static allocation for speed, and while 10% is a pretty small difference, it could well be due to address calculation.

But if you're looking for speed, your example in a comment while(p<end)stats[*p++]++; is an obvious candidate for unrolling, such as:

static int stats[M];
static int index_array[N];
int *p = index_array, *pend = p+N;
// ... initialize the arrays ...
while (p + 8 <= pend){
  stats[p[0]]++;
  stats[p[1]]++;
  stats[p[2]]++;
  stats[p[3]]++;
  stats[p[4]]++;
  stats[p[5]]++;
  stats[p[6]]++;
  stats[p[7]]++;
  p += 8;
}
while(p<pend) stats[*p++]++;

Don't count on the compiler to do it for you. It might or might not be able to figure it out.

Other possible optimizations come to mind, but they depend on what you're actually trying to do.
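One optimization often applied to exactly this kind of byte-counting loop (a sketch, not something proposed in this thread) is splitting the counts across several tables, so that a run of identical bytes doesn't hammer one counter back-to-back:

```cpp
#include <cassert>
#include <cstddef>

// Sketch: four partial tables; consecutive bytes update different tables, so
// back-to-back increments of the same value hit independent counters.
void count_bytes4(const unsigned char* p, std::size_t len, int out[256]) {
    int s0[256] = {0}, s1[256] = {0}, s2[256] = {0}, s3[256] = {0};
    std::size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        s0[p[i]]++; s1[p[i + 1]]++; s2[p[i + 2]]++; s3[p[i + 3]]++;
    }
    for (; i < len; ++i)   // leftover bytes
        s0[p[i]]++;
    for (int v = 0; v < 256; ++v)
        out[v] = s0[v] + s1[v] + s2[v] + s3[v];
}
```

The extra tables cost 3 KB of stack and a final 256-entry merge, which is negligible against a large input buffer.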

Mike Dunlavey
  • Thanks Mike, good advice. In fact, I'm already doing unrolling; that's why I said this was just an "example". The full code is a bit more complex, and wouldn't bring anything useful to this discussion. – Cyan Mar 06 '11 at 20:10
  • @Cyan: The kinds of questions that come to mind are: Why the indirection array? How often does the setup change? Could this be a candidate for precompiling? Stuff like that. – Mike Dunlavey Mar 07 '11 at 19:37
  • @Mike: Interesting points. What do you call the "indirection array"? Pre-compiling is a good optimization route. That's why I like the idea of templates, which are one way to pre-compile one of the arguments (as long as it has a finite number of values). – Cyan Mar 07 '11 at 21:20
  • @Cyan: Well in this case you're indexing `stats` by first indexing another array `index_array` and you're looping through that. That implies you want to only increment a subset of `stats` (because obviously the order doesn't matter). If setting up `index_array` is only done at low frequency, you might also do it by simply printing out a function to increment the desired members of `stats`, compile & link a dll on the fly, dynamically load it, and do your incrementing *really fast*. That's what I mean about what are you really trying to do? – Mike Dunlavey Mar 07 '11 at 22:37
  • OK; index_array is specific to your example, and does not reflect the code I'm working on. In my case, we start with a buffer, defined as a char* and a length. Then we go through the buffer, counting the char values in it; therefore, the values are between 0 and 255. – Cyan Mar 07 '11 at 23:42
  • @Cyan: Sounds like you've got it about right. Now, I believe in a particularly aggressive form of performance tuning. [Here's an example.](http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773) Basically it involves an iteration of tuning steps of a kind that some people find natural, but most people find strange. Anyway, that's what I would do, in your shoes. You may already be close to optimal, but that would ensure it. – Mike Dunlavey Mar 08 '11 at 02:14
  • That's very interesting reading, Mike. And yes, I'm after such gains. I can be very thorough in my testing, and loop many times to improve on each iteration. This sometimes ends up with stories such as yours, which made very pleasant reading. Best regards – Cyan Mar 08 '11 at 07:29

If you have something like

int stats[256]; while (p<end) stats[*p++]++;

static int stats[256]; while (p<end) stats[*p++]++;

you are not really comparing the same thing, because in the first instance you are not initializing your array. Written explicitly, the second line is equivalent to

static int stats[256] = { 0 }; while (p<end) stats[*p++]++;

So, for a fair comparison, you should have the first one read

 int stats[256] = { 0 }; while (p<end) stats[*p++]++;

Your compiler might be able to deduce much more if it has the variables in a known state.

Now, there could be a runtime advantage in the static case, since the initialization is done at compile time (or at program startup).

To test whether this accounts for your difference, you should run the function with the static declaration and the loop several times, and see if the difference vanishes as the number of invocations grows.
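A rough harness for that experiment might look like this (a sketch; `time_calls` is a hypothetical helper built on `std::chrono`, not from the thread):

```cpp
#include <cassert>
#include <chrono>

// Rough harness: time many invocations of a function so that any one-time
// cost (e.g. first-call initialization of a static) is amortized away.
template <typename F>
double time_calls(F f, int calls) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i)
        f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();  // seconds
}
```

Call it once with the static variant and once with the automatic variant, over a growing number of invocations, and see whether the gap closes.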

But as others have said already, the best approach is to inspect the assembler your compiler produces, to see what differences there actually are in the generated code.

Jens Gustedt
  • OK, both versions are initialized on entering the function. I have not transcribed this part in the example, but it starts with a simple `for (i=0; i<256; i++) stats[i]=0;`. This is the same for both, so there is no "initialization" difference. – Cyan Mar 06 '11 at 20:04