
My program allocates all of its resources, totaling slightly below 1MB, at startup and nothing more afterwards, apart from primitive local variables. The allocation was originally done with malloc, so on the heap, but I wondered whether it would make any difference to put the objects on the stack instead.

In various tests, with program runtimes from 3 seconds to 3 minutes, accessing the stack consistently appears to be faster, by up to 10%. All I changed was whether to malloc the structs or to declare them as automatic variables.

Another interesting fact I found is that when I declare the objects as static, the program runs 20~30% slower. I have no idea why. I double-checked whether I had made a mistake, but the only difference really is whether static is there or not. Do static variables go somewhere else on the stack than automatic variables?

Earlier I had quite the opposite experience: in a C++ class, when I changed a const member array from non-static to static, the program ran faster. Memory consumption was the same, because there was only one instance of that object.

Is program runtime affected by where the objects reside in memory? Even if so, can't the compiler manage to place the objects in the right places for maximum efficiency?

  • Did you try your tests with various optimization levels? – Serge Ballesta Aug 24 '15 at 10:22
  • @SergeBallesta With gcc 4.9.2, the optimization option I use is always `-flto -fwhole-program -s -O3`. –  Aug 24 '15 at 10:24
  • `malloc` has to find the next sufficiently large memory block and mark it as taken. The stack doesn't. The `static` slowdown could be because of caching issues, impossible to tell without the code. – nwp Aug 24 '15 at 10:26
  • @nwp As I said in the post `malloc` is called only once (per object) at startup. The call to `malloc` itself doesn't make any significant difference. –  Aug 24 '15 at 10:31
  • Local `static` variables reside on the heap just like global variables. – JimmyB Aug 24 '15 at 10:33
  • Global automatic structs are referenced by their constant, known-at-compile-time address, whereas malloc'd memory is always accessed indirectly via pointers. – JimmyB Aug 24 '15 at 10:36
  • When I read the question title, one thing immediately came to my mind: NUMA ..but sadly, the question does not seem to be about it :) – quetzalcoatl Aug 24 '15 at 10:40
  • *"Local `static` variables reside on the heap just like global variables."* - wrong on both counts (and two upvotes, weird)... see https://en.wikipedia.org/wiki/Data_segment – Tony Delroy Aug 24 '15 at 10:45
  • Overlapping question: http://stackoverflow.com/questions/24057331/is-accessing-data-in-the-heap-faster-than-from-the-stack/24057744#24057744 – Tony Delroy Aug 24 '15 at 10:48
  • @HannoBinder *"Global automatic structs"* is a contradiction in terms - *automatic* storage duration [basic.stc.auto 3.7.3] is the term used for stack based data, never globals which have static storage duration by default (and thread storage duration if marked up for such). – Tony Delroy Aug 24 '15 at 10:53
  • @TonyD Yes, I saw that question, but the answer is much different from what I actually see. It only talks about the overhead of dynamic allocation itself, which is really irrelevant to my situation. I see a performance difference from memory *access*, about which the answer at your link says there is no such difference. –  Aug 24 '15 at 10:55
  • @xiver77: just a word of caution - doing meaningful benchmarking is not as easy as it might seem, and it's quite possible you're generating misleading timings - misinterpreting their import. With you not showing your code, I'm inclined to say that's more likely than the performance differences being real. For example: alternate timing one approach with the other a few times and see how stable/consistent the numbers are after caches/buffers have warmed up; and beware timing with `QueryPerformanceCounter` or `RDTSC` unless you've set thread affinity.... – Tony Delroy Aug 24 '15 at 11:04
  • @TonyD Well I really can't post several thousand lines of code here. It's not easy to reproduce the result with small sample code. I've already done a lot of testing with enough care. –  Aug 24 '15 at 11:09
  • @xiver77: well, it's up to you to decide whether to do the work to minimally reproduce the problem, but for now I'll be voting to close as off topic due to lack of code to reproduce the problem. There are just too many subtle performance factors you could easily be unaware of (e.g. `zero initialisation` before default construction performed by `new T()`) - and we can't discuss them with you in a targeted fashion without code. – Tony Delroy Aug 24 '15 at 11:21
  • Always a good read: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf – dau_sama Aug 24 '15 at 12:01
  • @TonyD: That was my bad. I originally misread your comment. – code_dredd Aug 25 '15 at 02:16

2 Answers


Well, yeah, program performance is affected by where objects reside in memory.

The problem is, unless you have intimate knowledge of how your compiler works and how it uses the features of your particular host system (operating system services, hardware, processor cache, etc.), and how those things are configured, you will not be able to exploit this consistently. Even if you do succeed, small changes (e.g. upgrading the compiler, changing optimisation settings, a change of process quotas, a change in the amount of physical memory, changing a hard drive [e.g. one used for swap space]) can affect performance, and it won't always be easy to predict whether a change will improve or degrade performance. Performance is sensitive to all these things - and to the interactions between them - in ways that are not always obvious without close analysis.

Peter

Is program runtime affected by where the objects reside in memory?

Yes, program performance will be affected by the location of an object in memory, among other factors.

Whenever an object on the heap is accessed, it's done by dereferencing a pointer. Dereferencing a pointer requires an additional memory access to obtain the target address, and every additional level of indirection between you and your data (e.g. pointer -> pointer -> actual data) makes it slightly worse. On top of this, it tends to increase the CPU's cache miss rate, because the data it expects to access next is not necessarily in contiguous memory blocks. In other words, the assumption the CPU made to try to optimize its pipeline turns out to be false, and it pays the penalty of an incorrect prediction.

Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?

A C or C++ compiler will place objects wherever you tell it to. If you're using malloc, you shouldn't expect the compiler to put things on the stack, and vice versa.

The compiler couldn't do this even in principle, because malloc allocates memory dynamically at runtime, long after the compiler's job has ended. What the compiler does lay out at compile time is each function's stack frame; how the contents of dynamically allocated memory are organized depends entirely on you.

You might get some benefit from compiler optimization settings, but most of your optimization effort is better spent improving the data structures and/or algorithms being used.

code_dredd
  • "Whenever an object in the heap is accessed, it's done by dereferencing a pointer." Yes indeed, but the same is true of objects on the stack. – TonyK Aug 24 '15 at 18:04
  • @TonyK Then what is likely to make the difference in access time, when pointer dereferencing is still required for stack objects? Your statement seems obvious, though, in that large objects will simply not fit in registers. –  Aug 24 '15 at 18:15
  • As I mentioned in the post and comments, I am not playing with compilation options, but rather with where my initial 1MB of resources belongs best. On the stack? The heap? The data segment? –  Aug 24 '15 at 18:17
  • @xiver77: It might be a consequence of your specific machine architecture. Perhaps there are more options available for indexing through the Stack Pointer than through a general pointer. But really all you can do is try the different options and benchmark them carefully. – TonyK Aug 24 '15 at 18:52
  • @TonyK: "the same is true of objects on the stack". This is not necessarily true. Consider `int *p = (int*) malloc(sizeof(int)); *p = 5;` vs `int i = 5;`. Obviously `p` has a heap address that needs dereferencing to access the value while the value in `i` is in the stack and no dereferencing is required b/c `i` is not a pointer. However, nothing prevents you from doing `int i = 5; int *p = &i; *p = 6;`, but using pointers to access stack data is probably not their best use (i.e. in complicated code, someone might try to `free` it by mistake). – code_dredd Aug 24 '15 at 21:03
  • The OP said "slightly below 1MB", so your comment is irrelevant. – TonyK Aug 24 '15 at 22:52
  • @ray I de-selected your answer because of your last comment. `int i = 5` will not require memory access (pointer dereferencing) because the value will most likely be stored in a register. Any larger object that does not fit in the register will always require memory access through a stack pointer or through an arbitrary address returned by `malloc`. –  Aug 25 '15 at 21:34
  • @xiver77: It will need to be fetched before it finds its way into the register, so it does require memory access at some point or another, and I explicitly mentioned that `i` is not a pointer, so it does not need to be dereferenced. I can clarify my example if needed. – code_dredd Aug 25 '15 at 22:30
  • @xiver77: Also, for the particular example I mentioned, `i` is simply an `int` and not a large object. Obviously, if you change the example as you've done, more things might need to change in my description. – code_dredd Aug 25 '15 at 22:33