There is a small effect here. It's unlikely to be significant, but it is real. It will often take one extra instruction to resolve the extra level of indirection for a global pointer-to-a-buffer instead of a global array. For most uses, other considerations will be more important (like reuse of the same scratch space, vs giving each function its own scratch buffer). Also: avoiding compile-time size limits!
This effect is only present when you reference the global directly, rather than passing around the address as a function parameter. Inlining / whole-program link-time optimization may see all the way back to where the global is used as a function arg initially, and be able to take advantage of it throughout more code, though.
Let's compare simple test functions, compiled by clang 3.7 for x86-64 (SystemV ABI, so the first arg is in rdi
). Results on other architectures will be essentially the same:
int data_s[65536];
int *data_p;
void store_s(int val) { data_s[val] = val; }
movsxd rax, edi ; sign-extend
mov dword ptr [4*rax + data_s], eax
ret
void store_p(int val) { data_p[val] = val; }
movsxd rax, edi
mov rcx, qword ptr [rip + data_p] ; the extra level of indirection
mov dword ptr [rcx + 4*rax], eax
ret
Ok, so there's overhead of one extra load. (mov r64, [rel data_p]
). Depending on what other static/global objects data_p
is stored near, it may tend to stay hot in cache even if we're not using it often. If it's in a cache line with no other frequently-accessed data, it's wasting most of that cache line, though.
The overhead is only paid once per function call, though, even if there's a loop. (Unless the array is an array of pointers, since C aliasing rules require the compiler to assume that any pointer might be pointing to data_p
, unless it can prove otherwise. This is the main performance danger when using global pointers-to-pointers.)
If you don't use restrict
, the compiler still has to assume that the buffers pointed to by int *data_p1
and int *data_p2
could overlap, though, which interferes with autovectorization, loop unrolling, and many other optimizations. Static buffers can't overlap with other static buffers, but restrict
is still needed when using a static buffer and a pointer in the same loop.
Anyway, let's have a look at the code for very simple memset-style loops:
void loop_s(int val) { for (int i=0; i<65536; ++i) data_s[i] = val; }
mov rax, -262144 ; loop counter, counting up towards zero
.LBB3_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + data_s+262144], edi
add rax, 4
jne .LBB3_1
ret
Note that clang uses a non-RIP-relative effective address for data_s
here, because it can.
void loop_p(int val) { for (int i=0; i<65536; ++i) data_p[i] = val; }
mov rax, qword ptr [rip + data_p]
xor ecx, ecx
.LBB4_1: # =>This Inner Loop Header: Depth=1
mov dword ptr [rax + 4*rcx], edi
add rcx, 1
cmp rcx, 65536
jne .LBB4_1
ret
Note the mov rax, qword [rip + data_p]
in loop_p
, and the less efficient loop structure because it uses the loop counter as an array index.
gcc 5.3 has much less difference between the two loops: it gets the start address into a register and increments it, with a compare against the end address. So it uses a one-register addressing mode for the store, which may perform better on Intel CPUs. The only difference in loop structure / overhead for gcc is that the static buffer version gets the initial pointer into a register with a mov r32, imm32
, rather than a load from memory. (So the address is an immediate constant embedded in the instruction stream.)
In shared-library code, and on OS X, where all executables must be position-independent, gcc's way is the only option. Instead of mov r32, imm32
to get the address into a register, it would use lea r64, [RIP + displacement]
. The opportunity to save an instruction by embedding the absolute address into other instructions is gone when you need to offset the address by a variable amount (e.g. array index). If the array index is a compile-time constant, it can be folded into the displacement for a RIP-relative load or store instead of LEA. For a loop over an array, this could only happen with full unrolling, and is thus unlikely.
Still, the extra level of indirection is still there with a pointer to dynamically allocated memory. There's still a chance of a cache or TLB miss when doing a load instead of an LEA
.
Note that global data (as opposed to static
) has an extra level of indirection through the global offset table, but that's on top of the indirection or lack thereof from dynamic allocation. compiling with gcc 5.3 -fPIC
.
int global_data_s[65536];
int access_global_s(int i){return global_data_s[i];}
mov rax, QWORD PTR global_data_s@GOTPCREL[rip] ; load from a RIP-relative address, instead of an LEA
movsx rdi, edi
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
int *global_data_p;
int access_global_p(int i){return global_data_p[i];}
mov rax, QWORD PTR global_data_p@GOTPCREL[rip] ; extra layer of indirection through the GOT
movsx rdi, edi
mov rax, QWORD PTR [rax] ; load the pointer (the usual layer of indirection)
mov eax, DWORD PTR [rax+rdi*4] ; load, indexing the array
ret
If I understand this correctly, the compiler doesn't assume that the symbol definition for global symbols in the current compilation unit are the definitions that will actually be used at link time. So the RIP-relative offset isn't a compile-time constant. Thanks to runtime dynamic linking, it's not a link-time constant either, so an extra level of indirection through the GOT is used. This is unfortunate, and I hope compilers on OS X don't have this much overhead for global variables. With -O0 -fwhole-program
on godbolt, I can see that even the globals are accessed with just RIP-relative addressing, not through the GOT, so perhaps that option is even more valuable than usual when making position-independent executables. Hopefully link-time-optimization works too, because that could be used when making shared libraries.
As many other answers have pointed out, there are other important factors, like memory locality, and the overhead of actually doing the allocate/free. These don't matter much for a large buffer (multiple pages) that's allocated once at program startup. See the comments on Peter A. Schneider's answer.
Dynamic allocation can give significant benefits, though, if you end up using the same memory as scratch space for multiple different things, so it stays hot in cache. Giving each function its own static buffer for scratch space is often a bad move if they aren't needed simultaneously: the dirty memory has to get written back to main memory when it's no longer needed, and is part of the program's footprint forever.
A good way to get small scratch buffers without the overhead of malloc
(or new
) is to create them on the stack (e.g. as local array variables). C99 allows variable-sized local arrays, like foo(int n) { int buf[n]; ...; }
Be careful not to overdo it and run out of stack space, but the current stack page is going to be hot in the TLB. The _local
functions in my godbolt links allocate a variable-sized array on the stack, which has some overhead for re-aligning the stack to a 16B boundary after adding a variable size. It looks like clang takes care to mask off the sign bit, but gcc's output looks like it will just break in fun and exciting ways if n
is negative. (In godbolt, use the "binary" button to get disassembler output, instead of the compiler's asm output, because the disassembly uses hex for immediate constants. e.g. clang's movabs rcx, 34359738352
is 0x7fffffff0
). Even though it takes a few instructions, it's much cheaper than malloc
. A medium to large allocation with malloc
, like 64kiB, will typically make an mmap
system call. But this is the cost of allocation, not the cost of accessing once allocated.
Having the buffer on the stack means the stack pointer itself is the base address for indexing into it. This means it doesn't take an extra register to hold that pointer, and it doesn't have to be loaded from anywhere.
If any elements are statically initialized to non-zero in a static (or global), the entire array or struct will be sitting there in the executable, which is a big waste of space if most entries should be zero at program startup. (Or if the data can be computed on the fly quickly.)
On some systems, having a gigantic array zero-initialized static array doesn't cost you anything as long as you never even read the parts you don't need. Lazy mapping of memory means the OS maps all the pages of your giant array to the same page of zeroed memory, and does copy-on-write. Taking advantage of this would be an ugly performance hack to be used only if you were sure you really wanted it, and were sure your code would never run on a system where your char data[1<<30]
would actually use that much memory right away.
Each memory access is equivalent to about 40 CPU clock cycles.
This is nonsense. The latency can be anywhere from 3 or 4 cycles (L1 cache hit) to multiple hundreds of cycles (main memory), or even a page fault requiring a disk access. Other than a page fault, much of this latency can overlap with other work, so the impact on throughput can be much lower. A load from a constant address can start as soon as the instruction issues into the out-of-order core, since it's the start of a new dependency chain. The size of the out-of-order window is limited (an Intel Skylake core has a Re-Order Buffer of 224 uops, and can have 72 loads in flight at once). A full cache miss (or worse, a TLB miss followed by a cache miss) often does stall out-of-order execution. See http://agner.org/optimize/, and other links in the x86 wiki. Also see this blog post about the impact of ROB size on how many cache misses can be in flight at once.
Out-of-order cores for other architectures (like ARM and PPC) are similar, but in-order cores suffer more from cache misses because they can't do anything else while waiting. (Big x86 cores like Intel and AMD's mainstream microarchitectures (not the low-power Silvermont or Jaguar microarchitectures) have more out-of-order execution resources than other designs, though. AFAIK, most ARM cores have significantly smaller buffers for starting independent loads early and/or hiding cache-miss latency.)