16

There are two ways of allocating a global array in C:

  1. statically

    char data[65536];
    
  2. dynamically

    char *data;
    …
    data = (char*)malloc(65536);  /* or whatever size */
    

The question is, which method has better performance? And by how much?

As I understand it, the first method should be faster.

Because with the second method, to access an element of the array you have to dereference the pointer each time, like this:

  1. read the variable data which contains the pointer to the beginning of the array
  2. calculate the offset to the specific element
  3. access the element

With the first method, the compiler hard-codes the address of the data variable into the code, so the first step is skipped and we have:

  1. calculate the offset to the specific element from the fixed address defined at compile time
  2. access the element of the array

Each memory access is equivalent to about 40 CPU clock cycles, so using dynamic allocation, especially for infrequent reads, can cause a significant performance decrease versus static allocation, because the data variable may be purged from the cache by some more frequently accessed variable. On the contrary, the cost of dereferencing a statically allocated global variable is 0, because its address is already hard-coded in the code.

Is this correct?

5gon12eder
Nulik
  • 1
    Sounds about right. Although with position-independent code (which is currently in fashion) there's an extra calculation in both of those. – user253751 Jan 18 '16 at 05:12
  • 4
    Profile it and find out. – GManNickG Jan 18 '16 at 05:13
  • 1
    "*can have significant performance decrease*" This contradicts your assumption ("*not frequent reads*"). It will only have a significant performance decrease if you access it frequently enough for it to be a significant amount of your execution time. But if that's the case, then the address will be in cache anyway, which means its impact will be decreased. In theory the statically allocated array should be faster (especially if it rules out aliasing in the optimizer), but whether or not it's *significant* is dependent on a lot of things. – Cornstalks Jan 18 '16 at 05:16
  • 3
    [Do not cast the result of `malloc` in C](http://stackoverflow.com/q/605845/995714) – phuclv Jan 18 '16 at 05:21
  • What does casting have to do with memory allocation? – Nulik Jan 18 '16 at 05:51
  • @Nulik It has no consequences on the generated code and is not an error per se. However, it can *hide* bugs that your compiler would otherwise tell you about, and it is not needed in C anyway, so you can also save typing. – 5gon12eder Jan 18 '16 at 05:55
  • The answer might depend on whether you have VA randomization and/or PIC enabled. Not entirely sure, but there may be some platform-specific effects on your assumptions. – Jeff Hammond Jan 18 '16 at 06:45
  • I'm quite confident that the additional(?) address computation is not noticeable in most cases simply because of good pipelining. – Hagen von Eitzen Jan 18 '16 at 12:54
  • Possible duplicate of [Which is faster: Stack allocation or Heap allocation](http://stackoverflow.com/questions/161053/which-is-faster-stack-allocation-or-heap-allocation) – edmz Jan 18 '16 at 13:19
  • 1
    @black no, it is not a duplicate. Stack allocation is not global. – Nulik Jan 18 '16 at 13:49
  • @Nulik It is not, but conceptually and at the assembly level they're essentially the same, i.e. plain `mov`-s vs. system calls and user-space helpers. We'd be repeating the same thing. – edmz Jan 18 '16 at 14:01

5 Answers

17

One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536] and char *data_p = malloc(65536).


If the access is sporadic the static/global will be slightly faster:

// slower because we must fetch data_p and then store
void
datasetp(int idx,char val)
{

    data_p[idx] = val;
}

// faster because we can store directly
void
datasets(int idx,char val)
{

    data_s[idx] = val;
}

Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.


However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:

void
datasetalls(char val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data_s[idx] = val;
}

void
datasetallp(char val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data_p[idx] = val;
}

void
datasetallp_optimized(char val)
{
    int idx;
    register char *reg;

    // the optimizer will generate the equivalent code to this
    reg = data_p;

    for (idx = 0;  idx < 65536;  ++idx)
        reg[idx] = val;
}

If the access is so sporadic that data_p gets evicted from the cache, then the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, it's not a target for code tuning.

If such eviction occurs, the actual data array will, most likely, be evicted as well.

A much larger array might have more of an effect (e.g. if instead of 65536, we had 100000000, then mere traversal will evict data_p, and by the time we reached the end of the array, the leftmost entries would already be evicted).

But, in that case, the extra fetch of data_p would be 0.000001% overhead.

So, it helps to either benchmark [or model] the particular use case/access pattern.


UPDATE:

Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.

Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.

If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
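
For reference, a minimal sketch of that change (assuming C99; the function name is hypothetical):

// with restrict, the compiler may hoist the fetch of data_p out of the loop,
// just like the hand-optimized version above
char *restrict data_p;

void
datasetallp_restricted(char val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data_p[idx] = val;
}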


Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:

// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;

// closer to real world
struct mystruct {
    ...
    char *data_p;
    ...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;

These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].


For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".


Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:

// does fetch of data_p once at loop start
int
datasumallp(void)
{
    int idx;
    int sum;

    sum = 0;
    for (idx = 0;  idx < 65536;  ++idx)
        sum += data_p[idx];

    return sum;
}

But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.

Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.

// array_t stands for the element type (assumed here, e.g. the char buffers above)
typedef unsigned char array_t;

void
dataincallx(array_t *data,int val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data[idx] = val + idx;
}

This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].


So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.

It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.

This is use case dependent.

If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.

However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.

Craig Estey
  • 1
    If it's an array of pointers, looping over `data_p[i]` might have to re-load the pointer every iteration because the aliasing rules require the compiler *not* to assume that `data_p[idx]` doesn't point to `&data_p`. Your "optimized" version avoids that problem. See http://goo.gl/bSTuKR for an example (with silly casts from char to `int*` after copy/pasting your code). clang can vectorize the loops other than the `data_p` loop, where it does in fact re-load `data_p` after every store. That doesn't happen with `uint64_t`, though, because even pointer-size ints can't alias pointers. – Peter Cordes Jan 18 '16 at 10:44
  • How can storing to `data_p [idx]` change the value of `data_p` thus requiring reloading on every store? – JDługosz Jan 18 '16 at 12:03
  • @JDługosz Storing to `*data_p[idx]` can change `data_p`, or even reading `data_p[idx]` and passing it as an argument to a function can do it – ciamej Jan 18 '16 at 12:09
  • But that's not what he's doing. Neither is the code you are showing that reloads each time. `data_p [idx] = val;`. – JDługosz Jan 18 '16 at 12:11
  • @JDługosz Hmm, this can happen when `data_p < &data_p < data_p + n` – ciamej Jan 18 '16 at 12:15
  • @JDługosz: It can only happen with an array of pointers, e.g. `int **data_p`, not with `int *data_p` like the OP's example. See the godbolt link in my first comment. C allows pointers of different types to alias each other, so a store to an `int *` might affect the value of an `int **`. – Peter Cordes Jan 18 '16 at 22:09
  • Craig: I started with a copy/paste of your code for my answer, so thanks for the starting point. I took a slightly different approach to explaining the overhead than your answer, with x86 examples. – Peter Cordes Jan 18 '16 at 22:41
  • @PeterCordes Use the code with my compliments. Based on your comment about aliasing, I did some experimentation and added an update to my answer. In your answer, I'm glad you added the part about out-of-order, etc. I was considering that yesterday, myself, as I also felt the "40 cycles" was ... – Craig Estey Jan 19 '16 at 01:59
5

Calculating offsets is not a big performance issue. You have to consider how you will actually use the array in your code. You'll most likely write something like data[i] = x; and then no matter where data is stored, the program has to load a base address and calculate an offset.

The scenario where the compiler can hard-code the address in the case of the statically allocated array only happens when you write something like data[55] = x; which is probably a far less likely use case.
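
For illustration, a hypothetical pair of functions (assuming the char array from the question):

char data[65536];

/* base address + runtime offset, whether data is a static array or a pointer */
void set_var(int i, char x) { data[i] = x; }

/* the rarer case: the address can be folded into the instruction at compile time */
void set_const(char x) { data[55] = x; }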

At any rate we are talking about a few CPU ticks here and there. It's not something you should go chasing by attempting manual optimization.

Each memory access is equivalent to about 40 CPU clock cycles

What!? What CPU is that? Some pre-ancient computer from 1960?

Regarding cache memory, those concerns may be more valid. It is possible that statically allocated memory utilizes data cache better, but that's just speculation and you'd have to have a very specific CPU in mind to have that discussion.


There is however a significant performance difference between static and dynamic allocation, and that is the allocation itself. For each call to malloc there is a call into the allocation library (and possibly the OS), which in turn runs a search function going through the heap and looking for a free segment. The library also needs to keep track of the address of that segment internally, so that when you call free() it knows how much memory to release. Also, the more you call malloc/free, the more fragmented the heap will become.
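
A common way to sidestep that cost (a minimal sketch; the function names are made up) is to allocate once at startup, so malloc/free never show up in hot code:

#include <stdlib.h>

static char *data;

int data_init(void)
{
    data = malloc(65536);
    return data != NULL;   /* caller must check for allocation failure */
}

void data_shutdown(void)
{
    free(data);
    data = NULL;
}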

Lundin
  • 1
    Static allocation can give a code-size advantage on x86, where the base+offset calculation can be done by embedding the address of `data` as a displacement in the effective address. e.g. `mov [data + rdx], eax`, if `i` is in `rdx` and `int x` is in `eax`. If `data` is a pointer, rather than an array, it would take a sequence like `mov rcx, [data]` / `mov [rcx + rdx], eax`. So the compiler can hard-code the address regardless of the index being a compile-time constant, on x86. – Peter Cordes Jan 18 '16 at 10:33
  • 2
    Also, 40 cycles for a memory access is maybe about right for a hit in L3 cache. But that's just latency, which might not matter if it's not part of a long dependency chain. And loading the `data` pointer can start early, because the address is a constant. But anyway, 40 cycles is a weird number to just make up, since an L1 hit is more like 4 cycles, while main memory is hundreds of cycles. – Peter Cordes Jan 18 '16 at 10:35
  • 1
    Many CPU cycles for memory access are rather an indication of short CPU cycles, i.e., a fast and modern CPU (compared to the memory) – Hagen von Eitzen Jan 18 '16 at 12:52
4

I think that data locality is much more of an issue than computing the base address of the array. (I could imagine cases where accessing the pointer contents is extremely fast because it is in a register while the offset to the stack pointer or text segment is a compile time constant; accessing a register may be faster.)

But the real issue will be data locality, which is often a reason to be careful with dynamic memory in performance-critical tight loops. If you have more dynamically allocated data which happens to be close to your array, chances are the memory will remain in the cache. If you have data scattered all over RAM, allocated at different times, you may have many cache misses accessing it. In that case it would be better to allocate the pieces statically (or on the stack) next to each other, if possible.
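
A minimal sketch of that idea (the struct and names are hypothetical): one allocation keeps related buffers adjacent, instead of scattering them across the heap with separate malloc calls made at different times.

#include <stdlib.h>

struct workspace {
    char  samples[65536];
    float weights[1024];
};

static struct workspace *ws;

/* one contiguous block instead of several scattered allocations */
int workspace_init(void)
{
    ws = malloc(sizeof *ws);
    return ws != NULL;
}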

Peter - Reinstate Monica
  • cache lines are only 64B. Page locality matters for the TLB, though, and accessing adjacent cache lines helps the prefetcher a lot. – Peter Cordes Jan 18 '16 at 19:17
  • @PeterCordes Yes, but the overall cache size is in the kB (L1/2) or MB(L3) range. – Peter - Reinstate Monica Jan 18 '16 at 20:28
  • I guess you're thinking of really tiny scattered allocations, wasting most of the cache line. Since the OP was talking about decent-sized arrays, this won't be a problem. If your allocations are multiples of the cache-line size, the caches can still hold as much of your data (unless the addresses alias each other, e.g. they're all separated by 4k, since real caches have limited associativity.) Scattered small allocations = very bad. Scattered medium-size allocations: bad for TLB and prefetch. Scattered large allocations: might stop opportunistic hugepage coalescing. – Peter Cordes Jan 18 '16 at 20:38
  • 1
    @PeterCordes You are right, I had the typical case of many small allocations all over the place in the back of my head, but that is not the scenario here with 65k. – Peter - Reinstate Monica Jan 18 '16 at 21:06
3

There is a small effect here. It's unlikely to be significant, but it is real. It will often take one extra instruction to resolve the extra level of indirection for a global pointer-to-a-buffer instead of a global array. For most uses, other considerations will be more important (like reuse of the same scratch space, vs giving each function its own scratch buffer). Also: avoiding compile-time size limits!

This effect is only present when you reference the global directly, rather than passing around the address as a function parameter. Inlining / whole-program link-time optimization may see all the way back to where the global is used as a function arg initially, and be able to take advantage of it throughout more code, though.


Let's compare simple test functions, compiled by clang 3.7 for x86-64 (SystemV ABI, so the first arg is in rdi). Results on other architectures will be essentially the same:

int data_s[65536];
int *data_p;

void store_s(int val) {  data_s[val] = val; }
        movsxd  rax, edi                        ; sign-extend
        mov     dword ptr [4*rax + data_s], eax
        ret

void store_p(int val) {  data_p[val] = val; }
        movsxd  rax, edi
        mov     rcx, qword ptr [rip + data_p]   ; the extra level of indirection
        mov     dword ptr [rcx + 4*rax], eax
        ret

Ok, so there's overhead of one extra load. (mov r64, [rel data_p]). Depending on what other static/global objects data_p is stored near, it may tend to stay hot in cache even if we're not using it often. If it's in a cache line with no other frequently-accessed data, it's wasting most of that cache line, though.

The overhead is only paid once per function call, though, even if there's a loop. (Unless the array is an array of pointers, since C aliasing rules require the compiler to assume that any pointer might be pointing to data_p, unless it can prove otherwise. This is the main performance danger when using global pointers-to-pointers.)

If you don't use restrict, the compiler still has to assume that the buffers pointed to by int *data_p1 and int *data_p2 could overlap, though, which interferes with autovectorization, loop unrolling, and many other optimizations. Static buffers can't overlap with other static buffers, but restrict is still needed when using a static buffer and a pointer in the same loop.
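
A minimal sketch of that (hypothetical names, not the buffers above): the restrict qualifiers promise the two buffers don't overlap, so the compiler is free to vectorize and unroll the loop.

void copy_ints(int *restrict dst, const int *restrict src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
}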

Anyway, let's have a look at the code for very simple memset-style loops:

void loop_s(int val) { for (int i=0; i<65536; ++i) data_s[i] = val; }
        mov     rax, -262144      ; loop counter, counting up towards zero
.LBB3_1:                                # =>This Inner Loop Header: Depth=1
        mov     dword ptr [rax + data_s+262144], edi
        add     rax, 4
        jne     .LBB3_1
        ret

Note that clang uses a non-RIP-relative effective address for data_s here, because it can.

void loop_p(int val) { for (int i=0; i<65536; ++i) data_p[i] = val; }
        mov     rax, qword ptr [rip + data_p]
        xor     ecx, ecx
.LBB4_1:                                # =>This Inner Loop Header: Depth=1
        mov     dword ptr [rax + 4*rcx], edi
        add     rcx, 1
        cmp     rcx, 65536
        jne     .LBB4_1
        ret

Note the mov rax, qword [rip + data_p] in loop_p, and the less efficient loop structure because it uses the loop counter as an array index.

gcc 5.3 has much less difference between the two loops: it gets the start address into a register and increments it, with a compare against the end address. So it uses a one-register addressing mode for the store, which may perform better on Intel CPUs. The only difference in loop structure / overhead for gcc is that the static buffer version gets the initial pointer into a register with a mov r32, imm32, rather than a load from memory. (So the address is an immediate constant embedded in the instruction stream.)
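
In C terms, gcc's loop is roughly equivalent to this sketch (my reconstruction, not actual compiler output):

void loop_s_gcc_style(int val) {
    int *p = data_s;            // for the static buffer, a mov r32, imm32
    int *end = data_s + 65536;  // end address to compare against
    do { *p++ = val; } while (p != end);   // one-register addressing mode for the store
}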


In shared-library code, and on OS X, where all executables must be position-independent, gcc's way is the only option. Instead of mov r32, imm32 to get the address into a register, it would use lea r64, [RIP + displacement]. The opportunity to save an instruction by embedding the absolute address into other instructions is gone when you need to offset the address by a variable amount (e.g. array index). If the array index is a compile-time constant, it can be folded into the displacement for a RIP-relative load or store instead of LEA. For a loop over an array, this could only happen with full unrolling, and is thus unlikely.

The extra level of indirection is still there with a pointer to dynamically allocated memory, though. There's still a chance of a cache or TLB miss when doing a load instead of an LEA.

Note that global data (as opposed to static) has an extra level of indirection through the global offset table, but that's on top of the indirection or lack thereof from dynamic allocation. Compiling with gcc 5.3 -fPIC:

int global_data_s[65536];
int access_global_s(int i){return global_data_s[i];}
    mov     rax, QWORD PTR global_data_s@GOTPCREL[rip]      ; load from a RIP-relative address, instead of an LEA
    movsx   rdi, edi
    mov     eax, DWORD PTR [rax+rdi*4]       ; load, indexing the array
    ret

int *global_data_p;
int access_global_p(int i){return global_data_p[i];}
    mov     rax, QWORD PTR global_data_p@GOTPCREL[rip]      ; extra layer of indirection through the GOT
    movsx   rdi, edi
    mov     rax, QWORD PTR [rax]           ; load the pointer (the usual layer of indirection)
    mov     eax, DWORD PTR [rax+rdi*4]     ; load, indexing the array
    ret

If I understand this correctly, the compiler doesn't assume that the definitions of global symbols in the current compilation unit are the definitions that will actually be used at link time. So the RIP-relative offset isn't a compile-time constant. Thanks to runtime dynamic linking, it's not a link-time constant either, so an extra level of indirection through the GOT is used. This is unfortunate, and I hope compilers on OS X don't have this much overhead for global variables. With -O0 -fwhole-program on godbolt, I can see that even the globals are accessed with just RIP-relative addressing, not through the GOT, so perhaps that option is even more valuable than usual when making position-independent executables. Hopefully link-time optimization works too, because that could be used when making shared libraries.


As many other answers have pointed out, there are other important factors, like memory locality, and the overhead of actually doing the allocate/free. These don't matter much for a large buffer (multiple pages) that's allocated once at program startup. See the comments on Peter A. Schneider's answer.

Dynamic allocation can give significant benefits, though, if you end up using the same memory as scratch space for multiple different things, so it stays hot in cache. Giving each function its own static buffer for scratch space is often a bad move if they aren't needed simultaneously: the dirty memory has to get written back to main memory when it's no longer needed, and is part of the program's footprint forever.

A good way to get small scratch buffers without the overhead of malloc (or new) is to create them on the stack (e.g. as local array variables). C99 allows variable-sized local arrays, like foo(int n) { int buf[n]; ...; }. Be careful not to overdo it and run out of stack space, but the current stack page is going to be hot in the TLB.

The _local functions in my godbolt links allocate a variable-sized array on the stack, which has some overhead for re-aligning the stack to a 16B boundary after adding a variable size. It looks like clang takes care to mask off the sign bit, but gcc's output looks like it will just break in fun and exciting ways if n is negative. (In godbolt, use the "binary" button to get disassembler output, instead of the compiler's asm output, because the disassembly uses hex for immediate constants. e.g. clang's movabs rcx, 34359738352 is 0x7fffffff0.)

Even though it takes a few instructions, it's much cheaper than malloc. A medium to large allocation with malloc, like 64kiB, will typically make an mmap system call. But this is the cost of allocation, not the cost of accessing once allocated.
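
A minimal sketch of the stack-scratch-buffer idea (the function name and size guard are hypothetical):

#include <stddef.h>

long sum_doubled(const int *vals, size_t n) {
    if (n == 0 || n > 4096) return 0;   // guard: don't blow the stack with a huge VLA
    int scratch[n];                     // C99 VLA on the already-hot stack page
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        scratch[i] = vals[i] * 2;
        total += scratch[i];
    }
    return total;
}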

Having the buffer on the stack means the stack pointer itself is the base address for indexing into it. This means it doesn't take an extra register to hold that pointer, and it doesn't have to be loaded from anywhere.


If any elements are statically initialized to non-zero in a static (or global), the entire array or struct will be sitting there in the executable, which is a big waste of space if most entries should be zero at program startup. (Or if the data can be computed on the fly quickly.)

On some systems, having a gigantic zero-initialized static array doesn't cost you anything as long as you never even read the parts you don't need. Lazy mapping of memory means the OS maps all the pages of your giant array to the same page of zeroed memory, and does copy-on-write. Taking advantage of this would be an ugly performance hack to be used only if you were sure you really wanted it, and were sure your code would never run on a system where your char data[1<<30] would actually use that much memory right away.
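
For illustration (a hypothetical pair of globals, on a typical ELF/Mach-O toolchain):

char table_nonzero[65536] = { 1 };  // stored in full in the executable (.data)
char table_zero[65536];             // .bss: no space on disk, pages mapped lazily as zeros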


Each memory access is equivalent to about 40 CPU clock cycles.

This is nonsense. The latency can be anywhere from 3 or 4 cycles (L1 cache hit) to multiple hundreds of cycles (main memory), or even a page fault requiring a disk access. Other than a page fault, much of this latency can overlap with other work, so the impact on throughput can be much lower. A load from a constant address can start as soon as the instruction issues into the out-of-order core, since it's the start of a new dependency chain. The size of the out-of-order window is limited (an Intel Skylake core has a Re-Order Buffer of 224 uops, and can have 72 loads in flight at once). A full cache miss (or worse, a TLB miss followed by a cache miss) often does stall out-of-order execution. See http://agner.org/optimize/, and other links in the wiki. Also see this blog post about the impact of ROB size on how many cache misses can be in flight at once.

Out-of-order cores for other architectures (like ARM and PPC) are similar, but in-order cores suffer more from cache misses because they can't do anything else while waiting. (Big x86 cores like Intel and AMD's mainstream microarchitectures (not the low-power Silvermont or Jaguar microarchitectures) have more out-of-order execution resources than other designs, though. AFAIK, most ARM cores have significantly smaller buffers for starting independent loads early and/or hiding cache-miss latency.)

Peter Cordes
  • 1
    ARM has fewer buffers because its original market was: CPU + custom logic in a [small] FPGA for "smart" controller applications. Intel/AMD has always ridden the "high end" PC curve. Only since ARM on cell phones [and soon servers] has it jumped into the high end. Intel is ahead a generation or two in process tech [# of gates], so has more gates avail to do sophisticated things. Intel has been tuning its cache/re-order/etc stuff for decades. ARM is new to this and playing catchup in both gate count and designing the logic to implement it. – Craig Estey Jan 19 '16 at 05:33
  • @CraigEstey: yep, exactly. Intel's aggressive out-of-order designs are appropriate for desktop/server CPUs, and high-power mobile, but not for the power budgets of most ARM cores. AFAIK, ARM cores are well designed for their power budget / transistor budget (die area = mfg cost), and make good tradeoffs for their design goals. Intel's process advantage is shrinking a bit, as they've had to back off from their roadmap projections of further shrinks. (e.g. Skylake followed by another 14nm design, instead of a shrink to 10nm gate width.) – Peter Cordes Jan 19 '16 at 06:09
  • Some people on http://realworldtech.com/ forums a few months ago were claiming that Intel's process advantage is nearly gone, but that's probably still an overstatement from people being overly optimistic about ARM displacing x86 for servers in the near future. Anyway, I wrote the last paragraph the way I did because I wasn't sure if someone had maybe come along and designed a really aggressive OOO ARM core that comes close to Intel's top-end designs. There's [AMD's new ARM Opteron](http://arstechnica.com/information-technology/2016/01/amds-datacenter-arm-processors-finally-hit-the-market/) – Peter Cordes Jan 19 '16 at 06:12
  • Intel uses a [so called] "tick tock" strategy [since 2007]. Each "tick" is a die shrink. Each "tock" is a logic/architecture improvement. Yep, they're slowing down. But, it's a double risk to shrink _and_ change the logic (e.g. remember the `fdiv` bug?). ARM's low power is what is making it attractive for data centers [less waste heat]. Also, data centers are as much memory/disk bound, so, architecturally, many more simpler ARM cores (e.g. [say] 8 ARMs replace one x86), ala what RAID did for disks. – Craig Estey Jan 19 '16 at 06:32
  • Intel's also going the other way: partnering [with Altera(?), IMO] to add an FPGA to an Intel CPU package. Imagine tightly coupled logic that can interact with the CPU on a cycle-by-cycle basis. Imagine a merge sort using the "Batcher" algorithm, done inside the FPGA. The FPGA microload being able to be reloaded on a context switch, just like the floating point unit--yum. Intel is _still_ ahead, partly, because its hafnium based "3D" process tech is farther along. Of course, anybody making a breakthrough in graphene or optical will be a big player. – Craig Estey Jan 19 '16 at 06:39
  • My favorite dark horse is HP and its "memristor" tech [which they keep pushing out the delivery dates on]. Non-volatile RAM at the pricing and density of flash, but the access time of L2 cache. They're designing an entirely new SOC around this (i.e. no more need for paging to disk) – Craig Estey Jan 19 '16 at 06:43
  • @CraigEstey: oh right, yeah I was temporarily forgetting that the workloads its designed for can saturate memory with multiple cores, rather than many in-flight ops from fewer cores. As you say, server / datacenter workloads are usually highly parallel. I think POWER (designed more for scientific computing) might be a bit more aggressively OOO than ARM, but only up to the "sweet spot" where more cores is better. Intel cares more about interactive use (still often bound by single-thread throughput) than other designers, so they go really aggressively OOO, more than many-core. – Peter Cordes Jan 19 '16 at 06:43
  • @Craig: there is a lot of cool stuff potentially coming in the future. Big *and* fast memory that's cheap and low-power enough to use will be a game-changer. I'm also looking fwd to FPGAs. There are a lot of bit-twiddling things that FPGAs can do much more efficiently than CPUs, including crypto, Galois multiplies (for error-correction codes), and probably some video-encode acceleration. Designing good APIs to minimize context-switch save&restore of FPGA state will be interesting. (e.g. good decisions on when *not* to use the FPGA if it's already in use and there's a decent SIMD fallback..) – Peter Cordes Jan 19 '16 at 06:50
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/101040/discussion-between-craig-estey-and-peter-cordes). – Craig Estey Jan 19 '16 at 06:50
  • 1
    Maybe I missed it in your answer but it's OS dependent and not just SystemV ABI dependent. OSX does not allow absolute 32-bit addressing, so it will use RIP-relative addressing anyway. So I think parts of your answer would change if this was for OSX and not Linux. – Z boson Jan 19 '16 at 21:38
  • @Zboson: good point. That applies when making position-independent code for Linux, too. Updated my answer. I'm still not 100% sure I understand why the compiler needs to do indirect accesses through the GOT for globals defined in the current compilation unit. I'm guessing it's because another definition could take precedence at run-time. – Peter Cordes Jan 19 '16 at 23:08
-2

I would say you really should profile it. Theoretically you are right but there are some basic things you have to remember.

C is a high-level language, like many that exist today, and you tell the machine what to do. Getting closer to machine code would mean considering ASM or similar. When you build code, through compiling and linking or whatever, the compiler will try its best to correctly run what you demand and to optimize it (unless you tell it not to). Remember, there also exist concepts like Just-In-Time compilation (JIT).

So I consider it hard to answer your question. Of one thing you can be sure: a static array will most likely be faster, especially at a size of 65536, because there are more chances of optimization for the compiler. This might depend on what size you defined. For GCC, 65536 bytes seems to be common for stacks and caches, though I'm not sure. Some compilers might even tell you the array is too big, because they try to keep it in other levels of the memory hierarchy, like caches, which are also faster than random access memory.

Last but not least remember that modern operating systems also have their memory management using virtual memory.

Static memory can be stored in data segments and will most likely be loaded when the program is executed, but remember that this is also time you have to consider: should the OS allocate the memory when the program is started, or should you do it at runtime? It really depends on your application.

So I think you really should benchmark your results and see by how much faster it is. But as a tendency, I would say your static array will compile to code that is going to run faster.

Gerhard Stein
  • 1
    Statically allocated memory like this doesn't necessarily live on the stack. It generally lives in the data segment, which is a special place. That's why you can allocate massive global static arrays, but not massive local static arrays. – Cornstalks Jan 18 '16 at 05:28
  • 2
    Also, you talk a lot about the cost of allocating the memory, but the question is about accessing the memory, which is different. – Cornstalks Jan 18 '16 at 05:31
  • 2
    I don't get what you're trying to say with “C is a high-level language like many there exist today” and your reference to JIT. I think C is pretty much as close as you can get to the hardware if you're not going to write assembly by hand. And a C compiler will not be allowed to replace your statically allocated array with a heap allocation as the latter might fail so it would break the *as if* rule. – 5gon12eder Jan 18 '16 at 05:34
  • 2
    `I would say you really should profile it.` That's the right answer indeed, and had already been proposed in the very 2nd comment. Other than that, can you point to even _one_ sentence in your long post which at least _attempts_ to answer the original question? – dxiv Jan 18 '16 at 05:55
  • I don't think allocating static memory is necessarily faster or better than allocating it dynamically. The point is that, for example, if I have a static array of 10 integer elements but I only use 5 positions, it is a waste of memory, while with malloc and realloc you are somewhat capable of optimising the memory as you need it. I'm not a real expert on this by the way, that's just what I think. – Claudio Cortese Jan 18 '16 at 06:05
  • Profiling is less useful than inspecting the generated code to see if offset calculations are the same or not. – Jeff Hammond Jan 18 '16 at 06:46
  • @5gon12eder It is high level, because the instructions generated are a decision the compiler makes, while assembly translates your direct instructions to machine code (low-level). Even though you get very close to the hardware, there is still a layer of translation/compilation. It has to exist, otherwise portability would be too hard. – Gerhard Stein Jan 18 '16 at 16:24