The performance of the two approaches is nearly the same because the inner-most loop can be optimized the same way in both cases and the computation is likely memory-bound. This means the overhead of the indirection is diluted in the rest of the computation, which takes most of the time and is subject to variations that can actually be bigger than the overhead itself. Thus, the benchmark is not very sensitive to the difference between the two methods. Here is the assembly code of the inner-most loop (left side: matrix, right side: array of array):
.LBB0_17: .LBB1_30:
movdqa xmm5, xmm1 movdqa xmm5, xmm1
paddq xmm5, xmm4 paddq xmm5, xmm4
movdqa xmm6, xmm0 movdqa xmm6, xmm0
paddq xmm6, xmm4 paddq xmm6, xmm4
shufps xmm5, xmm6, 136 shufps xmm5, xmm6, 136
movdqa xmm6, xmm3 movdqa xmm6, xmm3
paddq xmm6, xmm1 paddq xmm6, xmm1
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm0 paddq xmm7, xmm0
shufps xmm6, xmm7, 136 shufps xmm6, xmm7, 136
movups xmmword ptr [rdi + 4*rbx - 48], xmm5 movups xmmword ptr [rsi + 4*rcx], xmm5
movups xmmword ptr [rdi + 4*rbx - 32], xmm6 movups xmmword ptr [rsi + 4*rcx + 16], xmm6
movdqa xmm5, xmm0 movdqa xmm5, xmm0
paddq xmm5, xmm10 paddq xmm5, xmm10
movdqa xmm6, xmm1 movdqa xmm6, xmm1
paddq xmm6, xmm10 paddq xmm6, xmm10
movdqa xmm7, xmm3 movdqa xmm7, xmm3
paddq xmm7, xmm6 paddq xmm7, xmm6
paddq xmm6, xmm4 paddq xmm6, xmm4
movdqa xmm2, xmm3 movdqa xmm2, xmm3
paddq xmm2, xmm5 paddq xmm2, xmm5
paddq xmm5, xmm4 paddq xmm5, xmm4
shufps xmm6, xmm5, 136 shufps xmm6, xmm5, 136
shufps xmm7, xmm2, 136 shufps xmm7, xmm2, 136
movups xmmword ptr [rdi + 4*rbx - 16], xmm6 movups xmmword ptr [rsi + 4*rcx + 32], xmm6
movups xmmword ptr [rdi + 4*rbx], xmm7 movups xmmword ptr [rsi + 4*rcx + 48], xmm7
add rbx, 16 add rcx, 16
paddq xmm1, xmm11 paddq xmm1, xmm11
paddq xmm0, xmm11 paddq xmm0, xmm11
add rbp, 2 add rax, 2
jne .LBB0_17 jne .LBB1_30
As we can see, the loop basically contains the same instructions for the two methods. The order of the stores (movups) is not the same, but this should not impact the execution time (especially if the array is aligned in memory). The same applies to the different register names. The loop is vectorized using SIMD instructions (SSE) and unrolled 4 times, so it can be pretty fast (4 items computed per SIMD instruction and 16 items per iteration). About 62 iterations are needed for the inner-most loop to complete (1000 items / 16 items per iteration).
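For reference, here is a minimal C++ sketch of the kind of loops such assembly could come from. The exact benchmark code is not reproduced here, so the names, the 1000x1000 size and the stored values are assumptions for illustration only:
#include <cstddef>
#include <vector>

constexpr std::size_t N = 1000;

// Matrix: a single flat, contiguous buffer of N*N ints.
void fill_matrix(std::vector<int>& mat) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)       // inner-most loop: contiguous stores
            mat[i * N + j] = static_cast<int>(i * N + j);
}

// Array of arrays: N separately allocated rows of N ints each.
void fill_array_of_array(std::vector<std::vector<int>>& rows) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)       // same inner-most loop after the row indirection
            rows[i][j] = static_cast<int>(i * N + j);
}
In both functions the inner-most loop is a long run of contiguous 4-byte stores, which is why the compiler can vectorize and unroll it the same way in the two cases.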
That being said, in both cases, the loop writes 4*1000*1000 = 3.81 MiB of data. This typically fits in the L3 cache on relatively recent processors (or in the RAM on older ones). The throughput of the L3/RAM available to a single core is limited (far lower than that of the L1 or even the L2 cache), so one core will likely stall waiting for the memory hierarchy. As a result, the loops are not that fast since they spend most of their time waiting for the memory hierarchy. Hardware prefetchers are pretty efficient on modern x86-64 processors, so they can prefetch data before a core actually requests it, especially for stores and when the written data is contiguous.
The array-of-array method is generally less efficient because each sub-array is not guaranteed to be allocated contiguously. Modern memory allocators typically use a bucket-based strategy to find memory blocks fitting the requested size. In a program like this benchmark, the requested memory can be contiguous (or very close to it) since all the arrays are allocated in a row and the bucket memory is generally not fragmented when a program starts. However, when the memory is fragmented, the sub-arrays tend to be located in non-contiguous regions, causing an effect sometimes called memory diffusion. Memory diffusion makes it harder for prefetchers to be efficient, resulting in less efficient loads/stores. This is generally especially true for loads, but stores also cause loads here on most x86-64 processors (Intel processors and recent AMD ones) due to the write-allocate cache policy. Put shortly, this is one of the main reasons why the array-of-array method is generally less efficient in applications. But it is not the only one: the other comes from the indirections.
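One way to observe this (a sketch, not part of the original benchmark) is to print the distance between the data pointers of consecutive sub-arrays right after allocation; on a fresh heap the gaps are usually small and regular, while on a fragmented heap they become large and irregular:
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t N = 1000;
    std::vector<std::vector<int>> rows(N, std::vector<int>(N));

    // Distance in bytes between the storage of consecutive sub-arrays.
    for (std::size_t i = 1; i < 5; ++i) {
        const auto diff = reinterpret_cast<const char*>(rows[i].data())
                        - reinterpret_cast<const char*>(rows[i - 1].data());
        std::printf("row %zu -> row %zu: %td bytes apart\n", i - 1, i, diff);
    }
}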
The overhead of the additional indirections is pretty small in this benchmark, mainly because of the memory-bound inner loop. The pointers of the sub-arrays are stored contiguously (1000 pointers of 8 bytes, i.e. about 8 KB), so they can fit in the L1 cache and be efficiently prefetched. This means the indirections can be fast because they are unlikely to cause a cache miss. They do add load instructions, but since most of the time is spent waiting for the L3 cache or the RAM, their overhead is very small, if not negligible. Indeed, modern processors execute instructions in parallel and out of order, so the L1 accesses can be overlapped with the L3/RAM loads/stores. For example, Intel processors have dedicated units for that: the Line Fill Buffers (between the L1 and L2 caches), the Super-Queue (between the L2 and L3 caches) and the Integrated Memory Controller (between the L3 and the RAM). Most of these operations are done more or less asynchronously. That being said, things start to become synchronous when cores stall waiting for incoming data or when the buffers/queues are saturated.
The overhead of the indirections can become visible with a smaller inner-most loop or if the 2D array is traversed non-contiguously. Indeed, if the inner-most loop only computes a few items, or if it is even replaced with a single statement, then the overhead of the indirections is much more visible. The processor cannot (easily) overlap it and the array-of-array method becomes slower than the matrix-based approach. Here is the result of this new benchmark. The gap between the two methods seems small, but one should keep in mind that the cache is hot during the benchmark while it may not be in real-world applications. A cold cache benefits the matrix-based method, which needs less data to be loaded from the cache (there is no array of pointers to load).
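As an illustration, here is a sketch of such a modified benchmark where the inner-most loop is reduced to a single statement (one store per row). This is an assumption about the shape of the code (reusing N and the container types of the previous sketch), not necessarily the exact code behind the linked results:
// One store per row: the per-row indirection is no longer amortized
// over a long contiguous inner loop.
void fill_first_item_matrix(std::vector<int>& mat) {
    for (std::size_t i = 0; i < N; ++i)
        mat[i * N] = static_cast<int>(i);
}

void fill_first_item_aoa(std::vector<std::vector<int>>& rows) {
    for (std::size_t i = 0; i < N; ++i)
        rows[i][0] = static_cast<int>(i);   // loads the row pointer before every store
}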
To understand why the gap is not so huge, we need to analyse the assembly code again. The full assembly code can be seen on Godbolt. Clang uses 3 different strategies to speed up the loop (SIMD, scalar+unrolling and scalar), but the unrolled scalar one is the one that should actually be used in this case. Here is the hot loop for the matrix-based method:
.LBB0_27:
mov dword ptr [r12 + rdi], esi
lea ebx, [rsi + 1]
mov dword ptr [r12 + rdx], ebx
lea ebx, [rsi + 2]
mov dword ptr [r12 + rcx], ebx
lea ebx, [rsi + 3]
mov dword ptr [r12 + rax], ebx
add rsi, 4
add r12, r8
cmp rsi, r9
jne .LBB0_27
Here is the one for the array-of-array method:
.LBB1_28:
mov rbp, qword ptr [rdi - 48]
mov dword ptr [rbp], edx
mov rbp, qword ptr [rdi - 32]
lea ebx, [rdx + 1]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi - 16]
lea ebx, [rdx + 2]
mov dword ptr [rbp], ebx
mov rbp, qword ptr [rdi]
lea ebx, [rdx + 3]
mov dword ptr [rbp], ebx
add rdx, 4
add rdi, 64
cmp rsi, rdx
jne .LBB1_28
At first glance, the second one seems clearly less efficient because there are far more instructions to execute. But as said previously, modern processors execute instructions in parallel. Thus, the instruction dependencies, and especially the critical path, play a significant role in the resulting performance (e.g. dependency chains), not to mention the saturation of the processor units, and more specifically the saturation of the back-end ports of the cores. Since the performance of these loops is strongly dependent on the target architecture, we should consider a specific processor architecture in order to analyse how fast each method is in this case. Let's choose a relatively recent mainstream architecture: Intel Coffee Lake.
The first loop is clearly bounded by the store instructions (mov dword ptr [...], ...) since there is only 1 store port on this architecture, while the lea and add instructions can be executed on multiple ports (and the cmp+jne pair is cheap because it can be macro-fused and predicted). With 4 stores per iteration and 1 store issued per cycle, the loop should take 4 cycles per iteration unless it is bound by the memory hierarchy.
The second loop is more complex but it is also bounded by the store instructions (mov dword ptr [rbp], ...). Indeed, Coffee Lake has two load ports, so 2 mov rbp, qword ptr [...] instructions can be executed per cycle; the same is true for the lea, which can also be executed on 2 ports; the add and cmp+jne are still cheap. The number of instructions is not big enough to saturate the front-end, so the ports are the bottleneck here. In the end, this loop also takes 4 cycles per iteration (again, 4 stores per iteration), assuming the memory hierarchy is not a problem. The thing is that the scheduling of the instructions is not always perfect in practice, so the dependencies on the load instructions can introduce a significant latency if something goes wrong. Since there is a higher pressure on the memory hierarchy, a cache miss would cause the second loop to stall for many cycles, as opposed to the first loop (which only does writes). Not to mention that a cache miss is more likely to happen in the second case since there is an 8 KB buffer of pointers to keep in the L1 cache for this computation to be fast: loading items from the L2 takes a dozen cycles and loading data from the L3 can cause some cache lines to be evicted. This is why the second loop is slightly slower in this new benchmark.
What if we use another processor? The result can be significantly different, especially since Ice Lake (Intel) and Zen2 (AMD), as these architectures have 2 store ports. Things are pretty difficult to analyse on such processors (a single port may no longer be the bottleneck, and the back-end may not even be the bottleneck at all). This is especially true for Zen2/Zen3, which have 2 shared load/store ports and one port dedicated to stores (meaning 2 loads + 1 store scheduled in 1 cycle, or 1 load + 2 stores, or 3 stores). Thus, the best is certainly to run practical benchmarks on such platforms, while taking care to avoid benchmarking biases.
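A minimal timing harness for such a practical benchmark could look like the following sketch (it reuses N and the hypothetical fill functions sketched above; a serious benchmark should repeat the measurements many times and make sure the compiler does not optimize the work away):
#include <chrono>
#include <cstdio>
#include <vector>

template <class F>
static double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::vector<int> mat(N * N);
    std::vector<std::vector<int>> rows(N, std::vector<int>(N));

    // First calls warm up the caches (and trigger the page faults), then we measure.
    fill_first_item_matrix(mat);
    fill_first_item_aoa(rows);
    std::printf("matrix: %.3f ms\n", time_ms([&] { fill_first_item_matrix(mat); }));
    std::printf("aoa:    %.3f ms\n", time_ms([&] { fill_first_item_aoa(rows); }));
}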
Note that the memory alignment of the sub-arrays is pretty critical too. With N=1024, the matrix-based method can be significantly slower. This is because the memory layout of the matrix-based method is likely to cause cache thrashing in this case, while the array-of-array-based method typically adds some padding preventing this issue in this very specific case. The thing is that the added padding is typically sizeof(size_t) for mainstream bucket-based allocators, so the issue just shows up for another value of N and is not really prevented. In fact, for N=1022, the array-of-array-based method is significantly slower. This perfectly matches the above explanation since sizeof(size_t) = 2*sizeof(int) = 8 on the target machine (64-bit). Thus, both methods suffer from this issue, but it can easily be controlled with the matrix-based method by adding some padding, while it cannot easily be controlled with the array-of-array-based method because the implementation of the allocator depends on the platform by default.
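For example, here is a sketch of how the matrix-based method can add its own padding (the extra 16 ints per row is an arbitrary illustrative value, not a measured optimum):
#include <cstddef>
#include <vector>

// Pad the row stride so that elements of the same column in consecutive rows
// do not all map to the same cache sets when N is a power of two (e.g. N = 1024).
std::vector<int> make_padded_matrix(std::size_t n) {
    const std::size_t stride = n + 16;            // 16 ints = 64 extra bytes per row
    std::vector<int> mat(n * stride);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            mat[i * stride + j] = static_cast<int>(i * n + j);
    return mat;
}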