In their well-known paper Producing Wrong Data Without Doing Anything Obviously Wrong!, the authors discuss the effect of a program's memory layout on its performance. In particular, they analyze the effect of the UNIX environment size (i.e. the total size of all environment variables), which is copied onto the program stack prior to execution and therefore shifts the stack addresses.
They present the following (constructed) example of a program which exhibits extreme sensitivity to the environment size.
static int i = 0, j = 0, k = 0;

int main() {
    int g = 0, inc = 1;
    for (; g < 65536; g++) {
        i += inc;
        j += inc;
        k += inc;
    }
    return 0;
}
The program is compiled with -O0 to avoid optimizations. The generated assembly code is:
1119: 55 push rbp
111a: 48 89 e5 mov rbp,rsp
111d: c7 45 f8 00 00 00 00 mov DWORD PTR [rbp-0x8],0x0
1124: c7 45 fc 01 00 00 00 mov DWORD PTR [rbp-0x4],0x1
112b: eb 37 jmp 1164 <main+0x4b>
112d: 8b 15 e1 2e 00 00 mov edx,DWORD PTR [rip+0x2ee1] # 4014 <i>
1133: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
1136: 01 d0 add eax,edx
1138: 89 05 d6 2e 00 00 mov DWORD PTR [rip+0x2ed6],eax # 4014 <i>
113e: 8b 15 d4 2e 00 00 mov edx,DWORD PTR [rip+0x2ed4] # 4018 <j>
1144: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
1147: 01 d0 add eax,edx
1149: 89 05 c9 2e 00 00 mov DWORD PTR [rip+0x2ec9],eax # 4018 <j>
114f: 8b 15 c7 2e 00 00 mov edx,DWORD PTR [rip+0x2ec7] # 401c <k>
1155: 8b 45 fc mov eax,DWORD PTR [rbp-0x4]
1158: 01 d0 add eax,edx
115a: 89 05 bc 2e 00 00 mov DWORD PTR [rip+0x2ebc],eax # 401c <k>
1160: 83 45 f8 01 add DWORD PTR [rbp-0x8],0x1
1164: 81 7d f8 ff ff 00 00 cmp DWORD PTR [rbp-0x8],0xffff
116b: 7e c0 jle 112d <main+0x14>
116d: b8 00 00 00 00 mov eax,0x0
1172: 5d pop rbp
1173: c3 ret
I attempted to reproduce their result by hooking __libc_start_main to set up performance counters before main executes.
While I was able to reproduce the extreme increase in consumed CPU cycles (around 44%) for a specific environment size, I am not able to fully explain it.
(Note that I disabled ASLR and CPU frequency scaling for reproducibility.)
For the critical environment size, the variables in the example program are located at the following addresses:
&i = 0x555555558014
&j = 0x555555558018
&k = 0x55555555801c
&g = 0x7fffffffe018
&inc = 0x7fffffffe01c
Using Linux's perf API, I was able to determine that, on average, this particular stack offset causes a slight increase in L1 data cache misses.
The Haswell CPU I am running the benchmark on has a 32 KiB, 8-way set-associative L1 data cache. Thus, with a cache line size of 64 B and 8 lines per set, there are 32768 / (8 × 64) = 64 sets, and the 6 bits above the lowest 6 bits of the address form the cache set index. Applied to the above addresses, we find that the stack variables map to the same cache set as the global static ones:
&i = 0x555555558014 --> page offset = 0x014, index = 0x00, offset = 0x14
&j = 0x555555558018 --> page offset = 0x018, index = 0x00, offset = 0x18
&k = 0x55555555801c --> page offset = 0x01C, index = 0x00, offset = 0x1C
&g = 0x7fffffffe018 --> page offset = 0x018, index = 0x00, offset = 0x18
&inc = 0x7fffffffe01c --> page offset = 0x01C, index = 0x00, offset = 0x1C
This would still hold if we shifted the stack pointer by 16 bytes (up or down does not matter), since the stack variables would still reside on the same cache line. In that case, however, the performance takes only a negligible hit compared to the average case.
What could be the reason for this strange behaviour? Or am I missing something, and the slowdown isn't caused by the L1 cache miss rate at all?