
In their well-known paper "Producing Wrong Data Without Doing Anything Obviously Wrong!", the authors discuss the effect of a program's layout in memory on its performance. In particular, they analyze the effect of the UNIX environment size (i.e., the combined size of all environment variables), which is placed on the program stack prior to execution and therefore shifts the stack addresses.
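To illustrate the mechanism (my own minimal sketch, not code from the paper): with ASLR disabled, the address of a stack variable moves as the environment grows.

#include <stdio.h>

/* Sketch (not from the paper): print the address of a local variable.
 * With ASLR disabled, running this with progressively larger environments
 * shifts the reported address by the extra bytes the kernel places on the
 * initial stack (typically in multiples of the 16-byte stack alignment). */
int main(void) {
    int local = 0;
    printf("&local = %p\n", (void *)&local);
    return 0;
}

Running it e.g. as env -i ./a.out versus env -i PAD=xxxxxxxxxxxxxxxx ./a.out shows the address shifting roughly by the size of the added variable.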

They present the following (constructed) example of a program that exhibits extreme sensitivity to the environment size:

static int i = 0, j = 0, k = 0;
int main() {
    int g = 0, inc = 1;
    for (; g<65536; g++) {
        i += inc;
        j += inc;
        k += inc;
    }
    return 0;
}

The program is compiled with -O0 to avoid optimizations. The generated assembly code for main is:

1119:       55                      push   rbp
111a:       48 89 e5                mov    rbp,rsp
111d:       c7 45 f8 00 00 00 00    mov    DWORD PTR [rbp-0x8],0x0
1124:       c7 45 fc 01 00 00 00    mov    DWORD PTR [rbp-0x4],0x1
112b:       eb 37                   jmp    1164 <main+0x4b>
112d:       8b 15 e1 2e 00 00       mov    edx,DWORD PTR [rip+0x2ee1]        # 4014 <i>
1133:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
1136:       01 d0                   add    eax,edx
1138:       89 05 d6 2e 00 00       mov    DWORD PTR [rip+0x2ed6],eax        # 4014 <i>
113e:       8b 15 d4 2e 00 00       mov    edx,DWORD PTR [rip+0x2ed4]        # 4018 <j>
1144:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
1147:       01 d0                   add    eax,edx
1149:       89 05 c9 2e 00 00       mov    DWORD PTR [rip+0x2ec9],eax        # 4018 <j>
114f:       8b 15 c7 2e 00 00       mov    edx,DWORD PTR [rip+0x2ec7]        # 401c <k>
1155:       8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
1158:       01 d0                   add    eax,edx
115a:       89 05 bc 2e 00 00       mov    DWORD PTR [rip+0x2ebc],eax        # 401c <k>
1160:       83 45 f8 01             add    DWORD PTR [rbp-0x8],0x1
1164:       81 7d f8 ff ff 00 00    cmp    DWORD PTR [rbp-0x8],0xffff
116b:       7e c0                   jle    112d <main+0x14>
116d:       b8 00 00 00 00          mov    eax,0x0
1172:       5d                      pop    rbp
1173:       c3                      ret

I attempted to reproduce their result by hooking __libc_start_main to set up performance counters before executing main. While I was able to reproduce the extreme increase in CPU cycles used (around 44%) for a specific environment size, I am not able to fully explain it. (Note that I disabled ASLR and CPU frequency scaling for reproducibility.) For the critical environment size, the variables in the example program are located at the following addresses:

&i       = 0x555555558014
&j       = 0x555555558018
&k       = 0x55555555801c
&g       = 0x7fffffffe018
&inc     = 0x7fffffffe01c
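A rough sketch of such a hook (not my exact code; glibc's __libc_start_main prototype varies slightly between versions, and error handling is omitted):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef int (*main_t)(int, char **, char **);
typedef int (*start_main_t)(main_t, int, char **, void (*)(void),
                            void (*)(void), void (*)(void), void *);

static main_t real_main;

static int hooked_main(int argc, char **argv, char **envp) {
    /* Open a hardware cycle counter for this process, enable it right
     * before main runs and read it afterwards. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    int ret = real_main(argc, argv, envp);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));
    fprintf(stderr, "cycles: %lld\n", cycles);
    return ret;
}

/* Interpose glibc's __libc_start_main so that the wrapper above runs
 * instead of the real main. */
int __libc_start_main(main_t main_fn, int argc, char **argv,
                      void (*init)(void), void (*fini)(void),
                      void (*rtld_fini)(void), void *stack_end) {
    start_main_t real_start = (start_main_t)dlsym(RTLD_NEXT, "__libc_start_main");
    real_main = main_fn;
    return real_start(hooked_main, argc, argv, init, fini, rtld_fini, stack_end);
}

Built as a shared object (gcc -shared -fPIC) and loaded via LD_PRELOAD, this counts cycles for main only.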

Using Linux's perf API, I was able to determine that, on average, this particular stack offset caused a slight increase in L1 data cache misses.

The Haswell CPU I am running the benchmark on has a 32 KiB, 8-way set-associative L1 data cache. With a cache-line size of 64 B that gives 32 KiB / 64 B = 512 lines and 512 / 8 = 64 sets, so the 6 bits above the lowest 6 offset bits of an address form the cache set index. Applied to the above addresses, we find that the stack variables map to the same cache set as the global static ones:

&i       = 0x555555558014 --> page offset = 0x014, index = 0x00, offset = 0x14
&j       = 0x555555558018 --> page offset = 0x018, index = 0x00, offset = 0x18
&k       = 0x55555555801c --> page offset = 0x01C, index = 0x00, offset = 0x1C
&g       = 0x7fffffffe018 --> page offset = 0x018, index = 0x00, offset = 0x18
&inc     = 0x7fffffffe01c --> page offset = 0x01C, index = 0x00, offset = 0x1C
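This decomposition can be checked with a small helper (a sketch; the shift and mask values follow from the 64 B line size and 64 sets):

#include <stdint.h>
#include <stdio.h>

/* Sketch: split an address into page offset (bits 11:0), L1 set index
 * (bits 11:6) and line offset (bits 5:0), matching the table above. */
static void decompose(const char *name, uintptr_t addr) {
    printf("%-4s page offset = 0x%03lx, index = 0x%02lx, offset = 0x%02lx\n",
           name,
           (unsigned long)(addr & 0xfff),
           (unsigned long)((addr >> 6) & 0x3f),
           (unsigned long)(addr & 0x3f));
}

int main(void) {
    decompose("&i", (uintptr_t)0x555555558014);
    decompose("&g", (uintptr_t)0x7fffffffe018);
    return 0;
}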

This statement would still be true if the stack pointer were shifted by 16 bytes (whether up or down), since the stack variables would then still reside in the same cache line. In that case, however, the performance takes only a negligible hit compared to the average case.

What could be the reason for this strange behaviour? Or am I missing something and the reason for the slowdown isn't the L1 cache miss rate at all?

ConfusedProgrammer
  • That paper was written 14 years ago. Memory has gotten much cheaper so we put more memory in them, so computers are much less sensitive to issues like this. – Barmar Aug 01 '23 at 17:58
  • @Barmar Not only does this not answer the question or provide any additional insight, it is even wrong, given that I was able to reproduce the effect. – ConfusedProgrammer Aug 01 '23 at 18:03
  • @Barmar Also, the amount of (main) memory and its price is irrelevant with regard to the question. – ConfusedProgrammer Aug 01 '23 at 18:08
  • You must have compiled this with optimization disabled, since none of these are `volatile`. The bottleneck should be store-forwarding latency, not commit to L1d cache. (Unless you get some misses early on; with a larger iteration count do the timings settle down to not depend on stack alignment?) Sandybridge-family's variable latency for store-forwarding is the same issue behind [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) – Peter Cordes Aug 01 '23 at 18:16
  • You could run your program under `perf stat -e cycles,instructions,exe_activity.bound_on_stores,resource_stalls.sb,mem_load_retired.l1_miss` to see if the store buffer was getting full. (There aren't store versions of counters like `mem_load_retired.l1_miss` since commit to L1d happens after store instructions retire from OoO exec. But actually, if stores were evicting cache lines, that would cause an L1d miss for a load, since all your variables are getting loaded + stored except `inc` is pure loads.) – Peter Cordes Aug 01 '23 at 18:20
  • IDK why this program would be sensitive to stack alignment; that's surprising. Haswell doesn't even have cache-bank conflicts the way 1st-gen Sandybridge does. (Which could account for 16-byte conflicts.) – Peter Cordes Aug 01 '23 at 18:23
  • Oh, same offset in different pages (not just same cache line): I bet you're seeing 4k aliasing, where the store isn't disambiguated as fast, so for a couple cycles each iteration the CPU thinks the reads of `inc` are reloading a store of `i`, `j`, or `k`, and has to wait for the fallback compare of the full address. See [Understanding 4K aliasing on Intel CPU's](https://stackoverflow.com/q/54415140) / [L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes](https://stackoverflow.com/q/25774190) – Peter Cordes Aug 01 '23 at 18:27
  • The `perf` event for 4K alias problems is `ld_blocks_partial.address_alias` (at least on Skylake; hopefully Haswell has the same counter.) – Peter Cordes Aug 01 '23 at 18:34
  • Are you sure the addresses you show are for the slow case? I'd have expected the slow case to be when stack addresses are higher by `0x20`, like `...030` and `...034`, the same offset-within-page as `j` and `k`. If the slow addresses are `...010` and `...014`, I might have been hasty in closing as duplicates! – Peter Cordes Aug 01 '23 at 18:37
  • @PeterCordes Your expectation was completely right, the addresses were indeed incorrect (due to the way I obtained them: by printf'ing them, the stack addresses were shifted relative to the version of the code without the printf [which I used for the benchmark]). So this effect indeed appears to be due to 4K aliasing. – ConfusedProgrammer Aug 01 '23 at 19:40
  • Cool, that explains the mystery. Non-intrusive tools like debuggers are very useful :P – Peter Cordes Aug 01 '23 at 19:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254755/discussion-between-confusedprogrammer-and-peter-cordes). – ConfusedProgrammer Aug 01 '23 at 19:48

0 Answers