
I wrote some C++ code with deliberately dumb processing, just to test a few cache optimizations as part of studying code improvements, and I stumbled into something very weird...

Declaring the arrays as static, or outside the main function as globals, the code runs in 0.5 seconds (average); if I just move the arrays inside the main function, the same processing takes 15 seconds (average). I can't find out why, and all I'm finding is articles about how local variables are faster than globals.

Does anyone have any idea what's happening? I'm compiling with g++ on Windows, installed via MinGW, and running the code on a desktop with an i3-7100.

EDIT:

  1. The goal is not the speed improvement itself; these are just tests to study cache usage by moving, removing, or merging arrays. The optimization flags do turn this into the perfect code with the perfect speed, in fact. But what are they changing? What was wrong with the arrays' location that the flags are fixing?
  2. I'm compiling with just `g++ -o`.

EDIT 2:

I added the array initializations as some comments suggested, and the uninitialized version is indeed the slow one; the initialized version runs at the same speed as the global version. Why? What's happening under the hood?
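
For reference, here is a rough way to peek at what the uninitialized local arrays can actually contain (a diagnostic sketch only: reading uninitialized variables is undefined behavior, so this just shows what the stack happens to hold in an unoptimized build):

#include <cmath>
#include <iostream>

#define N 10000

int main()
{
    // Deliberately uninitialized: whatever bit patterns are left on the
    // stack get reinterpreted as doubles. Reading them is undefined
    // behavior; this is only a peek at what -O0 leaves there in practice.
    double garbage[N];
    int subnormals = 0, nans = 0, infs = 0, zeros = 0;
    for (int i = 0; i < N; i++) {
        switch (std::fpclassify(garbage[i])) {
            case FP_SUBNORMAL: subnormals++; break;
            case FP_NAN:       nans++;       break;
            case FP_INFINITE:  infs++;       break;
            case FP_ZERO:      zeros++;      break;
        }
    }
    std::cout << "subnormal=" << subnormals << " nan=" << nans
              << " inf=" << infs << " zero=" << zeros << std::endl;
}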

EDIT 3:

After some suggestions and explanations in the comments, I did some testing in a fair test environment created by @paddy and shared in the comments: https://godbolt.org/z/8ev85qjP5

Code:

#include <chrono>
#include <iostream>

#define TAM 10
#define N 10000

#ifdef USE_GLOBAL
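// Zero-initialized (= {} gives all 0.0); volatile stops the optimizer
// from deleting the loops that use these arrays.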
volatile double output[N] = {}, values[N] = {}, error[N] = {};
#endif

int main()
{   
#ifndef USE_GLOBAL
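    // Explicitly zero-initialized with = {}; without that, the locals
    // would hold whatever garbage is on the stack.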
    volatile double output[N] = {}, values[N] = {}, error[N] = {};
#endif

    std::cout << "Starting" << std::endl;
    auto t1 = std::chrono::high_resolution_clock::now();
    {
        for (int total = 0; total < TAM; total++) {
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
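                    // Parses as ((values[j] + error[j]) / i) + 1;
                    // note i == 0 divides by zero on the first pass.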
                    output[i] += (values[j] + error[j]) / i + 1;
                }
            }
        }
    }

    auto t2 = std::chrono::high_resolution_clock::now();

    auto duration =
        (std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1)
             .count());

    float time = (float)duration / 1000000;

    std::cout << "Processing time = " << time << " seconds."
              << std::endl;
}
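
If you want to run the same comparison with optimizations enabled instead of relying on volatile, an alternative Peter Cordes suggests in the comments below is to read an element the compiler can't predict after the timed region, so the loops can't be deleted as dead code. A sketch of that variant (the `argc % N` index is my adaptation of his `output[argc]` trick):

#include <chrono>
#include <iostream>

#define TAM 10
#define N 10000

int main(int argc, char **)
{
    double output[N] = {}, values[N] = {}, error[N] = {};

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int total = 0; total < TAM; total++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                output[i] += (values[j] + error[j]) / i + 1;  // as in the original
    auto t2 = std::chrono::high_resolution_clock::now();

    // Opaque consumer: the compiler doesn't know which element will be
    // read, so it has to compute all of output[] even at -O2 / -O3.
    volatile double sink = output[argc % N];
    (void)sink;

    std::cout << std::chrono::duration<double>(t2 - t1).count()
              << " seconds" << std::endl;
}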
  • 8
    How did you compile your code? My results with `-O2` are not the same. – sweenish Jul 06 '22 at 01:36
  • 3
    Static variables are initialized to zero (0.0 for floating-point). Local variables are not. – Kaz Jul 06 '22 at 01:36
  • 3
    Did you compile both with optimizations enabled, e.g. `-O3 -flto` or the like? Without optimizations enabled, performance testing is essentially meaningless. – ShadowRanger Jul 06 '22 at 01:37
  • Sidenote: why the extra block around your for loops? – Chris Jul 06 '22 at 01:37
  • 1
    What happens if you change the declarations to `output[N] = { 0 }` for all arrays similarly. – Kaz Jul 06 '22 at 01:37
  • 6
    My money's on denormalization, I'm calling it now. – Blindy Jul 06 '22 at 01:37
  • Also, side-note: `float` has *terrible* precision. You're likely losing precision to `float time = (float)duration / 1000000;` that `double time = duration / 1000000.0;` would avoid. – ShadowRanger Jul 06 '22 at 01:38
  • Your second code has undefined behaviour, since the arrays are uninitialised (accessing the value of uninitialised variable causes undefined behaviour). In the presence of undefined behaviour, compilers often take liberties such as removing code in ways that are counter-intuitive to programmers - in this case, that can result in eliminating one or more of your nested loops. – Peter Jul 06 '22 at 01:40
  • Just tested on TIO, with `-O3 -flto`. With globals, it takes 0.1 seconds, without them, it takes 0 seconds (likely the optimizer realized it did nothing and/or involved undefined behavior and removed the pointless code entirely). – ShadowRanger Jul 06 '22 at 01:43
  • You are not compiling with optimizations enabled. The compiler should be completely removing the loops in the second snippet if you had them enabled. Without enabling optimizations any benchmark is pointless. So I am voting to close as non-reproducible. – user17732522 Jul 06 '22 at 01:44
  • I'm not using any optimization flags, just `g++ -o`. – Moises Cavcalcanti Jul 06 '22 at 01:45
  • 1
    Please draw your attention to the upvoted comments about static versus local initialization, and the one about denormalized floats. If you create a [fair test](https://godbolt.org/z/3xbfE638G), with optimizations enabled, the arrays declared `volatile` and explicitly zero-initialized, then the execution of each program is about the same. – paddy Jul 06 '22 at 01:47
  • Why should the compiler remove my loops if I use the optimization flag? I need code with the minimum of compiler optimization, because the goal is to see improvements in cache usage. If I turn the optimization flags on, the code is a lot different from what I wrote, so I can't see my cache improvements. – Moises Cavcalcanti Jul 06 '22 at 01:47
  • 1
    @MoisesCavcalcanti: Because the timed code: 1) Has no observable side-effects, so the compiler can choose to omit it (aside from performance, it behaves the same either way), and 2) In the case of locals, since you never initialize them, it's undefined behavior to read from them, so the compiler is allowed to do anything it likes, including make demons fly out of your nose. Don't invoke nasal demons. On item #1, you can make some pointless computation involving all of the outputs and emit it to the screen (so the math means something), and on #2, you can add `= {0}` for each local array. – ShadowRanger Jul 06 '22 at 01:50
  • Optimization is optimizing *away* all the work, because you don't use the results of `output[]` (print or assign to a `volatile`). e.g. after the timed region, `volatile double sink = output[argc];` - the compiler won't know which element you're reading, so it has to make sure they're all computed. See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) re: compiler optimization – Peter Cordes Jul 06 '22 at 01:55
  • I understand all the comments, but after some of the testing suggested here, I identified that the slow code is in fact the one with uninitialized arrays. If I put some values in them, the speed is the same as with the global arrays. So the problem is not with the optimization flags. – Moises Cavcalcanti Jul 06 '22 at 01:57
  • "_is just some tests to study the use of the cache, moving, removing or merge arrays._": What exactly are you referring to? – user17732522 Jul 06 '22 at 01:57
  • The flags are fixing the slower code. But the point is that I'm searching for exactly what was fixed. The point is to understand computer architecture and compilation. – Moises Cavcalcanti Jul 06 '22 at 01:58
  • @user17732522 I'm a master's student in computer science, and one of the exercises in the computer architecture course is to create a program, apply some basic cache-memory-usage improvements, and write some text explaining why it improves the code. – Moises Cavcalcanti Jul 06 '22 at 02:01
  • @MoisesCavcalcanti: I'm going to say this one last time: If you perform performance testing without optimizations enabled, your tests mean nothing, because *no* real world performance-sensitive code is compiled without it. Whether two versions of the code happen to perform the same with optimizations off tells you nothing about how they'll perform with optimizations on, and vice-versa. If you compare the performance of, say, C-style arrays and `std::vector` without optimizations, you'll think `std::vector` *sucks*. It does not, but w/o optimizations, zero-cost abstractions are not zero-cost. – ShadowRanger Jul 06 '22 at 02:01
  • @ShadowRanger I guess maybe you didn't get the point. I'm not interested in finding the fastest code. In your example, I don't care how many kilograms are attached to the athletes, because I'm studying how they think, not how they move. I'm trying to find where this discrepancy is and what the optimization flags are fixing. – Moises Cavcalcanti Jul 06 '22 at 02:04
  • 1
    @selbie: I was about to add some more duplicates like [What exactly is the value contained in an uninitialized local variable in C?](https://stackoverflow.com/q/52616928) to go with [Why does changing 0.1f to 0 slow down performance by 10x?](https://stackoverflow.com/q/9314534) about subnormals being slow. That's almost certainly what's going on here. But you reopened the question before I could edit the duplicate list. – Peter Cordes Jul 06 '22 at 02:05
  • @ShadowRanger I'm going to say this as many times as necessary: I want to know what's happening under the hood. What is the optimization flag changing in my code? Why does initializing the arrays also fix the problem? The point is not to make real-world code; it's to understand computer architecture. – Moises Cavcalcanti Jul 06 '22 at 02:06
  • 1
    The contents of `errors` and `values` are **undefined** when declared in the local scope. Whereas at global scope they have been initialized to zero. When the program runs, there's a good chance the contents of the arrays are holding INVALID floating point values and trigger FP exceptions. That's just a guess, but I'm with @Kaz. I'd like to know what happens when these arrays are properly zero'd out. – selbie Jul 06 '22 at 02:08
  • 2
    Uninitialized memory can contain values with any bit pattern, including poorly-formed float values that are in fact [denormalized](https://stackoverflow.com/q/14001910/1553090). These are very expensive when it comes to math operations, and can tank performance. Regarding the use of optimizations throwing away your loops, that has already been explained in comments. If you want to force it, then declare the arrays as volatile, or perform some calculation on the arrays after testing to produce an actual value that you output. Either should cause the compiler to leave the loops in. (A short demo of the subnormal penalty is sketched after this thread.) – paddy Jul 06 '22 at 02:09
  • 4
    @MoisesCavcalcanti: If you don't want the compiler to do major things like auto-vectorize, use `gcc -Og` to get minimal optimizations that let it keep variables in registers. Or `-O2 -fno-tree-vectorize` for most optimizations but not SIMD. Benchmarking without optimization at all is *not* a uniform slowdown for all code, some code will suffer more than others (depending on whether it bottlenecks on store/reload latency of loop variables, for example). See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Jul 06 '22 at 02:09
  • 1
    In this case it shouldn't distort things too much, because you're still just using a C array either way, but it will reduce the difference between good and bad cache access patterns, or between the matrix fitting in L1d or not. – Peter Cordes Jul 06 '22 at 02:09
  • @PeterCordes - I think we can close with the dupe you suggested that has the answer about denormalization. But would you mind if we leave the question open for a bit longer? – selbie Jul 06 '22 at 02:12
  • @selbie Thank you! I think we're on the right track. I'm going to read your links. – Moises Cavcalcanti Jul 06 '22 at 02:12
  • @PeterCordes Thank you too! You and selbie are helping me a lot! I really appreciate all the links and articles you send, so I can take a look and learn more. Thanks also for the flags; I'm going to run more tests. – Moises Cavcalcanti Jul 06 '22 at 02:15
  • 3
    If you want to know what's happening under the hood, look at the generated asm (e.g. on https://godbolt.org/z/KW1fd9hjM with a `volatile double sink = output[argc];` so I can use `-O2` without optimizing it away). Or single-step it with a debugger locally, and look at some of the array values. This is a very simple loop with plain C arrays, so it should be possible to follow scalar asm. – Peter Cordes Jul 06 '22 at 02:16
  • 1
    @selbie: I can't vote to close again, but if you want to wait to come back to it later, sure. There's like 4 different problems here: uninitialized locals giving subnormals probably; subnormals being slow (unlike NaN or infinities, which are fast in modern FPUs); and benchmarking with optimization disabled. Of course, optimization disabled means that later code doesn't "see" that arrays were uninitialized, so you basically get a well-defined snapshot of garbage on the stack; GCC's default `-O0` doesn't optimize across statements. And 4th, how to make a benchmark that doesn't opt away. – Peter Cordes Jul 06 '22 at 02:20
  • 3
    If you don't initialize the local variables, then you're not comparing apples to apples. The global-variable version runs on predictable inputs of all zeros; the local-variable version runs on garbage. As a basic principle, you should almost never compare the performance of two implementations of a program on different inputs, before we even consider other effects like undefined behavior or optimization. – Kaz Jul 06 '22 at 08:27
  • @Kaz Thank you! This is also helping me to understand everything better. – Moises Cavcalcanti Jul 06 '22 at 12:19
  • @paddy Sorry! I didn't see your comment; you are also right. I'm now putting all the pieces together with the other comments. – Moises Cavcalcanti Jul 06 '22 at 12:33
  • 1
    In your USE_GLOBAL version, you're reading never-written memory from the BSS, so all those pages are still copy-on-write mapped to a single shared page of zeros, unlike the locals where the compiler makes code that actually writes that memory, so each input array (of 80000 bytes each) occupies separate physical pages. 3x 80kB is a total of 234 KiB, so larger than L1d but comfortably fits in L2 on Skylake-server and maybe the Epyc instances Godbolt's AWS setup sometimes uses, which you're benchmarking on. 5.35 s vs. 5.41 s is a pretty small margin for a noisy environment like cloud VMs. (See the page-dirtying sketch after this thread.) – Peter Cordes Jul 06 '22 at 13:44
  • 1
    It doesn't help that your inner loop has such nasty bottlenecks other than cache misses, though. You used `volatile output[]` and put `output[i] += stuff` inside your inner-most loop (instead of using a temporary the compiler could keep in a register), so that extra store/reload latency is part of the loop. Although it might still bottleneck on FP-divide throughput since you put a ridiculously slow operation like `/ i` inside the same inner loop, but as part of the independent input to each add. So at least you don't have divide *latency*, just throughput, but still... (A rewrite along these lines is sketched after this thread.) – Peter Cordes Jul 06 '22 at 13:46
  • 1
    See [Why is iterating though \`std::vector\` faster than iterating though \`std::array\`?](https://stackoverflow.com/q/57125253) re: lazily allocated memory that reads as zero but was never written (like your globals), vs. memory that was written by user-space (like your `= {}` stack arrays.) – Peter Cordes Jul 06 '22 at 13:55
  • 1
    Another *possible* effect could be due to the different machine-code size of `movsd xmm0, values[0+rdx*8]` vs. `movsd xmm0, [rsp+80016+rdx*8]` and so on in the inner loop (different addressing modes); the diff view on Godbolt shows that's about all: https://godbolt.org/z/Y7WbGPn5E. On Skylake specifically, I'd try [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) in case it made a difference. – Peter Cordes Jul 06 '22 at 13:59
  • 1
    I don't expect a front-end bottleneck to be a factor, but 4c throughput for `divsd` means it could be in this loop with 11 uops, if not for the store/reload bottleneck introduced by updating the `volatile` output element inside the inner loop. I think `volatile` on the inputs is also preventing GCC from folding a load into a memory source operand for `addsd`, costing extra front-end uops. (Which isn't a bottleneck, but takes more ROB space so reduces how far out-of-order exec can see: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) – Peter Cordes Jul 06 '22 at 14:01
  • @PeterCordes That's exactly what I'm studying now: how to make small changes to create more cache-friendly code. I need a bad, common piece of code, and then I make the changes to improve speed just by improving cache use. – Moises Cavcalcanti Jul 06 '22 at 17:18
  • 1
    If you want it to be able to speed up with a better cache access pattern, don't create other throughput bottlenecks that are as much of a bottleneck as L1d cache misses that hit in L2 (or maybe L3) – Peter Cordes Jul 06 '22 at 22:40
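
Following up on the subnormal/denormal comments from Blindy, paddy, and Peter Cordes above (the demo paddy's comment points to): here's a minimal sketch to see the penalty in isolation. This is my own test, not code from the thread; `DBL_MIN / 2` is just a convenient subnormal constant. It times the same loop over zeros and over subnormals:

#include <cfloat>
#include <chrono>
#include <iostream>

// Time a simple multiply-accumulate over an array filled with `fill`.
static double bench(double fill)
{
    static double a[100000];
    for (double &x : a) x = fill;

    auto t1 = std::chrono::high_resolution_clock::now();
    double sum = 0.0;
    for (int rep = 0; rep < 100; rep++)
        for (double x : a)
            sum += x * 0.5;   // subnormal * 0.5 stays subnormal
    auto t2 = std::chrono::high_resolution_clock::now();

    volatile double sink = sum;  // keep the work observable
    (void)sink;
    return std::chrono::duration<double>(t2 - t1).count();
}

int main()
{
    std::cout << "zeros:      " << bench(0.0) << " s\n";
    std::cout << "subnormals: " << bench(DBL_MIN / 2) << " s\n";
}

On typical x86, the subnormal run should be several times slower even at -O2; building with -ffast-math (which enables flush-to-zero on x86) should make the gap disappear, which is a quick way to confirm subnormals are the culprit.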
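
On Peter Cordes' copy-on-write point (the page-dirtying sketch mentioned above): untouched BSS arrays can all be backed by the same shared physical page of zeros, so the global version may touch far less physical memory than zero-initialized locals. A sketch of a fairer comparison, assuming a typical virtual-memory setup; compile with and without -DDIRTY_PAGES to compare (the effect may need a larger N to show up clearly):

#include <chrono>
#include <iostream>

#define N 10000

volatile double output[N], values[N], error[N];  // BSS: zero, never written

int main()
{
#ifdef DIRTY_PAGES
    // Writing every element forces each array onto its own physical
    // pages, matching what = {} stack arrays get. Without this, reads
    // below may all hit the same shared zero page.
    for (int i = 0; i < N; i++) {
        output[i] = 0.0;
        values[i] = 0.0;
        error[i] = 0.0;
    }
#endif

    auto t1 = std::chrono::high_resolution_clock::now();
    double acc = 0.0;
    for (int rep = 0; rep < 100; rep++)
        for (int j = 0; j < N; j++)
            acc += values[j] + error[j];  // volatile reads keep the loop
    auto t2 = std::chrono::high_resolution_clock::now();

    volatile double sink = acc;
    (void)sink;
    std::cout << std::chrono::duration<double>(t2 - t1).count()
              << " seconds" << std::endl;
}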
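
And on the store/reload bottleneck in the inner loop (the rewrite referenced in Peter Cordes' comment above): keeping the running sum in a plain local removes the `volatile output[i] +=` from the innermost loop, while the volatile input reads still keep the work alive. A sketch of the rewritten kernel:

#include <chrono>
#include <iostream>

#define TAM 10
#define N 10000

volatile double output[N] = {}, values[N] = {}, error[N] = {};

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int total = 0; total < TAM; total++) {
        for (int i = 0; i < N; i++) {
            double acc = output[i];   // one volatile read per i
            for (int j = 0; j < N; j++)
                acc += (values[j] + error[j]) / i + 1;  // as in the original
            output[i] = acc;          // one volatile write per i
        }
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count()
              << " seconds" << std::endl;
}

This keeps acc in a register, so the inner loop bottlenecks on the loads and the divide rather than on store-forwarding latency through a volatile array element.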

0 Answers