
I was trying to understand the performance of a global static variable and came across a very weird scenario. The code below takes about 525 ms on average.

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i + s_Data;
    }

    return 0;
}

and the code below takes about 1050 ms on average.

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i;
    }

    return 0;
}

Based on my other tests, I am aware that reading static variables is fast and writing to them is slow, but I am not sure what piece of information I am missing in the above scenario. Note: compiler optimizations were turned off, and the MSVC compiler was used to perform the tests.

armques
  • Performance measurements with optimization off are meaningless. – Igor Tandetnik Dec 24 '22 at 06:27
  • `"compiler optimizations were turned off"` makes the result fairly meaningless. Not sure where you heard that reading and writing static variables was fast and slow either. Sounds like nonsense to me. – Retired Ninja Dec 24 '22 at 06:27
  • Well, using the O2 optimization give me expected result. Thanks. But still, why turning on/off the optimization plays an important role here? – armques Dec 24 '22 at 06:33
  • @RetiredNinja first comment on this post https://www.reddit.com/r/cpp_questions/comments/7h3az1/static_vs_dynamic_performance/ Also, I am not sure where but I read somewhere static variables are not stored in registers and it can make things slow. Please do correct if this is wrong. – armques Dec 24 '22 at 06:38
  • Well, if you get your programming knowledge from random reddit posts... – Retired Ninja Dec 24 '22 at 06:42
  • @armques with optimizations turned off there's a lot of differences between the generated code from the compiler. See this: https://godbolt.org/z/j86q5vz3d to get an idea of just how drastically optimizations can change your generated code. – Substitute Dec 24 '22 at 06:49
  • @RetiredNinja I get my programming knowledge from a lot of trial and error and random posts from anywhere on the internet if it is not on StackOverflow. How about helping me and pointing me to the right direction instead of judging me on where I am getting my knowledge from? Thanks – armques Dec 24 '22 at 06:51
  • Maybe on that poster's target and with their compiler arguments there was a noticeable advantage. And maybe the variables they were using were of a type where keeping the instance alive prevented significant construction costs. Rule of thumb is if you're worried about performance, always measure. You measured, which is a great start, but your configuration was not optimal for real benchmarking. – user4581301 Dec 24 '22 at 06:51
  • A note: Due to [Undefined Behaviour](https://en.cppreference.com/w/cpp/language/ub), trial and error is risky in C++. – user4581301 Dec 24 '22 at 06:52
  • The linked Reddit post is asking about "static vs dynamic variable", which seems to be about accessing heap allocated (i.e. obtained by `new` or `malloc`) and non heap allocated (i.e. globals and stack-allocated locals) memory. This is a different concept than global vs local variable. – kotatsuyaki Dec 24 '22 at 07:07
  • Neither program does meaningful computation: both could be simplified to `x = 999'999'999 * 1'000'000'000 / 2 /* + 1'000'000'000 * s_Data */`, and since `x` has no side effect, the computation could even be dropped entirely. – Jarod42 Dec 24 '22 at 09:01

1 Answer


To address the actual question: with optimizations turned off, we can turn to the generated assembly to get an idea of why one runs more quickly than the other.

For the first test, GCC (trunk) produces this assembly (https://godbolt.org/z/GdssT9vME):

s_Data:
        .quad   1
main:
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], 0
        mov     DWORD PTR [rbp-12], 0
        jmp     .L2
.L3:
        mov     eax, DWORD PTR [rbp-12]
        movsx   rdx, eax
        mov     rax, QWORD PTR s_Data[rip]
        add     rax, rdx
        add     QWORD PTR [rbp-8], rax
        add     DWORD PTR [rbp-12], 1
.L2:
        cmp     DWORD PTR [rbp-12], 999999999
        jle     .L3
        mov     eax, 0
        pop     rbp
        ret

For the second test (https://godbolt.org/z/5ndnEv5Ts) we get:

main:
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], 0
        mov     DWORD PTR [rbp-12], 0
        jmp     .L2
.L3:
        mov     eax, DWORD PTR [rbp-12]
        cdqe
        add     QWORD PTR [rbp-8], rax
        add     DWORD PTR [rbp-12], 1
.L2:
        cmp     DWORD PTR [rbp-12], 999999999
        jle     .L3
        mov     eax, 0
        pop     rbp
        ret

Comparing these two programs, the first is sixteen instructions while the second is only fourteen. More importantly, the first executes eight instructions per loop iteration versus six, and the extra work includes a load of `s_Data` from memory (`mov rax, QWORD PTR s_Data[rip]`) every iteration. (Different instructions also have different CPU-cycle costs; see: How many CPU cycles are needed for each assembly instruction?)

As noted in my comment, optimizations vastly change the generated assembly.
With `-O2`, both tests produce this (since `x` is never read, the entire loop is dead code and is eliminated):

main:
        xor     eax, eax
        ret
Substitute