4

I have the following snippet, which sums all the elements of an array (the size is hardcoded to 32):

static unsafe int F(int* a) 
{
    Vector256<int> ymm0 = Avx2.LoadVector256(a + 0);
    Vector256<int> ymm1 = Avx2.LoadVector256(a + 8);
    Vector256<int> ymm2 = Avx2.LoadVector256(a + 16);
    Vector256<int> ymm3 = Avx2.LoadVector256(a + 24);

    ymm0 = Avx2.Add(ymm0, ymm1);
    ymm2 = Avx2.Add(ymm2, ymm3);

    ymm0 = Avx2.Add(ymm0, ymm2);

    const int s = 256 / 32;
    int*      t = stackalloc int[s];

    Avx2.Store(t, ymm0);

    int r = 0;
    for (int i = 0; i < s; ++i)
        r += t[i];

    return r;
}

This generates the following ASM:

Program.F(Int32*)
    L0000: sub rsp, 0x28
    L0004: vzeroupper                       ; Question #1
    L0007: vxorps xmm4, xmm4, xmm4
    L000b: vmovdqa [rsp], xmm4              ; Question #2
    L0010: vmovdqa [rsp+0x10], xmm4         ; Question #2
    L0016: xor eax, eax                     ; Question #3
    L0018: mov [rsp+0x20], rax
    L001d: mov rax, 0x7d847bd1f9ce          ; Question #4
    L0027: mov [rsp+0x20], rax
    L002c: vmovdqu ymm0, [rcx]
    L0030: vmovdqu ymm1, [rcx+0x20]
    L0035: vmovdqu ymm2, [rcx+0x40]
    L003a: vmovdqu ymm3, [rcx+0x60]
    L003f: vpaddd ymm0, ymm0, ymm1
    L0043: vpaddd ymm2, ymm2, ymm3
    L0047: vpaddd ymm0, ymm0, ymm2
    L004b: lea rax, [rsp]                   ; Question #5
    L004f: vmovdqu [rax], ymm0
    L0053: xor edx, edx                     ; Question #5
    L0055: xor ecx, ecx                     ; Question #5
    L0057: movsxd r8, ecx
    L005a: add edx, [rax+r8*4]
    L005e: inc ecx
    L0060: cmp ecx, 8
    L0063: jl short L0057
    L0065: mov eax, edx
    L0067: mov rcx, 0x7d847bd1f9ce          ; Question #4
    L0071: cmp [rsp+0x20], rcx
    L0076: je short L007d
    L0078: call 0x00007ffc9de2d430          ; Question #6
    L007d: nop
    L007e: vzeroupper
    L0081: add rsp, 0x28
    L0085: ret

Questions

  • Why do we need VZEROUPPER at the beginning? Wouldn't it be perfectly fine without it?
  • What do the VMOVDQAs at the beginning do? Or rather, why are they there?
  • Why zero out the EAX register? It's probably related to the next line, MOV [RSP+0x20], RAX, but I still can't understand it.
  • What does this mysterious value (0x7d847bd1f9ce) do?
  • There are also lines in between whose purpose I can't understand (see the "Question #5" comments in the code).
  • I'm assuming this line (L0078: call 0x00007ffc9de2d430) throws an exception. Is there something in my code that can throw an exception?

I know there are a lot of questions, but I can't separate them because I think they're related to each other. TO BE CRYSTAL CLEAR: I'm just trying to understand the generated ASM here. I'm not a professional in this area.

Note

  • In case you're wondering what GCC (O2) generates, here is the result:
int32_t
f(int32_t *a) {
        __m256i ymm0;
        __m256i ymm1;
        __m256i ymm2;
        __m256i ymm3;

        ymm0 = _mm256_load_si256((__m256i*)(a + 0));
        ymm1 = _mm256_load_si256((__m256i*)(a + 8));
        ymm2 = _mm256_load_si256((__m256i*)(a + 16));
        ymm3 = _mm256_load_si256((__m256i*)(a + 24));
           
        ymm0 = _mm256_add_epi32(ymm0, ymm1);
        ymm2 = _mm256_add_epi32(ymm2, ymm3);

        ymm0 = _mm256_add_epi32(ymm0, ymm2);

        int32_t t[8];
        _mm256_store_si256((__m256i*)t, ymm0);

        int32_t r;
        r = 0;
        for (int i = 0; i < 8; ++i)
                r += t[i];

        return r;
}

And the generated ASM:

f:
  push rbp
  xor r8d, r8d
  mov rbp, rsp
  and rsp, -32
  lea rax, [rsp-32]
  mov rdx, rsp
  vmovdqa ymm1, YMMWORD PTR [rdi+96]
  vpaddd ymm0, ymm1, YMMWORD PTR [rdi+64]
  vpaddd ymm0, ymm0, YMMWORD PTR [rdi+32]
  vpaddd ymm0, ymm0, YMMWORD PTR [rdi]
  vmovdqa YMMWORD PTR [rsp-32], ymm0
.L2:
  add r8d, DWORD PTR [rax]
  add rax, 4
  cmp rax, rdx
  jne .L2
  mov eax, r8d
  vzeroupper
  leave
  ret

I think it optimized (maybe heavily) my code here, but whatever.

  • I would suggest creating an issue on the JIT GitHub repository, since this is pretty obviously the JIT compiler failing to optimize code as much as it could. The SIMD stuff is pretty new after all, and optimizations are generally driven by community needs. – Blindy Apr 22 '21 at 18:47
  • @Blindy the thing is that I don't know if they are "wrong" or not. That's why I'm asking. I want to understand if they're "wrong" or not. AND because I'm not an expert here, the chances are pretty high that I'm wrong and they (the instructions) have a meaning. –  Apr 22 '21 at 18:48
  • I second bringing it to the attention of the JIT Github repo maintainers. They'll be most qualified to determine if it's correct or incorrect, and can explain why better than we probably can. –  Apr 22 '21 at 18:59
  • @aepot why do I have a bounds check when I'm using a pointer and not a managed array? Maybe I misunderstood your comment. If I'm not wrong, C# usually checks the value and jumps to the location, but here we see something different, not jumps. –  Apr 22 '21 at 19:09
  • `t[i]` is accessing an address by index in the allocated area; that's why you have a bounds check. Try `r += *(t + i)` – aepot Apr 22 '21 at 20:03
  • @aepot didn't help. [SharpLab](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgCUBXAOwwEt8YaBJFqVp3KzC4A3AFgAUGUq1GLdpx4Y+AobhoANABxJxEycQDMlAEwyA7OUkBvSeXtWpVJOWa5sAMxjl+GcgDEACl8AKnJsAEpHWwkHcgA1GDAMaBMAViQAHl8APnIAT3x8CgBecgBBADcEExoAGQhsABNE5NSMwOxyAGpyUgi9ONaUqHSs3IKiqnIyqpr6xpakkbHOnvItAbsHYfbxljzC/DNZ6tqG5t3Rjq7e5y3YneW97IPJ/GNT+YultuukNa9EwoB7bexHUoVM40cpNJqBCFod5UB5xI4nKHzWHw9FIo6GUGPcFFSFzWrYhEkvFFEyEuKQAR+XzkXAzchjcgAenIhhMgwcoTi9j8ZVwGGwYAA1tgADYyiBgHwsADauAAunowZjagBlEYwQIYanFOkCljkKBs0j8+weaDkYLm1hWkQ+ciZFmu7rdVgRLVxS3dMohQ3rX2aomUSxQPQAXy1CYksaAA=) –  Apr 22 '21 at 20:05
  • @Hrant maybe a stack upper-bound check then, as described in the accepted answer. – aepot Apr 22 '21 at 20:15
  • @aepot I had to read the accepted answer multiple times. Yeah, it might be. –  Apr 22 '21 at 20:23
  • 1
    For info about performance impacts of `vzeroupper` see https://stackoverflow.com/questions/49019614/is-it-useful-to-use-vzeroupper-if-your-programlibraries-contain-no-sse-instruct and https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake – Nate Eldredge Apr 23 '21 at 00:06
  • 1
    Note that in your C version, GCC is saving you from yourself by aligning `int32_t t[8];` even though you forgot `alignas(32)`. It sees you using a 32-byte store and chooses to over-align the destination for performance (even if you use `storeu` https://godbolt.org/z/n8M6Ycche). Since you used `_mm256_store_si256` (32-byte alignment required like `vmovdqa`), GCC's choice to align the array happens to prevent possible segfaults. Also note that with `-O3`, GCC will auto-vectorize the hsum instead of actually storing / reloading and not touch the stack at all. – Peter Cordes Apr 23 '21 at 02:08

3 Answers

4

Why do we need VZEROUPPER at the beginning. Wouldn't it be perfectly fine without it?

Inserting vzeroupper at the beginning may be a workaround for a library or other third-party code that is known to forget to clean its uppers (to protect SSE code). But you're not using SSE code here, only AVX code, so yes, it's not needed at the beginning.

Your code uses VEX-encoded instructions (the v prefix), which means it will not encounter the "false dependency" (transition penalty) problem (Why is this SSE code 6 times slower without VZEROUPPER on Skylake?). On top of that, you use ymm vectors immediately (entering the dirty-upper state), which means any reasoning about power management/frequency scaling doesn't apply here either (Dynamically determining where a rogue AVX-512 instruction is executing mentions a forgotten vzeroupper causing reduced frequency for an entire app).

What do the VMOVDQAs do in the beginning. Or rather why are they there?

L0007: vxorps xmm4, xmm4, xmm4
L000b: vmovdqa [rsp], xmm4              ; Question #2
L0010: vmovdqa [rsp+0x10], xmm4         ; Question #2

Why is it zeroing out memory that you're going to fully overwrite? My guess is that the compiler does not fully compute the write coverage of your code, so it doesn't know you will overwrite all of it. So it zeroes the memory just in case.

Zeroing out the EAX register? Why? Probably related to next line MOV [RSP+0x20], RAX, but still can't understand.

L0016: xor eax, eax                     ; Question #3
L0018: mov [rsp+0x20], rax
L001d: mov rax, 0x7d847bd1f9ce          ; Question #4
L0027: mov [rsp+0x20], rax

So it writes a 64-bit zero at address rsp+0x20 and then overwrites the same memory region with a stack canary. Why does it need to write a zero there first? I don't know; it looks like a missed optimization.

What does this mysterious value (0x7d847bd1f9ce) do? I'm assuming this line (L0078: call 0x00007ffc9de2d430) throws an exception. Is there a function or something in my code that can throw an exception?

As already mentioned it's the stack canary to detect buffer overrun.

"The use of stackalloc automatically enables buffer overrun detection features in the common language runtime (CLR). If a buffer overrun is detected, the process is terminated as quickly as possible to minimize the chance that malicious code is executed" - quote from https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/stackalloc

It writes a value that it knows to the end of the stack buffer, then executes your loop, then checks whether that value changed (if it did, your loop wrote out of bounds). Note that this is a huge stack canary. I'm not sure why it has to be 64-bit; unless there is a good reason for that, I would consider it a missed optimization. It's large in code size and in the uop cache, and it forces the compiler to emit more instructions (it always has to use mov, since a 64-bit constant can't be an immediate operand of any other instruction, such as cmp or a store mov).
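To make the scheme concrete, here is a rough C sketch of the idea (the struct layout, the reuse of the constant, and the `abort()` call are all illustrative, not what the JIT actually emits):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative canary value; the real JIT picks its own. */
#define CANARY 0x7d847bd1f9ceULL

int sum_with_canary(const int *src)
{
    struct {
        int buf[8];        /* the stackalloc'd buffer                */
        uint64_t cookie;   /* lives just past it, like [rsp+0x20]    */
    } frame;
    frame.cookie = CANARY; /* mov rax, 0x7d84...; mov [rsp+0x20], rax */

    int r = 0;
    for (int i = 0; i < 8; ++i) {   /* the user's loop */
        frame.buf[i] = src[i];
        r += frame.buf[i];
    }

    if (frame.cookie != CANARY)     /* cmp [rsp+0x20], rcx; je ...  */
        abort();                    /* analogous to the call at L0078 */
    return r;
}
```

An out-of-bounds write past `buf` would clobber `cookie`, and the final compare would catch it before the function returns.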

Also, a note on canary-checking code

L0071: cmp [rsp+0x20], rcx
L0076: je short L007d
L0078: call 0x00007ffc9de2d430          ; Question #6
L007d: nop

The fall-through path should be the most likely taken path. Here the fall-through path is the "throw exception" path, which shouldn't be the normal case, so this may be another missed optimization. How it could affect performance: if this code is not in branch history, it'll suffer a branch miss; if it's predicted correctly, it'll be fine. There is also an indirect effect: taken branches occupy space in the branch-predictor history, so if this branch were never taken, it would be cheaper.

There are also lines in between which I can not understand why are they needed (see "Question #5" comments in the code).

L004b: lea rax, [rsp]                   ; Question #5
L004f: vmovdqu [rax], ymm0
L0053: xor edx, edx                     ; Question #5
L0055: xor ecx, ecx                     ; Question #5

The LEA is not needed here. My guess is that it's related to how the compiler does register allocation/stack management, so it's just a quirk of the compiler (rsp can't be allocated like a normal register; it's always used as the stack pointer, so it has to be treated specially).

Zeroing edx: it's used as the accumulator for the final result. Zeroing ecx: it's used as the counter in the loop that follows.
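As a scalar model (plain C, just an illustration, not the JIT's actual output), that tail loop is the index form below; a pointer-increment form like GCC's .L2 loop avoids the per-iteration movsxd sign extension:

```c
#include <assert.h>

/* Index form, mirroring the JIT's loop: edx = r, ecx = i. */
int sum_indexed(const int *t)
{
    int r = 0;                      /* xor edx, edx                        */
    for (int i = 0; i < 8; ++i)     /* xor ecx, ecx; ...; cmp ecx, 8; jl   */
        r += t[i];                  /* movsxd r8, ecx; add edx, [rax+r8*4] */
    return r;
}

/* Pointer-increment form, like GCC's .L2 loop: no sign extension needed. */
int sum_pointer(const int *t)
{
    int r = 0;
    for (const int *p = t; p != t + 8; ++p)
        r += *p;
    return r;
}
```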


About the horizontal sum at the end.

In general, when storing to and then reading from the same location with a different offset/size, you need to check the store-forwarding rules for your target CPU to avoid a penalty (you can find those at https://www.agner.org/optimize/#manuals; Intel and AMD list the rules in their guides as well). If you're targeting modern CPUs (Skylake/Zen), you shouldn't suffer a store-forwarding stall in your case, but there are still faster ways to sum up a vector horizontally. (This has the bonus of avoiding the missed optimizations around the stack buffer.)

Check out this nice writeup on good ways to sum a vector horizontally: https://stackoverflow.com/a/35270026/899255. You can also check out how a compiler does it: https://godbolt.org/z/q74abrqzh (GCC at -O3).
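The idea behind those faster horizontal sums, sketched in scalar C (an illustration of the log2(N) reduction, not the SIMD code itself):

```c
#include <assert.h>

/* Scalar model of a log2(N) horizontal sum: repeatedly fold the upper
 * half of the vector onto the lower half until one element remains.
 * A SIMD version does the same folds with vextracti128 / vpshufd /
 * vpunpckhqdq instead of inner loops. */
int hsum8(const int v[8])
{
    int t[8];
    for (int i = 0; i < 8; ++i)
        t[i] = v[i];
    for (int half = 4; half >= 1; half /= 2)   /* widths 4, 2, 1 */
        for (int i = 0; i < half; ++i)
            t[i] += t[i + half];
    return t[0];
}
```

Three folds instead of seven serial adds, which is why the SIMD reduction wins even before counting the avoided store/reload.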

stepan
  • Agreed, `vzeroupper` *before* 256-bit instructions is only ever useful if you've already hit one transition penalty and are in [the preserved-dirty-uppers state](https://stackoverflow.com/a/41349852). (On Haswell or IceLake, not Skylake where the only kind of penalty is false-dependencies on legacy-SSE instructions). And even then, executing a YMM instruction to transition back into the dirty uppers state might be just as bad / good; unless vzeroupper gets you out of that state more cheaply? C++ compilers only put vzeroupper *after* YMM use – Peter Cordes Apr 23 '21 at 01:44
  • Oh, interesting, C# JIT can use immediate stack canaries and still randomize, unlike AoT compilers where that would expose the canary in the executable. (It has to be secret or else a buffer overflow could potentially include the right value, especially if it didn't include a `00` byte.) C++ compilers for x86-64 use thread-local storage with a qword randomized on process startup. It's 64-bit to make it harder to randomly / brute-force guess, I'd assume. (In a case where a service was restarted after every crash / abort, brute force could keep trying until you match the stack cookie.) – Peter Cordes Apr 23 '21 at 01:50
  • `LEA rax, [rsp]` - yeah, that seems like another JIT missed-optimization for `int* t = stackalloc int[s];`. It insists on actually having a pointer in a register, instead of addressing the stack memory relative to RSP. (Although it zeroed the memory relative to RSP, but I guess that was just part of the stack frame, not the array.) And it misses the `mov rax, rsp` peephole, and doesn't convert the array index to pointer increment. It even pointlessly does `movsxd` sign extension for `i` every iteration, apparently not even trying to prove that it's always non-negative. – Peter Cordes Apr 23 '21 at 02:00
  • 1
    GCC -O3 will even auto-vectorize the OP's hsum https://godbolt.org/z/n8M6Ycche, you could have used that as an example instead of an arbitrary-sized array sum. (I noticed while commenting on the question about the C missing an `alignas(32)` to make `store` instead of `storeu` safe if GCC hadn't chosen to align more than the standard 16 bytes for arrays of 16 bytes or larger (x86-64 SysV). – Peter Cordes Apr 23 '21 at 02:10
1

@stepan explained the RyuJIT-generated code quite well, but I thought I would address the question of why the GCC code is so different and why RyuJIT missed so many potential optimizations.

The short answer is that being Just In Time, RyuJIT has a very limited time budget in which to optimize, so it optimizes for frequently-used patterns. In your case the JIT may be taking your code a bit too literally, while GCC is able to capture your intent a bit better.

The stack canary code can be eliminated simply by removing the stackalloc and using a Vector256<T> local instead. Additionally, the loop over the stack values is missing a few optimizations, like your i variable being sign-extended on each iteration. This version of your method resolves both of those issues by helping the JIT out with things it knows how to optimize.

static unsafe int F(int* a) 
{
    Vector256<int> ymm0 = Avx.LoadVector256(a + 0);
    Vector256<int> ymm1 = Avx.LoadVector256(a + 8);
    Vector256<int> ymm2 = Avx.LoadVector256(a + 16);
    Vector256<int> ymm3 = Avx.LoadVector256(a + 24);

    ymm0 = Avx2.Add(ymm0, ymm1);
    ymm2 = Avx2.Add(ymm2, ymm3);

    ymm0 = Avx2.Add(ymm0, ymm2);

    // This address-taken local will be forced to the stack
    Vector256<int> ymm4 = ymm0;
    int* t = (int*)&ymm4;

    // RyuJIT unrolls loops of Vector<T>.Count,
    // Vector128<T>.Count, and Vector256<T>.Count
    int r = 0;
    for (int i = 0; i < Vector256<int>.Count; ++i)
        r += *(t + i);

    return r;
}

compiles to:

Program.F(Int32*)
    L0000: sub rsp, 0x38
    L0004: vzeroupper
    L0007: vmovdqu ymm0, [rcx]
    L000b: vmovdqu ymm1, [rcx+0x20]
    L0010: vmovdqu ymm2, [rcx+0x40]
    L0015: vmovdqu ymm3, [rcx+0x60]
    L001a: vpaddd ymm2, ymm2, ymm3
    L001e: vpaddd ymm0, ymm0, ymm1
    L0022: vpaddd ymm0, ymm0, ymm2
    L0026: vmovupd [rsp], ymm0        ; write to the stack with no zeroing/canary
    L002b: lea rax, [rsp]
    L002f: mov edx, [rax]             ; auto-unrolled loop
    L0031: add edx, [rax+4]
    L0034: add edx, [rax+8]
    L0037: add edx, [rax+0xc]
    L003a: add edx, [rax+0x10]
    L003d: add edx, [rax+0x14]
    L0040: add edx, [rax+0x18]
    L0043: add edx, [rax+0x1c]
    L0046: mov eax, edx
    L0048: vzeroupper
    L004b: add rsp, 0x38
    L004f: ret

Note that the stack zeroing, the stack canary write, check, and possible throw are all gone. And the loop is auto-unrolled, with more optimal scalar load/add code.

Beyond that, as other comments/answers have suggested, the spill to the stack and scalar adds are unnecessary, because you can use SIMD instructions to add horizontally. RyuJIT will not do this for you like GCC can, but if you are explicit, you can get optimal SIMD ASM.

static unsafe int F(int* a) 
{
    Vector256<int> ymm0 = Avx.LoadVector256(a + 0);
    Vector256<int> ymm1 = Avx.LoadVector256(a + 8);
    
    // The load can be contained in the add if you use the load
    // as an operand rather than declaring explicit locals
    ymm0 = Avx2.Add(ymm0, Avx.LoadVector256(a + 16));
    ymm1 = Avx2.Add(ymm1, Avx.LoadVector256(a + 24));

    ymm0 = Avx2.Add(ymm0, ymm1);
    
    // Add the upper 128-bit lane to the lower lane
    Vector128<int> xmm0 = Sse2.Add(ymm0.GetLower(), ymm0.GetUpper());
    
    // Add odd elements to even
    xmm0 = Sse2.Add(xmm0, Sse2.Shuffle(xmm0, 0b_11_11_01_01));
    
    // Add high half to low half
    xmm0 = Sse2.Add(xmm0, Sse2.UnpackHigh(xmm0.AsInt64(), xmm0.AsInt64()).AsInt32());
    
    // Extract low element
    return xmm0.ToScalar();
}

compiles to:

Program.F(Int32*)
    L0000: vzeroupper
    L0003: vmovdqu ymm0, [rcx]
    L0007: vmovdqu ymm1, [rcx+0x20]
    L000c: vpaddd ymm0, ymm0, [rcx+0x40]
    L0011: vpaddd ymm1, ymm1, [rcx+0x60]
    L0016: vpaddd ymm0, ymm0, ymm1
    L001a: vextracti128 xmm1, ymm0, 1
    L0020: vpaddd xmm0, xmm0, xmm1
    L0024: vpshufd xmm1, xmm0, 0xf5
    L0029: vpaddd xmm0, xmm0, xmm1
    L002d: vpunpckhqdq xmm1, xmm0, xmm0
    L0031: vpaddd xmm0, xmm0, xmm1
    L0035: vmovd eax, xmm0
    L0039: vzeroupper
    L003c: ret

which, aside from the overly-conservative vzerouppers, is the same as you'd get from an optimizing C/C++ compiler.

saucecontrol
  • `lea rax, [rsp]` is hilarious. It never makes sense to use LEA to copy a register, and the only advantage of copying RSP at all is to avoid the SIB byte in every addressing mode involving it. Also amusing is that RyuJIT is so short-sighted that it doesn't leave EAX free to accumulate into the return-value register. So with 4-byte `lea rax, [rsp]` + 2 byte `mov eax, edx`, there's only a net 2-byte saving vs. `mov eax, [rsp]` / `add eax, [rsp+4]` / ... Anyway, nice answer, [Fastest way to do horizontal SSE vector sum](https://stackoverflow.com/a/35270026) manually is clearly a win. – Peter Cordes Nov 14 '21 at 18:48
0

vzeroupper can help performance.

The lines from L0007 through L0018 zero out the storage space used by the local variables.

The 0x7d847bd1f9ce value appears to be related to detecting stack overruns. The function writes this check value, and when it's done it looks to see whether that value has changed. If it has, it calls a diagnostic function.

The function body starts at L002c. First it initializes your local ymm variables, then does the additions.

The lea at L004b is the allocation of t. The next instruction (L004f) is the Avx2.Store(t, ymm0); statement.

L0053 through L0063 is the for loop. rax already has the value of t, ecx holds i, and edx holds r.

From L0065 to the end we have the return statement and function epilog. The epilog checks to see if the stack has been clobbered, does some cleanup, and returns to the caller.

1201ProgramAlarm
  • Thank you for the answer. The link for `VZEROUPPER` says "it will eliminate performance penalties caused by false dependencies". Can you elaborate on that? What kind of `false dependencies` can occur? –  Apr 22 '21 at 19:38
  • Also, can you explain Question #2? The related lines are `L000b` and `L0010`. –  Apr 22 '21 at 19:39
  • `L0007` zeros out a 128 bit (16 byte) register which is used in `L000b` and `L0010` to zero out 32 bytes of local stack based storage. `rax` is used on the next two lines to zero out another 8 bytes. – 1201ProgramAlarm Apr 22 '21 at 19:42
  • Just a random question: is zeroing out necessary? I mean, I'll store the values there anyway, right? I guess it has something to do with `false dependencies`? –  Apr 22 '21 at 19:45
  • Zeroing out isn't strictly necessary here (since the stack variables are written to before they are read), but IIRC it is a part of how C# works. – 1201ProgramAlarm Apr 22 '21 at 19:54
  • 1
    @Hrant: [SSE code 6x slower without VZEROUPPER on Skylake?](//stackoverflow.com/a/41349852) explains the false-dependency effect *for legacy-SSE instructions*, which this code isn't using. C++ compilers normally use vzeroupper *after* using 256-bit AVX instructions, so call/ret happen in a state where it's safe to use legacy-SSE without transition or false-dep penalties. C# JIT might take a different approach, but `vzeroupper` *before* 256-bit instructions is only ever useful if you've already hit one transition penalty and are in the preserved-dirty-uppers state. (On HSW or ICL, not SKL) – Peter Cordes Apr 23 '21 at 01:37