What are these extra disassembly instructions when using SIMD intrinsics?

Question

I'm testing what sort of speedup I can get from using SIMD instructions with RyuJIT and I'm seeing some disassembly instructions that I don't expect. I'm basing the code on this blog post from the RyuJIT team's Kevin Frei, and a related post here. Here's the function:

static void AddPointwiseSimd(float[] a, float[] b) {
    int simdLength = Vector<float>.Count;
    int i = 0;
    for (i = 0; i < a.Length - simdLength; i += simdLength) {
        Vector<float> va = new Vector<float>(a, i);
        Vector<float> vb = new Vector<float>(b, i);
        va += vb;
        va.CopyTo(a, i);
    }
}

The section of disassembly I'm querying copies the array values into the Vector<float>. Most of the disassembly is similar to that in Kevin and Sasha's posts, but I've highlighted some extra instructions (along with my confused annotations) that don't appear in their disassemblies:

;// Vector<float> va = new Vector<float>(a, i);
  cmp eax,r8d              ; <-- Unexpected - Compare a.Length to i?
  jae 00007FFB17DB6D5F     ; <-- Unexpected - Jump to range check failure
  lea r10d,[rax+3] 
  cmp r10d,r8d 
  jae 00007FFB17DB6D5F 
  mov r11,rcx              ; <-- Unexpected - Extra register copy?
  movups xmm0,xmmword ptr [r11+rax*4+10h]

;// Vector<float> vb = new Vector<float>(b, i);
  cmp eax,r9d              ; <-- Unexpected - Compare b.Length to i?
  jae 00007FFB17DB6D5F     ; <-- Unexpected - Jump to range check failure
  cmp r10d,r9d 
  jae 00007FFB17DB6D5F 
  movups xmm1,xmmword ptr [rdx+rax*4+10h]

Note the loop range check is as expected:

;// for (i = 0; i < a.Length - simdLength; i += simdLength) {
  add eax,4  
  cmp r9d,eax  
  jg loop

so I don't know why there are extra comparisons to eax. Can anyone explain why I'm seeing these extra instructions, and whether it's possible to get rid of them?

In case it's related to the project settings, I've got a very similar project that shows the same issue here on GitHub (see FloatSimdProcessor.HwAcceleratedSumInPlace() or UShortSimdProcessor.HwAcceleratedSumInPlaceUnchecked()).

first link is 403, I found an [alternative posting here](https://learn.microsoft.com/en-us/archive/blogs/clrcodegeneration/quick-info-about-a-great-simd-writeup) (suggested edit queue is full, otherwise I'd edit the question myself) — Pikalek, Oct 03 '22 at 13:57

Hans Passant · Accepted Answer · 2016-02-11T16:25:36.877

I'll annotate the code generation that I see for a processor that supports AVX2, like Haswell. It can move 8 floats at a time:

00007FFA1ECD4E20  push        rsi
00007FFA1ECD4E21  sub         rsp,20h  

00007FFA1ECD4E25  xor         eax,eax                       ; i = 0
00007FFA1ECD4E27  mov         r8d,dword ptr [rcx+8]         ; a.Length
00007FFA1ECD4E2B  lea         r9d,[r8-8]                    ; a.Length - simdLength
00007FFA1ECD4E2F  test        r9d,r9d                       ; if (i >= a.Length - simdLength)
00007FFA1ECD4E32  jle         00007FFA1ECD4E75              ; then skip loop 

00007FFA1ECD4E34  mov         r10d,dword ptr [rdx+8]        ; b.Length
00007FFA1ECD4E38  cmp         eax,r8d                       ; if (i >= a.Length)
00007FFA1ECD4E3B  jae         00007FFA1ECD4E7B              ; then OutOfRangeException
00007FFA1ECD4E3D  lea         r11d,[rax+7]                  ; i+7
00007FFA1ECD4E41  cmp         r11d,r8d                      ; if (i+7 >= a.Length)
00007FFA1ECD4E44  jae         00007FFA1ECD4E7B              ; then OutOfRangeException

00007FFA1ECD4E46  mov         rsi,rcx                       ; move a[i..i+7]
00007FFA1ECD4E49  vmovupd     ymm0,ymmword ptr [rsi+rax*4+10h]  

00007FFA1ECD4E50  cmp         eax,r10d                      ; same as above 
00007FFA1ECD4E53  jae         00007FFA1ECD4E7B              ; but for b
00007FFA1ECD4E55  cmp         r11d,r10d  
00007FFA1ECD4E58  jae         00007FFA1ECD4E7B  
00007FFA1ECD4E5A  vmovupd     ymm1,ymmword ptr [rdx+rax*4+10h]  

00007FFA1ECD4E61  vaddps      ymm0,ymm0,ymm1                ; a[i..] + b[i...]
00007FFA1ECD4E66  vmovupd     ymmword ptr [rsi+rax*4+10h],ymm0  

00007FFA1ECD4E6D  add         eax,8                         ; i += 8
00007FFA1ECD4E70  cmp         r9d,eax                       ; if (i < a.Length - simdLength)
00007FFA1ECD4E73  jg          00007FFA1ECD4E38              ; then loop

00007FFA1ECD4E75  add         rsp,20h  
00007FFA1ECD4E79  pop         rsi  
00007FFA1ECD4E7A  ret

So the eax compares are those "pesky bound checks" that the blog post talks about. The blog post shows an optimized version that is not actually implemented (yet); the real code right now checks both the first and the last index of the 8 floats that are moved at the same time. The blog post's comment, "Hopefully, we'll get our bounds-check elimination work strengthened enough", describes a task that is still uncompleted :)
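To make the two checks concrete, here is a rough C# sketch of what the generated code does before each vector load. LoadChecked and BoundsCheckSketch are hypothetical names made up for illustration; this is not the actual Vector<T> constructor source, just the same logic spelled out, with the matching instructions from the listing above in the comments.

```csharp
using System;
using System.Numerics;

static class BoundsCheckSketch
{
    // Hypothetical helper, for illustration only: it repeats in C# the two
    // range checks the disassembly performs before each vector load.
    public static Vector<float> LoadChecked(float[] array, int index)
    {
        // Index of the last float the load touches (lea r11d,[rax+7] when Count == 8).
        int last = index + Vector<float>.Count - 1;

        // Unsigned compares mirror cmp/jae: a negative index wraps to a
        // huge unsigned value, so it fails the same test.
        if ((uint)index >= (uint)array.Length)   // cmp eax,r8d  / jae
            throw new IndexOutOfRangeException();
        if ((uint)last >= (uint)array.Length)    // cmp r11d,r8d / jae
            throw new IndexOutOfRangeException();

        // The load itself (movups / vmovupd in the listings).
        return new Vector<float>(array, index);
    }
}
```

Calling BoundsCheckSketch.LoadChecked(a, 0) on an array of at least Vector<float>.Count elements returns the first vector, while an index too close to the end of the array fails the second check.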

The mov rsi,rcx instruction is present in the blog post as well and appears to be a limitation of the register allocator, probably influenced by RCX being an important register: it normally holds this. I'd assume it's not important enough to do the work to get it optimized away, since register-to-register moves take 0 cycles; they only affect register renaming.

Note how the difference between SSE2 and AVX2 is ugly: while the code moves and adds 8 floats at a time, it only actually uses 4 of them. Vector<float>.Count is 4 regardless of the processor flavor, leaving 2x perf on the table. Hard to hide the implementation detail, I guess.
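To see what vector width the JIT actually picks on a given machine, a small check like the one below can help. VectorWidthCheck and its methods are names made up for this sketch; Vector<float>.Count and Vector.IsHardwareAccelerated are the real System.Numerics members. The output depends on the hardware and runtime, so none is shown here.

```csharp
using System;
using System.Numerics;

static class VectorWidthCheck
{
    // Number of floats the runtime packs into one Vector<float>.
    public static int FloatsPerVector() => Vector<float>.Count;

    // Builds a short report; the values vary by machine and runtime.
    public static string Report()
    {
        return "IsHardwareAccelerated: " + Vector.IsHardwareAccelerated + Environment.NewLine
             + "Vector<float>.Count: " + FloatsPerVector() + Environment.NewLine
             + "Bytes per vector: " + FloatsPerVector() * sizeof(float);
    }
}
```

Printing VectorWidthCheck.Report() from a Release build, run without the debugger attached, shows what the optimized code really uses.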
