What is the purpose of the MoveMask for SSE and AVX

Questions

What is the purpose or intention of a MoveMask?
What's the best place to learn how to use x86/x86-64 assembly/SSE/AVX?
Could I have written my code more efficiently?

Reason for Questions

I have a function written in F# for .NET that uses SSE2. I've written the same thing using AVX2, but the underlying question is the same: what is the intended purpose of a MoveMask? I know that it works for my purposes; I want to know why.

I am iterating through two 64-bit float arrays, a and b, testing that all of their values match. I am using the CompareEqual method (which I believe is wrapping a call to __m128d _mm_cmpeq_pd) to compare several values at a time. I then compare that result with a Vector128 of 0.0 64-bit float. My reasoning is that the result of CompareEqual will give a 0.0 value in the cases where the values don't match. Up to this point, it makes sense.

I then use the Sse2.MoveMask method on the result of the comparison with the zero vector. I've previously worked on using SSE and AVX for matching, and I saw examples of people using MoveMask for the purpose of testing for non-zero values. I believe this method is using the int _mm_movemask_epi8 Intel intrinsic. I have included the F# code and the assembly that is JITed.

Is this really the intention of a MoveMask, or is it just a happy coincidence that it works for these purposes? I know my code works; I want to know WHY it works.

F# Code

#nowarn "9" "51" "20" // Don't want warnings about pointers

open System
open FSharp.NativeInterop
open System.Runtime.Intrinsics.X86
open System.Runtime.Intrinsics
open System.Collections.Generic

let sseFloatEquals (a: array<float>) (b: array<float>) =
    if a.Length = b.Length then
        let mutable result = true
        let mutable idx = 0
        
        if a.Length > 3 then
            let lastBlockIdx = a.Length - (a.Length % Vector128<float>.Count)
            let aSpan = a.AsSpan ()
            let bSpan = b.AsSpan ()
            let aPointer = && (aSpan.GetPinnableReference ())
            let bPointer = && (bSpan.GetPinnableReference ())
            let zeroVector = Vector128.Create 0.0

            while idx < lastBlockIdx && result do
                let aVector = Sse2.LoadVector128 (NativePtr.add aPointer idx)
                let bVector = Sse2.LoadVector128 (NativePtr.add bPointer idx)
                let comparison = Sse2.CompareEqual (aVector, bVector)
                let zeroTest = Sse2.CompareEqual (comparison, zeroVector)

                // The line I want to understand
                let matches = Sse2.MoveMask (zeroTest.AsByte ())
                if matches <> 0 then
                    result <- false

                idx <- idx + Vector128.Count

        while idx < a.Length && idx < b.Length && result do
            if a.[idx] <> b.[idx] then
                result <- false

            idx <- idx + 1

        result

    else
        false

Emitted Assembly

; Core CLR 5.0.921.35908 on amd64

_.sseFloatEquals$cont@11(System.Double[], System.Double[], Microsoft.FSharp.Core.Unit)
    L0000: push rdi
    L0001: push rsi
    L0002: push rbp
    L0003: push rbx
    L0004: sub rsp, 0x28
    L0008: vzeroupper
    L000b: mov eax, 1
    L0010: xor r8d, r8d
    L0013: mov r9d, [rcx+8]
    L0017: cmp r9d, 3
    L001b: jle short L008e
    L001d: mov r10d, r9d
    L0020: and r10d, 1
    L0024: mov r11d, r9d
    L0027: sub r11d, r10d
    L002a: lea r10, [rcx+0x10]
    L002e: mov esi, r9d
    L0031: test rdx, rdx
    L0034: jne short L003c
    L0036: xor edi, edi
    L0038: xor ebx, ebx
    L003a: jmp short L0043
    L003c: lea rdi, [rdx+0x10]
    L0040: mov ebx, [rdx+8]
    L0043: xor ebp, ebp
    L0045: test esi, esi
    L0047: je short L004c
    L0049: mov rbp, r10
    L004c: xor r10d, r10d
    L004f: test ebx, ebx
    L0051: je short L0056
    L0053: mov r10, rdi
    L0056: vxorps xmm0, xmm0, xmm0
    L005a: cmp r8d, r11d
    L005d: jge short L008e
    L005f: mov esi, eax
    L0061: test esi, esi
    L0063: je short L008e
    L0065: movsxd rsi, r8d
    L0068: vmovupd xmm1, [rbp+rsi*8]
    L006e: vmovupd xmm2, [r10+rsi*8]
    L0074: vcmpeqpd xmm1, xmm1, xmm2
    L0079: vcmpeqpd xmm1, xmm1, xmm0
    L007e: vpmovmskb esi, xmm1
    L0082: test esi, esi
    L0084: je short L0088
    L0086: xor eax, eax
    L0088: add r8d, 4
    L008c: jmp short L005a
    L008e: cmp r9d, r8d
    L0091: jle short L00c8
    L0093: cmp [rdx+8], r8d
    L0097: jle short L00c8
    L0099: mov r10d, eax
    L009c: test r10d, r10d
    L009f: je short L00c8
    L00a1: cmp r8d, r9d
    L00a4: jae short L00d1
    L00a6: movsxd r10, r8d
    L00a9: vmovsd xmm0, [rcx+r10*8+0x10]
    L00b0: cmp r8d, [rdx+8]
    L00b4: jae short L00d1
    L00b6: vucomisd xmm0, [rdx+r10*8+0x10]
    L00bd: jp short L00c1
    L00bf: je short L00c3
    L00c1: xor eax, eax
    L00c3: inc r8d
    L00c6: jmp short L008e
    L00c8: add rsp, 0x28
    L00cc: pop rbx
    L00cd: pop rbp
    L00ce: pop rsi
    L00cf: pop rdi
    L00d0: ret
    L00d1: call 0x00007ffcef38a370
    L00d6: int3

_.sseFloatEquals(System.Double[], System.Double[])
    L0000: mov r8d, [rcx+8]
    L0004: cmp r8d, [rdx+8]
    L0008: jne short L0012
    L000a: xor r8d, r8d
    L000d: jmp 0x00007ffc99000480
    L0012: xor eax, eax
    L0014: ret

For question #2: I think you should try x86 emulators online. https://carlosrafaelgn.com.br/Asm86/ I don't work on these emulators myself. Use discretion. — Ziaullah Khan, Nov 08 '21 at 04:11

Peter Cordes · Accepted Answer · 2021-11-08T05:02:16.180

MoveMask just extracts the high bit of each element into an integer bitmap. You have 3 element-size options: movmskpd (64-bit), movmskps (32-bit), and pmovmskb (8-bit).
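As a minimal sketch of that "high bit of each element" behaviour, here it is in C (chosen because the intrinsic names in this answer are the C ones). The helper name `sign_mask_pd` is made up for illustration; only the `_mm_*` calls are real SSE2 intrinsics.

```c
#include <emmintrin.h>  /* SSE2: _mm_set_pd, _mm_movemask_pd */

/* Hypothetical helper: build the vector {lo, hi} and return its 2-bit
 * movemask. Bit 0 is the sign bit of lo, bit 1 is the sign bit of hi. */
static inline int sign_mask_pd(double lo, double hi)
{
    __m128d v = _mm_set_pd(hi, lo);  /* _mm_set_pd takes the high element first */
    return _mm_movemask_pd(v);       /* movmskpd: one bit per 64-bit element */
}
```

For example, `sign_mask_pd(1.0, -2.0)` gives 2 (only the high element is negative), and `sign_mask_pd(-0.0, -3.0)` gives 3, because negative zero also has its sign bit set.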

This works well with SIMD compares, which produce all-zero bits in elements where the predicate is false and all-one bits in elements where the predicate is true. All-ones is a bit-pattern for -QNaN if interpreted as an IEEE-FP floating-point value, but normally you don't do that. Instead you use movemask, or AND (or AND / ANDN / OR, or _mm_blendv_pd), or things like that, on a compare result.

movemask(v) != 0, movemask(v) == 0x3, or movemask(v) == 0 is how you check conditions like at least one element in a compare matched, or all matched, or none matched, respectively, where v is the result of _mm_cmpeq_pd or whatever. (Or just to extract signs directly without a compare).

For other element sizes, 0xf or 0xffff to match all four or all 16 bits. Or for AVX 256-bit vectors, twice as many bits, up to filling a whole 32-bit integer with vpmovmskb eax, ymm0.
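The any / all / none checks could be sketched in C roughly like this. The function names are hypothetical and each one looks at a single pair of 128-bit vectors (two doubles); the last helper views the same compare result as bytes, which is what `Sse2.MoveMask` on `AsByte()` does in the question.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: 2-bit mask of a[i] == b[i] for i = 0, 1. */
static inline int eq_mask_pd(const double *a, const double *b)
{
    __m128d eq = _mm_cmpeq_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(eq);             /* 0x0 .. 0x3 */
}

static inline int any_equal_pd(const double *a, const double *b)
{
    return eq_mask_pd(a, b) != 0;           /* at least one element matched */
}

static inline int all_equal_pd(const double *a, const double *b)
{
    return eq_mask_pd(a, b) == 0x3;         /* both elements matched */
}

static inline int none_equal_pd(const double *a, const double *b)
{
    return eq_mask_pd(a, b) == 0;           /* no element matched */
}

/* Same compare, but one mask bit per byte (pmovmskb): the "all matched"
 * constant becomes 0xffff instead of 0x3. */
static inline int eq_mask_bytes(const double *a, const double *b)
{
    __m128d eq = _mm_cmpeq_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_epi8(_mm_castpd_si128(eq));
}
```

With `a = {1.0, 2.0}` and `b = {1.0, 3.0}`, only the low element matches, so `eq_mask_pd` is 0x1 and `eq_mask_bytes` is 0x00ff.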

What you're doing is really weird, using a 0.0 / NaN compare result as the input to another compare with vcmpeqpd xmm1, xmm1, xmm2 / vcmpeqpd xmm1, xmm1, xmm0. For the 2nd comparison, that can only be true for elements that are == 0.0 (i.e. +-0.0), because x == NaN is false for every x.

If the second vector is a constant zero (let zeroTest = Sse2.CompareEqual (comparison, zeroVector)), that's pointless: you're just inverting the compare result, which you could have done by checking a different integer condition or comparing against a different constant, not by doing runtime comparisons. (0.0 == 0.0 is true, producing an all-ones output; 0.0 == -NaN is false, producing an all-zero output.)
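A sketch of the loop with a single compare per vector, written in C rather than F# since I don't use F#. The function name `arrays_equal_pd` is hypothetical, and it assumes both arrays hold at least `n` doubles. The mismatch test is just "mask is not 0x3" instead of a second `cmpeqpd` against zero.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Returns 1 if a[i] == b[i] for all i < n under IEEE rules
 * (so a NaN never matches anything, including itself), else 0. */
static inline int arrays_equal_pd(const double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Two doubles per 128-bit vector, so step by 2. */
    for (; i + 2 <= n; i += 2) {
        __m128d eq = _mm_cmpeq_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i));
        if (_mm_movemask_pd(eq) != 0x3)   /* some element did not match */
            return 0;
    }

    /* Scalar tail for an odd element count. */
    for (; i < n; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

With the byte-granularity movemask used in the question, the same check would compare against 0xffff instead of 0x3.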

To learn more about intrinsics and SIMD, see for example Agner Fog's optimization guide; his asm guide has a chapter on SIMD. Also, his VectorClass library for C++ has some useful wrappers, and for learning purposes seeing how those wrapper functions implement some basic things could be useful.

To learn what things actually do, see Intel's intrinsics guide. You can search by asm instruction or C++ intrinsic name.

I think MS has docs for their C# System.Runtime.Intrinsics.X86, and I assume F# uses the same intrinsics, but I don't use either language myself.

Related re: comparisons:

Check that at least 1 element is true in each of multiple vectors of compare results - horizontal OR then AND
Get the last line separator - pcmpeqb -> pmovmskb -> bsr to find the position of the last match element in a vector of compare results. Bit-scan reverse on the compare mask. Often you want to scan forward to find the first match (or invert and find first mismatch, like for memcmp). e.g. Compare 16 byte strings with SSE
Or popcount them if you're counting occurrences by matching against a loop-invariant vector of a broadcasted character: How can I count the occurrence of a byte in array using SIMD? - instead of movemask, use the compare result as integer 0 / -1. SIMD subtract from a vector accumulator in the inner loop, then horizontal sum of integer elements in an outer loop.
SIMD instructions for floating point equality comparison (with NaN == NaN) - useful exercise in understanding how NaNs work.
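Two of the patterns from that list, sketched in C for a single 16-byte block. The function names are hypothetical, and `__builtin_ctz` / `__builtin_popcount` are GCC/Clang builtins (an assumption about the compiler; other compilers have their own equivalents).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Index (0..15) of the first differing byte of two 16-byte blocks,
 * or -1 if they are identical: pcmpeqb -> pmovmskb -> invert -> bit-scan. */
static inline int first_mismatch16(const void *p, const void *q)
{
    __m128i a = _mm_loadu_si128((const __m128i *)p);
    __m128i b = _mm_loadu_si128((const __m128i *)q);
    unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
    unsigned diff = ~eq & 0xffffu;      /* set bits mark mismatching bytes */
    if (diff == 0)
        return -1;
    return __builtin_ctz(diff);         /* lowest set bit = first mismatch */
}

/* Number of bytes in a 16-byte block that equal c: popcount of the mask. */
static inline int count_byte16(const void *p, char c)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8(c));
    return __builtin_popcount((unsigned)_mm_movemask_epi8(eq));
}
```

Scanning for the last match instead would use a reverse bit-scan on the same mask. For counting over a long array, the subtract-into-an-accumulator approach described above avoids doing a movemask and popcount per vector.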

score 2 · Answer 2 · answered Dec 14 '21 at 19:41

In addition to what Peter pointed out already.

Yes, MoveMask (movmskp) works well with comparisons if you need the indices as a general-purpose integer mask, to do things like bsf or popcnt or whatever.

As you are only determining whether anything is nonzero, Sse41.TestZ (or Avx.TestZ for AVX), which compiles to ptest, could be better, as it puts the result directly in a flag without populating a general-purpose register.

Alex Guteniev


Yup, `ptest same,same` / `jnz` is 3 uops, same as `pcmpeqb xmm0, xmm1` / `pmovmskb eax,xmm0` / `cmp eax,0xffff + je` (macro-fused cmp+branch), but the PTEST version doesn't need a `_mm_setzero_si128()` vector constant in a register, and doesn't destroy the value (which could cost a movdqa without AVX). It's also smaller machine-code size. Plus if you're using scalar `setcc` or `cmov` which can't macro-fuse with `test` or `cmp` the way a branch can, the PTEST version is fewer uops. – Peter Cordes Dec 14 '21 at 21:01
But if you have a compare result already (rather than testing if a vector is all-0 / non-zero), using `ptest` (`Sse41.TestZ`) is worse if you're branching on the result. Only 2 more uops after the pcmp with movemask vs. 3 with ptest. (Intel and AMD can both fuse `cmp/jcc` into a single uop for the relevant cases.) – Peter Cordes Dec 14 '21 at 21:02
Missed the point that there is macro fusion, and there has to be FP comparison. Indeed the compiler/jit did not emit `cmov` where it could: `L0082: test esi, esi L0084: je short L0088` – Alex Guteniev Dec 14 '21 at 21:32
