
I am writing some code in x64 assembly and using SIMD.

I have 9 bytes packed in the xmm15 register. For simplicity, let's look at the following code:

.data
Masks BYTE 0, -1, 0, -1, 5, -1, 0, -1, 0

.code
GetSumOfMasks proc

movdqu xmm15, xmmword ptr [Masks]
; xmm15 now contains { 0,-1,0,-1,5,-1,0,-1,0,0,0,0,0,0,0,0 }

; sum the elements horizontally - the result should be 1
; convert the sum to a QWORD and put it in RAX

GetSumOfMasks endp

How can I get the horizontal sum of the elements that are in xmm15?

I have tried haddps, but it seems to work with unsigned DWORDs only, and I couldn't find an alternative operating on bytes.

What SSE instruction(s) can I use to sum the signed bytes in this situation?

    Worse, `haddps` operates on floats :) Also, horizontal additions only add two neighboring elements. Whatever you do, you probably want to add 8 of those bytes, and handle the last one specially. – Jester Dec 15 '21 at 21:28
  • Do you have any suggestions how I can handle this, even when only adding 8 of those bytes? – nooblet2 Dec 15 '21 at 21:32
    You probably want `psadbw` against `0`, after range-shifting -128..+127 to unsigned 0..255, and then subtract the bias when you're done. [Fastest way to horizontally sum SSE unsigned byte vector](https://stackoverflow.com/q/36998538). The canonical Q&A [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) doesn't have a link to a more specific Q&A about signed instead of unsigned bytes, so this may not be a duplicate. – Peter Cordes Dec 15 '21 at 21:36
    Is it possible in your setting that the sum will overflow 8 bits? If so, do you want to return the mathematically correct answer extended to 64-bits, or to truncate? – Nate Eldredge Dec 15 '21 at 23:37
  • @NateEldredge: Either way, I suspect it's easiest to `psadbw` and just movzx if truncation is desired. Unless you had in mind using `pmaddubsw` with `9 dup(1), 7 dup(0)` as the first step, to do the masking to ignore high garbage for free (or at least trading latency for throughput), and then shuffle / `paddb` to pack back down? Hmm, any widening madd is going to leave gaps between bytes, although still have the bytes we want at the bottom of an aligned pair for pshufd and then pshuflw? Or a `pmaddwd` or another `pmaddubsw` with a 1,0 repeating mask? Haven't fully thought this through. – Peter Cordes Dec 16 '21 at 00:35
  • @PeterCordes: I was just thinking that if we are going to truncate, then the biasing is unnecessary, seeing as signed and unsigned addition are the same. – Nate Eldredge Dec 16 '21 at 00:45
  • @NateEldredge: ahh, I see. Yeah good point. `-1` vs. 0 gives a SAD of 255, but that's our `-1` again. And if there was a `+1` element in there, it would wrap the low byte back to 0, so -1 + +1 = 0. Yeah, you're right, it is effectively equivalent to doing zero-extension to 64-bit before adding, so if you don't care about the high bytes then you don't need to bias, and can just `movzx eax, dl` your final result. – Peter Cordes Dec 16 '21 at 00:51

1 Answer


The normal way to sum bytes is with psadbw against a zeroed register, to sum groups of 8 into the two 64-bit halves. (As in Sum reduction of unsigned bytes without overflow, using SSE2 on Intel, which is linked from Fastest way to do horizontal SSE vector sum (or other reduction).)

That works for unsigned bytes. (Or signed bytes, if you only care about the low 8 bits, i.e. truncating the sum to the element width. Any method that gives the correct truncated sum of unsigned bytes must also work for truncated signed bytes, because signed/unsigned addition is the same binary operation on a 2's complement machine.)
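If the truncated sum is all you need, a minimal sketch could look like the following, reusing the pslldq trick from the full version below to discard the high garbage, and the movzx idea from the comments above:

    ; truncating version (sketch): returns the sum modulo 256, zero-extended into RAX
    movdqu   xmm1, xmmword ptr [Masks]
    pslldq   xmm1, 7                        ; keep only the low 9 bytes (shifted to the top)
    pxor     xmm0, xmm0
    psadbw   xmm0, xmm1                     ; two 64-bit partial sums of unsigned bytes
    movd     eax, xmm0                      ; sum of the low half
    pextrw   edx, xmm0, 4                   ; sum of the high half
    add      eax, edx
    movzx    eax, al                        ; low 8 bits = truncated signed sum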

To do a widening sum of signed bytes, range-shift them to unsigned first, then subtract the bias at the end. Range-shift from -128..127 to 0..255 by adding 0x80, which is the same thing as flipping the high bit, so we can use pxor (which has better throughput than paddb on some CPUs). This takes a vector constant, but it's still more efficient than a chain of 3 shuffle/add steps or pmaddubsw / pmaddwd / pshufd / paddd.

You can discard any assemble-time-constant number of bytes with a vector byte shift; keeping 9 is a special case, see below. (So is keeping 8 or 4: just movq or movd.) If you need runtime-variable masking, you can load a sliding window of -1, ..., -1, 0, ... bytes, as in Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all.
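A sketch of that sliding-window idea, assuming the number of bytes to keep is in RCX and that this runs before the range-shift pxor (so the bias stays 16 * 80h); the window name is hypothetical:

    ; window db 16 dup(-1), 16 dup(0)       ; static 32-byte table, align 16
    lea      rdx, [window + 16]
    sub      rdx, rcx                       ; point at window[16 - n]
    movdqu   xmm2, xmmword ptr [rdx]        ; low RCX bytes = -1, the rest = 0
    pand     xmm1, xmm2                     ; zero the bytes you don't want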

You might consider passing a pointer arg to this function so you can use it on any 9 bytes of data (as long as they aren't near the end of a page, so it's safe to read 16).
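For example, a sketch assuming the Windows x64 calling convention (pointer arg in RCX); the loads in the versions below would then become:

    ; hypothetical pointer-arg variant: RCX = address of the 9 bytes
    movdqu   xmm1, xmmword ptr [rcx]        ; instead of loading from [Masks]
    ; or, for the 9-byte special case further down:
    ; movq   xmm1, qword ptr [rcx]
    ; movsx  rax,  byte ptr [rcx+8]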

;; General case, for any number of bytes from 9 .. 16
;; using SIMD for the low 8 and high 1..8
GetSumOfMasks proc
    movdqu   xmm1, xmmword ptr [Masks]
    pslldq   xmm1, 7                        ; discard 7 bytes, keep the low 9

    pxor     xmm1, [mask_80h]              ; range shift to unsigned.  hoist this constant load out of a loop if inlining

    pxor     xmm0, xmm0                     ; _mm_setzero_si128
    psadbw   xmm0, xmm1                     ; hsum bytes into two 64-bit halves
    movd     eax, xmm0                      ; low part
    pextrw   edx, xmm0, 4                   ; the significant part of the high qword.  2 uops, same as punpckhqdq / movd
    lea      eax, [rax + rdx - 16 * 80h]    ; 1 uop but worse latency than separate sub/add
         ; or into RAX if you want the result sign-extended to int64_t RAX
         ; instead of int32_t EAX
    ret
GetSumOfMasks endp
    
.section .rdata                 ; or however MASM spells this directive
   align 16
   mask_80h  db  16 dup(80h)

Other possibilities for the horizontal sum include doing it before extracting to scalar, like movhlps xmm1, xmm0 (or pshufd) / paddd xmm0, xmm1 / movd eax, xmm0 / sub rax, 16 * 80h. With another vector constant, you could even paddq with a -16 * 80h constant in parallel with the high->low shuffle, creating more ILP, but probably not worth it if the constant would have to come from memory.
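Spelled out, that first alternative might look like this sketch, replacing the movd / pextrw / lea at the end of the general version above:

    ; after  psadbw xmm0, xmm1  as above:
    pshufd   xmm1, xmm0, 0EEh               ; copy the high 64-bit sum down to the low dword
    paddd    xmm0, xmm1                     ; low dword = sum of both halves
    movd     eax, xmm0
    sub      rax, 16 * 80h                  ; undo the bias; also gives a correctly sign-extended RAX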

Using a single lea is good for throughput, but not for latency; see Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? (and https://agner.org/optimize/ and https://uops.info/) for details about slow-LEA (3 components, i.e. two + signs in the addressing mode, makes it slow on Intel and AMD). Ice Lake can still run a "slow LEA" at 1-cycle latency, on port 1 or 5 instead of any port, but SKL and earlier run it with 3-cycle latency, and thus only on port 1.

If you can hoist the mask generation out of a loop, you could generate it on the fly, e.g. pcmpeqd xmm1,xmm1 / pabsb xmm1,xmm1 (SSSE3) / psllw xmm1, 7.
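Spelling that out with comments (a sketch, using xmm2 so it doesn't clash with the data already in xmm1):

    pcmpeqd  xmm2, xmm2                     ; all-ones: every byte = 0FFh
    pabsb    xmm2, xmm2                     ; SSSE3: every byte = 1
    psllw    xmm2, 7                        ; each word 0101h << 7 = 8080h, i.e. every byte = 80h
    ; then  pxor xmm1, xmm2  instead of the load from [mask_80h]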

I was able to use just movd and SSE2 pextrw instead of movq because the unsigned sum of 8 bytes definitely fits in 16 bits. That saves code size (REX.W prefixes).


9 bytes is an interesting special case

Use a movq vector load to get the first 8, and a scalar movsx to get the left over byte. That way you don't have to mask off unwanted bytes in the high half, and don't need to extract the high 64-bit half of the psadbw result. (Unless maybe you want the full [Masks] in a register for something?)

; optimized for exactly 9 bytes; SIMD low half, scalar high byte.
GetSumOfMasks proc
    movq     xmm1, qword ptr [Masks]       ; first 8 bytes
    movsx    rax,   byte ptr [Masks+8]     ; 9th byte, sign-extended into full RAX so the 64-bit result is correct even when negative

    pxor     xmm1, [mask_80h]              ; range shift to unsigned.  hoist this constant load out of a loop if inlining
                                 ; note this is still a 16-byte vector load

    pxor     xmm0, xmm0                     ; _mm_setzero_si128
    psadbw   xmm0, xmm1                     ; hsum bytes into two 64-bit halves
    movd     edx, xmm0                      ; low part
    sub      rax, 8 * 80h                      ; subtract the bias off the critical path.  Only 8 biased bytes went into the psadbw sum
    add      rax, rdx
    ;lea      eax, [rax + rdx - 8 * 80h]    ; save an instruction but costs latency.
    ret
GetSumOfMasks endp

To shrink the vector constant to 8 bytes, you'd need to load it separately with movq. (Or still align it, but put some other constant in the high 8 bytes; those bytes are fully don't-care for this.)
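For instance (a sketch, with a hypothetical name for the 8-byte constant):

    ; mask_80h_q  db  8 dup(80h)            ; 8-byte constant in the read-only data section
    movq     xmm2, qword ptr [mask_80h_q]
    pxor     xmm1, xmm2                     ; range-shift just the low 8 bytes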

This version is optimized for latency on Intel pre-Ice Lake by doing a sub of the bias in parallel with the vector dep chain. If your use-case for this involves scalar stores into that Masks array, you might be hitting a store-forwarding stall anyway with the vector load, in which case you should probably just optimize for throughput and keep the bias adjustment off the critical path. But a store-forwarding stall might not happen if the data hasn't been written right before calling this. Still, if you already have the data in a vector register, it would be better to pass it to the function that way, instead of bouncing it through static storage.


Prefer the low 8 XMM registers; you can use them without REX prefixes. Also, XMM0..5 are fully call-clobbered in Windows x64, but XMM6..15 are call-preserved. That means you'd have to save/restore any you use.

(I thought I remembered reading once that only the low halves were call-preserved, in which case any functions you call might only restore the low halves, not the whole thing. But https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170 says XMM6-15 (not XMM6L-15L) are "non-volatile")

Peter Cordes
  • Thank you very much for the detailed answer! The only thing that I had to change, was to move [mask_80h] into an XMM register first, before using `pxor`. The compiler wouldn't let me use `pxor xmm1, [mask_80h]` directly, even when I marked the memory with `xmmword ptr`. After making that change, the code works perfectly. – nooblet2 Dec 16 '21 at 13:16
  • @nooblet2: MASM wouldn't assemble `pxor xmm1, xmmword ptr [mask_80h]`? What does it complain about? I wondered if it would need a size specifier, and kinda forgot about adding one in the end. (NASM and GAS don't have MASM's annoying stuff where it matters how you declare data after a label.) An assembler making you follow its syntax quirks to get the machine code you want is one thing, but an assembler that forces you to write less efficient code would be garbage. I assume MASM has a way to do this, otherwise I'd recommend switching to a good assembler like NASM. – Peter Cordes Dec 16 '21 at 13:20
  • The error that I got was "A2022 - instruction operands must be the same size", regardless if I wrote `pxor xmm1, [mask_80h]` or `pxor xmm1, xmmword ptr [mask_80h]`. Adding `movdqu xmm2, xmmword ptr [mask_80h]` then `pxor xmm1, xmm2` is not an issue for me and it fixed the problem. – nooblet2 Dec 16 '21 at 13:52
  • @nooblet2: but it costs another uop and more code size. The whole point of writing in assembler is performance, so how is that "not an issue"? Normally you'd just write in C with intrinsics. It's very rare to actually want to use hand-written assembler for real, other than to play around with a performance experiment, and it only makes sense for cases where you can't get a C compiler to emit asm that's as good. But this is worse than what you'd get from straightforward C with intrinsics, I expect. Especially with this short function that can't inline, so it has to actually be `call`ed. – Peter Cordes Dec 16 '21 at 13:56
  • https://godbolt.org/z/PjY73sbKz shows MSVC-generated asm using `pxor xmm0, XMMWORD PTR __xmm@80808080808080808080808080808080`. (And yes, that's how MSVC chooses symbol names for its vector constants, presumably to enable merging of multiple that have the same value, by encoding the value into the symbol name.) So presumably some version of MASM can assemble that, although it's not guaranteed. – Peter Cordes Dec 16 '21 at 14:02
  • Do you think this could be an issue with the compiler that I used, then? – nooblet2 Dec 16 '21 at 14:04
  • @nooblet2: presumably with the *assembler* you used, yes. It's asm, we assemble it, not compile it. I'm not a fan of MASM, but it is AFAIK usable, not garbage, so I'd be really surprised if there was a version without a way to write a PXOR with a memory source operand, using a RIP-relative addressing mode for a label. – Peter Cordes Dec 16 '21 at 14:12