The normal way to sum bytes is with psadbw against a zeroed register, to sum groups of 8 into the two 64-bit halves. (As mentioned in Sum reduction of unsigned bytes without overflow, using SSE2 on Intel, mentioned by Fastest way to do horizontal SSE vector sum (or other reduction).)
That works for unsigned bytes. (Or signed bytes, if you only care about the low 8 bits, i.e. truncating the sum to the element width. Any method that gives the correct truncated sum of unsigned bytes must also work for truncated signed bytes, because signed/unsigned addition is the same binary operation on a 2's complement machine.)
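For reference, a minimal sketch of that plain unsigned case, assuming the 16 bytes are already in xmm0:
pxor    xmm1, xmm1          ; zeroed register
psadbw  xmm0, xmm1          ; two 64-bit halves: sum of bytes 0-7 and of bytes 8-15
pshufd  xmm1, xmm0, 0EEh    ; copy the high qword of the psadbw result down to the low qword
paddd   xmm0, xmm1          ; total of 16 bytes fits easily in 16 bits (16 * 255 = 4080)
movd    eax, xmm0           ; horizontal byte sum in EAX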
To do a widening sum of signed bytes, range-shift to unsigned first, then subtract the bias at the end. Range-shift from -128..127 to 0..255 by adding 0x80, which is the same thing as flipping the high bit, so we can use pxor (which has better throughput on some CPUs than paddb). This takes a mask vector constant, but it's still more efficient than a chain of 3 shuffle/add steps, or pmaddubsw / pmaddwd / pshufd / paddd.
You can discard any assemble-time-constant number of bytes with a vector byte-shift. Keeping 9 is a special case, see below. (So is 8 or 4: just movq or movd.) If you need runtime-variable masking, probably load a sliding window of mask bytes from an array of -1, ..., -1, 0, ..., 0, as in Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. You might consider passing a pointer arg to this function so you can use it on any 9-byte data (as long as it's not near the end of a page, so it's safe to read 16).
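One possible sketch of that runtime-variable masking, with a hypothetical byteshift_mask table and the keep-count (0..16) assumed to be in RCX; do the AND before the range-shift XOR so the bias stays 16 * 80h:
; byteshift_mask db 16 dup(0FFh), 16 dup(0)   ; hypothetical table in the read-only data section
lea     rdx, [byteshift_mask + 16]  ; points at the first zero byte
sub     rdx, rcx                    ; back up by the keep-count
movdqu  xmm2, xmmword ptr [rdx]     ; RCX bytes of 0FFh at the bottom, zeros above
pand    xmm1, xmm2                  ; zero the bytes we want to discard
; then pxor with [mask_80h] and psadbw as usual; the zeroed lanes become 80h, so still subtract 16 * 80h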
;; General case, for any number of bytes from 9 .. 16
;; using SIMD for the low 8 and high 1..8
GetSumOfMasks proc
movdqu xmm1, xmmword ptr [Masks]
pslldq xmm1, 7 ; discard 7 bytes, keep the low 9
pxor xmm1, xmmword ptr [mask_80h] ; range shift to unsigned. hoist this constant load out of a loop if inlining
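; (the 7 bytes zeroed by pslldq are now 80h each, so the 16 * 80h bias subtracted at the end covers all 16 lanes)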
pxor xmm0, xmm0 ; _mm_setzero_si128
psadbw xmm0, xmm1 ; hsum bytes into two 64-bit halves
movd eax, xmm0 ; low part
pextrw edx, xmm0, 4 ; the significant part of the high qword. 2 uops, same as punpckhqdq / movd
lea eax, [rax + rdx - 16 * 80h] ; 1 uop but worse latency than separate sub/add
; or into RAX if you want the result sign-extended to int64_t RAX
; instead of int32_t EAX
ret
GetSumOfMasks endp
.const   ; MASM's read-only data section (like .rodata)
align 16
mask_80h db 16 dup(80h)
Other possibilities for the horizontal sum include doing it before extracting to scalar, like movhlps xmm1, xmm0 (or pshufd) / paddd xmm0, xmm1 / movd eax, xmm0 / sub rax, 16 * 80h. With another vector constant, you could even paddq with a -16 * 80h constant in parallel with the high->low shuffle, creating more ILP, but that's probably not worth it if the constant would have to come from memory.
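A minimal sketch of that first variant, assuming the biased psadbw result is already in xmm0 and xmm1 is free to clobber:
movhlps xmm1, xmm0        ; high qword -> low qword of xmm1 (or pshufd xmm1, xmm0, 0EEh)
paddd   xmm0, xmm1        ; both halves fit in 16 bits, so a dword add is fine
movd    eax, xmm0         ; biased total, 0 .. 16*255
sub     rax, 16 * 80h     ; undo the range-shift bias; also gives a correct int64_t in RAX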
Using a single lea is good for throughput, but not for latency; see Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? (and https://agner.org/optimize/ and https://uops.info/) for details about slow LEA (3 components, i.e. two + signs in the addressing mode, which makes it slower on Intel and AMD). Ice Lake can still run a "slow LEA" with 1 cycle latency, on port 1 or 5 instead of any port, but SKL and earlier run it with 3 cycle latency, and then only on port 1.
If you can hoist the mask generation out of a loop, you could generate it on the fly, e.g. pcmpeqd xmm1,xmm1 / SSSE3 pabsb xmm1,xmm1 / psllw xmm1, 7.
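Spelled out, using xmm2 here so it doesn't clash with the data in xmm1 (pabsb needs SSSE3):
pcmpeqd xmm2, xmm2        ; all-ones: set1_epi8(-1)
pabsb   xmm2, xmm2        ; set1_epi8(1)
psllw   xmm2, 7           ; set1_epi8(80h); shift count < 8, so no bit crosses into the next byte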
I was able to use just movd and SSE2 pextrw instead of movq because the unsigned sum of 8 bytes definitely fits in 16 bits. That saves code size (REX.W prefixes).
9 bytes is an interesting special case
Use a movq vector load to get the first 8, and a scalar movsx to get the leftover byte. That way you don't have to mask off unwanted bytes in the high half, and don't need to extract the high 64-bit half of the psadbw result. (Unless maybe you want the full [Masks] in a register for something?)
; optimized for exactly 9 bytes; SIMD low half, scalar high byte.
GetSumOfMasks proc
movq xmm1, qword ptr [Masks] ; first 8 bytes
movsx eax, byte ptr [Masks+8] ; 9th byte, sign-extended. Use movsx rax if you want a correct int64_t in RAX
pxor xmm1, xmmword ptr [mask_80h] ; range shift to unsigned. hoist this constant load out of a loop if inlining
; note this is still a 16-byte vector load
pxor xmm0, xmm0 ; _mm_setzero_si128
psadbw xmm0, xmm1 ; hsum bytes into two 64-bit halves
movd edx, xmm0 ; low part
sub rax, 8 * 80h ; subtract the bias, off the critical path. Only 8 biased bytes made it into the final sum
add rax, rdx
;lea eax, [rax + rdx - 8 * 80h] ; save an instruction but costs latency.
ret
GetSumOfMasks endp
To shrink the vector constant to 8 bytes, you'd need to load it separately with movq. (Or still align it, but put some other constant in the high 8 bytes; those bytes are fully don't-care for this.)
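A possible sketch of that, with a hypothetical mask_80h_q label for the 8-byte constant:
; mask_80h_q dq 8080808080808080h     ; 8-byte constant in the data section
movq    xmm2, qword ptr [mask_80h_q]  ; zero-extends into the high half of xmm2
pxor    xmm1, xmm2                    ; range-shift only the low 8 bytes; the high half of xmm1 stays 0
The high qword of the psadbw result is discarded anyway, so it doesn't matter that those lanes aren't biased.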
This version is optimized for latency on Intel pre-Ice Lake by doing a sub of the bias in parallel with the vector dep chain. If your use-case for this involves scalar stores into that Masks array, you might be hitting a store-forwarding stall anyway with the vector load, in which case you should probably just optimize for throughput rather than trying to shorten the critical path. But a store-forwarding stall might not happen if the data wasn't written right before calling this. Still, if you have the data in a vector register, it would be better to pass it to the function that way, instead of bouncing it through static storage.
Prefer the low 8 XMM registers; you can use them without REX prefixes. Also note that XMM0..5 are fully call-clobbered in Windows x64, but XMM6..15 are call-preserved, which means you'd have to save/restore any of those you use.
(I thought I remembered reading once that only the low halves were call-preserved, in which case any functions you call might only restore the low halves, not the whole register. But https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170 says XMM6-15 (not XMM6L-15L) are "non-volatile".)