I have the following code in x64 Microsoft Macro Assembler (simplified example):

.DATA

First BYTE -4, -3, -2, -1, 0, 1, 2, 3
Second BYTE 1, 2, 3, 4, 5, 6, 7, 8

.CODE

MultiplyAndSum PROC

; move First and Second to vectors
; multiply corresponding elements
; sum the results
; return the sum

MultiplyAndSum ENDP

In that procedure, I want to multiply corresponding bytes from the two arrays using SIMD (it doesn't matter which registers are used), then sum the results. So in this case, I want to compute:

-4 * 1 + (-3) * 2 + ... + 3 * 8 = 24

and return 24.

Is this achievable using vector instructions?

From what I've seen, most multiplication instructions operate on WORDs or DWORDs. Is there a way to split the multiplication into pieces and operate on, for example, WORDs instead of BYTEs?

The instructions pmaddwd, pmullw and pmulhw don't seem useful to me in this case. Are there any others that I am missing?

Peter Cordes
  • What instruction set extensions are you permitted to use? – fuz Dec 18 '21 at 10:19
  • All of them - MMX, SSE, AVX - the key point is to perform the operations using vectors, but it doesn't matter what instructions or registers are used exactly. – assemblyhelp Dec 18 '21 at 10:28
  • If one of the vectors is signed-positive, it can work as the unsigned operand to `pmaddubsw` to add and pairwise hsum. Otherwise no, I don't think so, but check Agner Fog's vectorclass library to see how it emulates multiply for `Vec16c` https://github.com/vectorclass/version2/blob/fee0601edd3c99845f4b7eeb697cff0385c686cb/vectori128.h#L1308. Of course that's not hsumming the result; packing it back into separate 8-bit sums takes a bunch of extra work. – Peter Cordes Dec 18 '21 at 10:31
  • You can sign-extend `int8` to `int16` using `pmovsxbw` before doing `pmaddwd` (you can then accumulate up to (2^16-1) of these intermediate results with `paddd` before doing further horizontal reductions). If one input vector was actually unsigned you could use `pmaddubsw` as @PeterCordes suggested, but you need to be very careful with overflows (they will saturate the result). – chtz Dec 18 '21 at 14:21
  • Sign-extending through `pmovsxbw`, then using `pmaddwd` works right - what I need now is to horizontally sum the four 32-bit results of `pmaddwd`. I tried `phaddd`, but it didn't give the correct result. How can I do that? – assemblyhelp Dec 18 '21 at 16:45
  • For horizontal reduction, Peter wrote an extensive answer here: https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction – chtz Dec 18 '21 at 20:01
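
Putting the comments above together, here is a minimal sketch of the `pmovsxbw` + `pmaddwd` approach followed by a shuffle-and-add horizontal sum. It is written with C intrinsics rather than MASM so it can be compiled and checked; each intrinsic corresponds to the instruction named in its comment. The function name `multiply_and_sum8` is made up for illustration, the `target` attribute assumes GCC or Clang, and the code handles exactly 8 bytes per input, as in the question.

```c
#include <emmintrin.h>  /* SSE2: movq, pmaddwd, pshufd, paddd, movd */
#include <smmintrin.h>  /* SSE4.1: pmovsxbw */
#include <stdint.h>

/* Same data as First and Second in the question. */
const int8_t First[8]  = { -4, -3, -2, -1, 0, 1, 2, 3 };
const int8_t Second[8] = {  1,  2,  3,  4, 5, 6, 7, 8 };

/* Hypothetical helper (not from the question): dot product of two
 * 8-element signed byte arrays. The attribute enables SSE4.1 for this
 * one function (GCC/Clang assumption) so no compiler flag is needed. */
__attribute__((target("sse4.1")))
int multiply_and_sum8(const int8_t *a, const int8_t *b)
{
    /* movq: load 8 bytes into the low half of an XMM register */
    __m128i va = _mm_loadl_epi64((const __m128i *)a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)b);

    /* pmovsxbw: sign-extend the 8 low bytes to 8 words */
    __m128i wa = _mm_cvtepi8_epi16(va);
    __m128i wb = _mm_cvtepi8_epi16(vb);

    /* pmaddwd: multiply words and add adjacent pairs -> 4 dwords */
    __m128i prod = _mm_madd_epi16(wa, wb);

    /* pshufd + paddd: swap the 64-bit halves and add */
    __m128i swapped = _mm_shuffle_epi32(prod, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i sum = _mm_add_epi32(prod, swapped);

    /* pshufd + paddd: swap adjacent dwords and add;
     * every lane now holds the full sum */
    swapped = _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1));
    sum = _mm_add_epi32(sum, swapped);

    /* movd: return the low dword */
    return _mm_cvtsi128_si32(sum);
}
```

For the arrays in the question, `pmaddwd` produces the four partial sums -10, -10, 6, 38, and the two shuffle-and-add steps reduce them to 24. On the `phaddd` attempt: a single `phaddd` only halves the number of partial sums, so it would have to be applied twice to reduce four dwords to one.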

0 Answers