
I know modern x86 has opcodes (often exposed through compiler intrinsics) that perform element-wise addition or multiplication of packed elements from two arrays. That is, if I have two arrays `int a[4] { ... }, b[4] { ... };`, there are instructions that will perform the equivalent of:

int c[4];
...
c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
c[2] = a[2] + b[2];
c[3] = a[3] + b[3];

Or the same for multiplication. But is there an x86 (or x86-64) opcode that would instead give me

long long result = a[0] + a[1] + a[2] + a[3];

in one step? I've looked for such an instruction in opcode lists as well as in various matrix-multiplication posts (where I know it would be extremely useful), without success.

  • So you mean a reduction? You could take a look at https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction – horro May 05 '22 at 21:04
  • Okay, I was not aware this is called a reduction; that gives me lots of new results to look at. – SoronelHaetir May 05 '22 at 21:18
  • For float there's also `dpps` against a vector of four 1.0f, but it's usually not faster than shuffle/add, and doesn't work for integer. Oh, I just realized you might want to widen the result to avoid overflow, so maybe not an exact duplicate. (Your C doesn't actually do that; you do an `int` sum and *then* sign-extend the result to `long long` after possible signed-overflow UB.) – Peter Cordes May 06 '22 at 04:05
  • You don't need one of these for matrix multiplication btw, you can implement matrix multiplication purely with "vertical" multiplication and addition, by reordering the computation to be more SIMD-friendly: put the SIMD-parallelism towards computing several entries of the result at once, rather than towards accelerating the computation of a single entry. – harold May 06 '22 at 09:19
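Following up on the comments above: the shuffle/add reduction from the linked Q&A can be combined with widening, per Peter Cordes's note about overflow. Here is a minimal sketch, not taken from the discussion itself, assuming SSE4.1 and an x86-64 target; the function name `hsum_widen_epi32` is mine:

    #include <stdint.h>
    #include <smmintrin.h>  // SSE4.1, for _mm_cvtepi32_epi64

    // Widen four ints to 64-bit lanes *before* summing, so the
    // reduction itself cannot overflow. Two vertical adds plus shuffles.
    static int64_t hsum_widen_epi32(const int a[4])
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)a);
        // Sign-extend the low and high pairs of 32-bit lanes to 64-bit.
        __m128i lo = _mm_cvtepi32_epi64(v);                        // a[0], a[1]
        __m128i hi = _mm_cvtepi32_epi64(_mm_unpackhi_epi64(v, v)); // a[2], a[3]
        __m128i s2 = _mm_add_epi64(lo, hi);          // {a[0]+a[2], a[1]+a[3]}
        // Bring the high 64-bit lane down and add it to the low lane.
        __m128i s  = _mm_add_epi64(s2, _mm_unpackhi_epi64(s2, s2));
        return _mm_cvtsi128_si64(s);  // extract low 64 bits (x86-64 only)
    }

Note this is still shuffle/add, just on 64-bit lanes after the widening step; as the comments indicate, no single integer instruction performs the whole reduction.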
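And to illustrate harold's comment about keeping matrix multiplication purely vertical: a sketch of a 4x4 `int` multiply where each SIMD operation computes partial sums for an entire row of the result, so no horizontal reduction is needed. This also assumes SSE4.1 (plain SSE2 lacks `_mm_mullo_epi32`); the function name is illustrative:

    #include <smmintrin.h>  // SSE4.1, for _mm_mullo_epi32

    // C = A * B for 4x4 int matrices using only vertical multiplies
    // and adds: broadcast one element of A, multiply it by a whole
    // row of B, and accumulate a whole row of C at once.
    void matmul4x4(const int A[4][4], const int B[4][4], int C[4][4])
    {
        for (int i = 0; i < 4; i++) {
            __m128i row = _mm_setzero_si128();
            for (int k = 0; k < 4; k++) {
                __m128i a = _mm_set1_epi32(A[i][k]);                // broadcast A[i][k]
                __m128i b = _mm_loadu_si128((const __m128i *)B[k]); // row k of B
                row = _mm_add_epi32(row, _mm_mullo_epi32(a, b));
            }
            _mm_storeu_si128((__m128i *)C[i], row);                 // row i of C
        }
    }

The SIMD parallelism goes toward computing four entries of `C` at once rather than accelerating a single dot product, which is exactly the reordering harold describes.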

0 Answers