I know modern x86 has opcodes (often supported by compiler intrinsics) to perform element-wise multiplication and summation of packed elements between two arrays. That is, if I have two arrays: int a[4] { ... }, b[4] {...}, there are instructions that will perform the equivalent of:
int c[4];
...
c[0] = a[0] + b[0];
c[1] = a[1] + b[1];
c[2] = a[2] + b[2];
c[3] = a[3] + b[3];
Or the same for multiplication. But is there an x86 (or x86-64) opcode that would instead give me
long long result = a[0] + a[1] + a[2] + a[3];
in one step? I've looked for such an instruction in opcode lists as well as in various matrix multiplication posts (where I know it would be extremely useful), without success.
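To make the two operations concrete, here is a minimal sketch in C. It assumes SSE2 is available (with a plain-C fallback otherwise), and the function names `add4` and `hsum4_i32` are hypothetical, chosen only for illustration. The horizontal sum shown is a multi-step shuffle-and-add reduction, not the single instruction being asked about, and it accumulates in 32 bits, so unlike the `long long` version above it can wrap on overflow.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Element-wise add of four ints: c[i] = a[i] + b[i]. */
static void add4(const int *a, const int *b, int *c) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi32(va, vb));
#else
    for (int i = 0; i < 4; ++i) c[i] = a[i] + b[i];
#endif
}

/* Horizontal sum of four ints, done as two shuffle+add steps.
   Assumption: a 32-bit result is acceptable (no widening to 64 bits). */
static int hsum4_i32(const int *a) {
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)a);
    /* Swap the two 64-bit halves: [a2, a3, a0, a1]. */
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i s  = _mm_add_epi32(v, hi);      /* [a0+a2, a1+a3, ...] */
    /* Swap adjacent 32-bit lanes so lane 0 sees the other partial sum. */
    __m128i sw = _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1));
    s = _mm_add_epi32(s, sw);               /* lane 0 = a0+a1+a2+a3 */
    return _mm_cvtsi128_si32(s);            /* extract lane 0 */
#else
    return a[0] + a[1] + a[2] + a[3];
#endif
}
```

The first function maps onto one packed add, while the second needs several instructions; the question is whether that reduction exists as a single opcode.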