On paper I drew out the long form of this algorithm, and on paper it should work fine. Am I running into a subtlety with register casting (256/128/256), or did I actually mess up the algorithm structure somewhere?
For convenience, I've put the vanilla code and the AVX code up on the Godbolt viewer so you can see the generated assembly at will.
Standard code https://godbolt.org/g/v47RKH
My AVX Attempt 1: https://godbolt.org/g/oH1DpO
My AVX Attempt 2: https://godbolt.org/g/QFtdKr (Shaved 5 cycles and reduced casting needs, easier to read)
The SSE code oddly enough is using scalar operations, which boggles my mind since that can definitely be accelerated with horizontal broadcasts, muls, and adds. What I'm trying to do is take that concept up one level.
RHS never needs to be changed, but essentially if LHS is {a, b, ..., p}, and LHS is {1, 2, ..., 16}, then we just need 2 registers to hold the 2 halves of RHS and then 2 registers to hold a given row of LHS in the forms {a, a, a, a, b, b, b, b} and {c, c, c, c, d, d, d, d}. This is achieved via 2 broadcasts and a 256/128/256 cast.
We get the intermediate results of
{a*1, a*2, a*3, a*4, b*5, b*6, b*7, b*8} => row[0]
and
{c*9, c*10, c*11, c*12, d*13, d*14, d*15, d*16} => row[1]
And this is unrolled once w.r.t LHS so we generate
{e*1, ... f*8}, {g*9, ... h*16} => row[2], row[3]
Next add r0,r1 and r2,r3 together (keeping r0 and r2 as the current intermediates)
Finally, extract the high half of row[0] to the low half of resHalf, insert the low half of row[2] into the high half of resHalf, insert the high half of row[2] into the high half of row[0], and then add row[0] to resHalf.
By all rights, that should leave us with resHalf[0] equaling the following at the end of iteration i = 0
{a*1 + b*2 + c*3 + d*4, a*5 + b*6 + c*7 + d*8,
a*9 + b*10 + c*11 + d*12, a*13 + b*14 + c*15 + d*16,
e*1 + ... + h*4, e*5 + ... + h*8,
e*9 + ... + h*12, e*13 + ... + h*16}
What my algorithm is producing, however, is the following:
2x {a*1 + c*3, a*5 + c*7, a*9 + c*11, a*13 + c*15},
2x {e*1 + g*3, e*5 + g*7, e*9 + g*11, e*13 + g*15}
And what's scarier still is if I swap rhsHolders[0/1] in the ternary conditional, it doesn't change the results at all. It's as though the compiler is ignoring one of the swaps and adds. Both Clang 4 and GCC 7 do this, so where did I screw up?
EDIT: output should be 4 rows of {10, 26, 42, 58}, but I get {4, 12, 20, 28}