I need to multiply a 2x2 matrix by a 2D vector. Both use 32-bit floats. I'm hoping to do this using SSE (any version, really) for speed, as I'm going to use it for real-time audio processing.
So the formula I would need is the following:
left = L*A + R*B
right = L*C + R*D
I was thinking of reading the whole matrix from memory into a single 128-bit SIMD register (4 x 32-bit floats), if that makes sense. But if it's a better idea to process this in smaller pieces, then that's fine too.
The L & R values will be in their own float variables when the processing begins, so they would need to be moved into a SIMD register and, once the calculation is done, moved back into regular variables.
The IDEs I'm hoping to compile this with are Xcode and Visual Studio, so the code would need to build and run properly with Clang and Microsoft's own compiler (MSVC).
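For illustration, a minimal sketch of the approach described above, assuming the matrix is stored in memory as {A, B, C, D} and that plain SSE1 intrinsics from `<xmmintrin.h>` are acceptable. The function name `Mul2x2RowMajor` is hypothetical, and this is one possible way to do it rather than a tuned implementation:

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics (x86 / x86-64 only)

// Hypothetical helper: m points to 4 floats {A, B, C, D};
// left and right are read as L and R and overwritten with the results.
inline void Mul2x2RowMajor(const float* m, float& left, float& right)
{
    const __m128 mat  = _mm_loadu_ps(m);                        // [A,   B,   C,   D  ]
    const __m128 vec  = _mm_setr_ps(left, right, left, right);  // [L,   R,   L,   R  ]
    const __m128 prod = _mm_mul_ps(mat, vec);                   // [L*A, R*B, L*C, R*D]

    // Swap the two lanes inside each pair, then add, so that
    // lane 0 holds L*A + R*B and lane 2 holds L*C + R*D.
    const __m128 swapped = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    const __m128 sum     = _mm_add_ps(prod, swapped);

    left  = _mm_cvtss_f32(sum);                      // extract lane 0
    right = _mm_cvtss_f32(_mm_movehl_ps(sum, sum));  // move lane 2 to lane 0, extract
}
```

Note that packing L and R into the register and extracting the two results costs several shuffles for only four multiplications, so for a single sample pair this is not obviously faster than plain scalar code.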
All help is welcome. Thank you in advance!
I already tried reading through the SSE instruction set references, but there is so much content in there that it would take a very long time to find the suitable instructions and then the corresponding intrinsics to get anything working.
ADDITIONAL INFORMATION BASED ON YOUR QUESTIONS:
The L & R data comes from two separate arrays. I have a pointer to each of the two arrays (L & R) and step through them at the same time. So the left/right audio channel data is not interleaved; each channel has its own pointer. In other words, the data is arranged like: LLLLLLLLL RRRRRRRRRR.
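With the data arranged this way, one option is to process four consecutive samples of each channel per iteration and broadcast each matrix coefficient across a register, which avoids shuffles entirely. A minimal sketch, assuming in-place processing, unaligned buffers, and SSE1 only; the name `Mix2x2Planar` and its parameter list are hypothetical:

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics (x86 / x86-64 only)

// Hypothetical helper: left and right are separate (non-interleaved) buffers
// of n samples each, overwritten in place with
//   left  = L*a + R*b
//   right = L*c + R*d
inline void Mix2x2Planar(float* left, float* right, std::size_t n,
                         float a, float b, float c, float d)
{
    // Broadcast each coefficient to all four lanes once, outside the loop.
    const __m128 va = _mm_set1_ps(a);
    const __m128 vb = _mm_set1_ps(b);
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vd = _mm_set1_ps(d);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        const __m128 l = _mm_loadu_ps(left + i);   // 4 left samples
        const __m128 r = _mm_loadu_ps(right + i);  // 4 right samples
        _mm_storeu_ps(left + i,  _mm_add_ps(_mm_mul_ps(l, va), _mm_mul_ps(r, vb)));
        _mm_storeu_ps(right + i, _mm_add_ps(_mm_mul_ps(l, vc), _mm_mul_ps(r, vd)));
    }

    // Scalar tail for the last n % 4 samples.
    for (; i < n; ++i)
    {
        const float l = left[i];
        const float r = right[i];
        left[i]  = l * a + r * b;
        right[i] = l * c + r * d;
    }
}
```

If the buffers are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead of the unaligned variants; that alignment is not something the description above guarantees.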
Some really good points have been made in the comments about modern compilers being able to optimize this kind of code really well. This matters especially here, because multiplication is quite fast while shuffling data inside the SIMD registers might be needed: using more multiplications might still be faster than having to shuffle the data multiple times. I didn't realise that modern compilers can be that good these days. I'll have to experiment on Godbolt using std::array and see what kind of results I get for my particular case.
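A plain scalar loop of the following kind is the sort of code that could be pasted into Godbolt to see what the compiler makes of it. This is only a sketch: the name `Mix2x2Scalar` is hypothetical, and `__restrict` is a non-standard extension (accepted by Clang, GCC and MSVC) used here on the assumption that the two buffers never overlap. Whether the loop actually gets vectorized depends on the compiler and the optimization flags.

```cpp
#include <cstddef>

// Hypothetical helper: same math as the formula above, written as a plain loop.
// __restrict (a compiler extension) promises that left and right do not alias,
// which makes it easier for the compiler to vectorize the loop on its own.
inline void Mix2x2Scalar(float* __restrict left, float* __restrict right,
                         std::size_t n, float a, float b, float c, float d)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const float l = left[i];
        const float r = right[i];
        left[i]  = l * a + r * b;
        right[i] = l * c + r * d;
    }
}
```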
The data needs to be in 32 bit floats, as that is used all over the application. So 16 bit doesn't work for my case.
MORE INFORMATION BASED ON MY TESTS:
I used Godbolt.org to test how the compiler optimizes my code. What I found is that if I do the following, I don't get optimal code:
#include <array>
using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;
// m is stored row by row: {A, B, C, D}
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;
    result[0] = v[0]*m[0] + v[1]*m[1];
    result[1] = v[0]*m[2] + v[1]*m[3];
    return result;
}
But if I do the following, I do get quite nice code:
#include <array>
using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;
// m is stored column by column (transposed): {A, C, B, D}
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;
    result[0] = v[0]*m[0] + v[1]*m[2];
    result[1] = v[0]*m[1] + v[1]*m[3];
    return result;
}
In other words, if I transpose the 2x2 matrix, the compiler seems to produce pretty good code as is. I believe I should go with this method, since the compiler seems to be able to handle the code nicely.
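A likely reason the transposed layout works better is that the result becomes v[0] times the first column plus v[1] times the second column, so the two partial products never need to be summed across lanes. A hand-written sketch of that idea with SSE1 intrinsics is shown here for comparison only; the name `Multiply2DColumnMajorSSE` is hypothetical, and this is not necessarily what the compiler emits:

```cpp
#include <array>
#include <xmmintrin.h>  // SSE1 intrinsics (x86 / x86-64 only)

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

// Hypothetical hand-written counterpart of the transposed Multiply2D.
// m is stored column by column: {A, C, B, D}.
inline Vec2 Multiply2DColumnMajorSSE(const Mat2& m, const Vec2& v)
{
    const __m128 cols = _mm_loadu_ps(m.data());     // [A, C, B, D]
    const __m128 col1 = _mm_movehl_ps(cols, cols);  // [B, D, B, D]

    // Lanes 0 and 1 hold v0*A + v1*B and v0*C + v1*D; lanes 2 and 3 are ignored.
    const __m128 res = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(v[0]), cols),
                                  _mm_mul_ps(_mm_set1_ps(v[1]), col1));

    Vec2 out;
    out[0] = _mm_cvtss_f32(res);                                                // lane 0
    out[1] = _mm_cvtss_f32(_mm_shuffle_ps(res, res, _MM_SHUFFLE(1, 1, 1, 1)));  // lane 1
    return out;
}
```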