This first version was posted by the OP as an edit to the question, where it doesn't belong. Moved it to a community-wiki answer just for posterity.
That first version is absolute garbage for performance, about the worst possible way to vectorize: it does an hsum down to scalar inside the inner-most loop, and does a manual gather with _mm_insert_epi32, not even a 4x4 transpose.
Update:
Woho! I finally figured it out. Besides the errors in my logic (thanks for the help, Peter Cordes), there was also the issue of _mm_mul_epi32() not working as I thought it did - I should've been using _mm_mullo_epi32() instead!
I know this is not the most efficient code, but it was made to work correctly first - now I can move on to optimizing it.
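For reference, a minimal standalone demo of that difference (not part of the matmul code; an assumption that SSE4.1 is available, e.g. compile with -msse4.1): _mm_mul_epi32 (pmuldq) multiplies only the even-indexed 32-bit elements and widens them to two 64-bit products, while _mm_mullo_epi32 (pmulld) multiplies all four 32-bit elements and keeps the low 32 bits of each product.

#include <smmintrin.h>   // SSE4.1
#include <stdio.h>

int main(void) {
    __m128i a = _mm_setr_epi32(1, 2, 3, 4);
    __m128i b = _mm_setr_epi32(10, 20, 30, 40);

    // pmuldq: multiplies elements 0 and 2 only, producing two signed 64-bit products
    __m128i wide = _mm_mul_epi32(a, b);
    long long w[2];
    _mm_storeu_si128((__m128i*)w, wide);
    printf("mul_epi32:   %lld %lld\n", w[0], w[1]);                // 10 90

    // pmulld: multiplies all four elements, keeping the low 32 bits of each product
    __m128i lo = _mm_mullo_epi32(a, b);
    int l[4];
    _mm_storeu_si128((__m128i*)l, lo);
    printf("mullo_epi32: %d %d %d %d\n", l[0], l[1], l[2], l[3]);  // 10 40 90 160
    return 0;
}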
(Note: don't use this, it's very, very slow.)
#include <smmintrin.h>  // SSE4.1: _mm_mullo_epi32, _mm_insert_epi32, _mm_extract_epi32

// editor's note: this is the most naive and horrible way to vectorize
void matmulSSE_inefficient(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR, vSum;   // vSum is unused
    for(i = 0; i < N; ++i) {
        for(j = 0; j < N; ++j) {
            vR = _mm_setzero_si128();
            for(k = 0; k < N; k += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                // less braindead would be to start vB with movd, avoiding a false dep and one shuffle uop.
                // vB = _mm_cvtsi32_si128(mat2[k][j]);       // but this manual gather is still very bad
                vB = _mm_insert_epi32(vB, mat2[k][j], 0);     // false dependency on old vB
                vB = _mm_insert_epi32(vB, mat2[k + 1][j], 1); // bad spatial locality
                vB = _mm_insert_epi32(vB, mat2[k + 2][j], 2); // striding down a column
                vB = _mm_insert_epi32(vB, mat2[k + 3][j], 3);
                vR = _mm_mullo_epi32(vA, vB);
                vR = _mm_hadd_epi32(vR, vR);                  // very slow inside the inner loop
                vR = _mm_hadd_epi32(vR, vR);
                result[i][j] += _mm_extract_epi32(vR, 0);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);
                //printf("\n");
            }
        }
    }
}
End of extremely inefficient code originally written by the OP
Update 2: converted the OP's example to an i-k-j loop-order version. That requires an extra load of vR and moves the store into the inner loop, but the vA broadcast can be hoisted out one loop level. It turned out faster.
// this is significantly better but doesn't do any cache-blocking
void matmulSSE_v2(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR;
    for(i = 0; i < N; ++i) {
        for(k = 0; k < N; ++k) {
            vA = _mm_set1_epi32(mat1[i][k]);
            for(j = 0; j < N; j += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i][j], vR);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);
                //printf("\n");
            }
        }
    }
}
These assume N is a multiple of 4, the vector width.
If that's not the case, it's often easier to still pad your array storage up to a multiple of the vector width, so there's padding at the end of each row and you can keep the simple j < N; j += 4 loop condition. You'll want to keep track of the real N size separately from the storage layout, with a row stride that's a multiple of 4 or 8.
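Here's a rough sketch of what that could look like (matmulSSE_padded, the flat-pointer signature, and the row_stride parameter are all assumptions, not from the original code): same i-k-j SSE loop, but over a flat row-major layout whose row stride is padded to a multiple of 4, so the j += 4 loop can run into the padding instead of needing a cleanup loop.

#include <smmintrin.h>

void matmulSSE_padded(const int *mat1, const int *mat2, int *result,
                      int n, int row_stride)   // row_stride = n rounded up to a multiple of 4
{
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            __m128i vA = _mm_set1_epi32(mat1[i*row_stride + k]);
            // The last iteration may read/write into the padding at the end of a row;
            // that's fine as long as the padding is never read back as real data.
            for (int j = 0; j < n; j += 4) {
                __m128i vB = _mm_loadu_si128((const __m128i*)&mat2[k*row_stride + j]);
                __m128i vR = _mm_loadu_si128((const __m128i*)&result[i*row_stride + j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i*row_stride + j], vR);
            }
        }
    }
}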
Otherwise you want a loop condition like j < N-3; j += 4, and a scalar cleanup for the end of a row.
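A sketch of that variant (matmulSSE_v2_cleanup is a hypothetical name; it's the same logic as matmulSSE_v2 above, plus the scalar cleanup loop, so N no longer has to be a multiple of 4):

#include <smmintrin.h>

void matmulSSE_v2_cleanup(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            __m128i vA = _mm_set1_epi32(mat1[i][k]);
            int j;
            for (j = 0; j < N - 3; j += 4) {       // full 4-element vectors only
                __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                __m128i vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i][j], vR);
            }
            for (; j < N; ++j)                      // scalar cleanup: up to 3 leftover columns
                result[i][j] += mat1[i][k] * mat2[k][j];
        }
    }
}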
Or masking, or keeping the last full vector in a register so you can _mm_alignr_epi8 with a maybe-overlapping final vector that ends at the end of the row, and maybe do a vector store. This is easier with AVX, or especially with AVX-512 masking.
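For the masking approach, here is a minimal AVX-512 sketch of handling a row tail (row_tail_avx512 is a hypothetical helper, not from the original answer; assumes AVX-512F). Only the rem leftover lanes are loaded and stored, so there's no out-of-bounds access and no double-counting of already-processed columns:

#include <immintrin.h>

// Handle the tail of one row of result += a_ik * mat2_row, columns j..n-1 (1..15 of them).
static inline void row_tail_avx512(int a_ik, const int *mat2_row, int *result_row,
                                   int j, int n)
{
    int rem = n - j;                               // leftover columns in this row
    __mmask16 m = (__mmask16)((1u << rem) - 1u);   // low "rem" lanes enabled
    __m512i vA = _mm512_set1_epi32(a_ik);
    __m512i vB = _mm512_maskz_loadu_epi32(m, &mat2_row[j]);     // masked-off lanes read as 0
    __m512i vR = _mm512_maskz_loadu_epi32(m, &result_row[j]);
    vR = _mm512_add_epi32(vR, _mm512_mullo_epi32(vA, vB));
    _mm512_mask_storeu_epi32(&result_row[j], m, vR);            // only enabled lanes are written
}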