I have an unsigned short dst[16][16] matrix and a larger unsigned char src[m][n] matrix.
I need to read a 16x16 submatrix from src and add it to dst, using SSE2 or SSE3.
In an older implementation, I was sure that the summed values never exceeded 255, so 8-bit lanes were enough and I could do this:
for (int row = 0; row < 16; ++row)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    dst[row] = _mm_add_epi8(dst[row], subMat);
    src += W; // Step to the next row I need to add
}
where W is the offset from one row I need to the next. This code works, but now the values in src are larger and the sums can exceed 255, so I need to store them as unsigned short.
I've tried the following, but it doesn't work:
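To illustrate why 8-bit lanes are no longer enough, here is a minimal sketch of the wraparound. The helper names add_u8_lane0 and demo_wraparound are made up for this example and are not part of my real code; the sketch only assumes SSE2.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: adds a and b in every 8-bit lane and returns lane 0.
unsigned char add_u8_lane0(unsigned char a, unsigned char b)
{
    const __m128i sum = _mm_add_epi8(_mm_set1_epi8(static_cast<char>(a)),
                                     _mm_set1_epi8(static_cast<char>(b)));
    // The low byte of the low 32 bits is lane 0.
    return static_cast<unsigned char>(_mm_cvtsi128_si32(sum) & 0xFF);
}

// Hypothetical usage: the second sum does not fit in a byte and wraps.
void demo_wraparound()
{
    assert(add_u8_lane0(100, 100) == 200); // fits in 8 bits
    assert(add_u8_lane0(200, 100) == 44);  // 300 wraps modulo 256
}
```

_mm_add_epi8 wraps silently instead of saturating or carrying into the next lane, which is why the sums have to live in wider lanes.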
for (int row = 0; row < 16; ++row)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    dst[row] = _mm_add_epi16(dst[row], subMat);
    src += W; // Step to the next row I need to add
}
How can I solve this problem?
EDIT
Thank you paul, but I think your offsets are wrong. I tried your solution and it seems that the submatrix's rows are added to the wrong rows of dst. I hope the right solution is this:
for (int row = 0; row < 32; row += 2)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    // Interleave with zero to widen the bytes to 16-bit values
    __m128i subMatLo = _mm_unpacklo_epi8(subMat, _mm_set1_epi8(0)); // bytes 0..7
    __m128i subMatHi = _mm_unpackhi_epi8(subMat, _mm_set1_epi8(0)); // bytes 8..15
    dst[row] = _mm_add_epi16(dst[row], subMatLo);
    dst[row + 1] = _mm_add_epi16(dst[row + 1], subMatHi);
    src += W; // Step to the next row I need to add
}
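For anyone who wants to check the approach in isolation, here is a minimal, self-contained sketch of the same idea, written against the plain unsigned short dst[16][16] declaration rather than an array of __m128i. The function name add_block_u8_to_u16 and the parameter stride (playing the role of W) are assumptions made for this example. It also swaps in _mm_loadu_si128 for _mm_lddqu_si128 so that only SSE2 is required.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: adds the 16x16 block of bytes starting at src
// (consecutive rows are `stride` bytes apart) into a 16x16 matrix of
// 16-bit accumulators.
void add_block_u8_to_u16(unsigned short (&dst)[16][16],
                         const unsigned char* src,
                         std::size_t stride)
{
    assert(stride >= 16); // each source row must hold at least 16 bytes
    const __m128i zero = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row)
    {
        // Load 16 source bytes; the address may be unaligned.
        const __m128i bytes =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
        // Interleaving with zero widens each byte to a 16-bit lane.
        const __m128i lo = _mm_unpacklo_epi8(bytes, zero); // columns 0..7
        const __m128i hi = _mm_unpackhi_epi8(bytes, zero); // columns 8..15

        // One row of dst is 16 unsigned shorts, i.e. two 128-bit vectors.
        __m128i* out = reinterpret_cast<__m128i*>(dst[row]);
        _mm_storeu_si128(out,     _mm_add_epi16(_mm_loadu_si128(out),     lo));
        _mm_storeu_si128(out + 1, _mm_add_epi16(_mm_loadu_si128(out + 1), hi));

        src += stride; // step to the next source row
    }
}
```

Calling it twice on the same source block should leave every dst element equal to twice the corresponding source byte, which is easy to compare against a scalar loop. The 16-bit lanes still wrap once a sum exceeds 65535, so this sketch assumes the number of accumulated blocks stays small enough for that not to happen.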