
I have an unsigned short dst[16][16] matrix and a larger unsigned char src[m][n] matrix.

Now I have to read a 16x16 submatrix from src and add it to dst, using SSE2 or SSE3.

In an older implementation, I was sure that my summed values never exceeded 255 (so they still fit in 8 bits), and I could do this:

for (int row = 0; row < 16; ++row)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    dst[row] = _mm_add_epi8(dst[row], subMat);
    src += W; // Step to the next row I need to add
}

where W is the row stride used to step to the next row I need. This code works, but now my values in src are larger and the sums can exceed 255, so I need to store them as unsigned short.

I've tried the following, but it doesn't work.

for (int row = 0; row < 16; ++row)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    dst[row] = _mm_add_epi16(dst[row], subMat);
    src += W; // Step to the next row I need to add
}

How can I solve this problem?

EDIT

Thank you Paul, but I think your offsets are wrong. I've tried your solution, and it seems that the submatrix's rows are added to the wrong rows of dst. Since each dst row of 16 unsigned shorts spans two __m128i vectors (32 vectors in total), I hope the right solution is this:

for (int row = 0; row < 32; row += 2)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i subMatLo = _mm_unpacklo_epi8(subMat, _mm_set1_epi8(0));
    __m128i subMatHi = _mm_unpackhi_epi8(subMat, _mm_set1_epi8(0));
    dst[row] = _mm_add_epi16(dst[row], subMatLo);
    dst[row + 1] = _mm_add_epi16(dst[row + 1], subMatHi);
    src += W;
}
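
For clarity, a plain scalar version of what I'm trying to achieve (assuming W is the row stride of src, as in the code above) would look roughly like this:

for (int row = 0; row < 16; ++row)
{
    for (int col = 0; col < 16; ++col)
        dst[row][col] += src[col]; // sums are held in 16 bits, so no 8-bit overflow
    src += W; // Step to the next row of the submatrix
}
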
  • Don't use _mm_set1_epi8() for a value of 0; if the compiler is not paying attention, it can expand into multiple instructions. Use _mm_setzero_si128() instead; it's guaranteed to be a single instruction (XOR). – BitBank Nov 10 '12 at 17:20
  • @BitBank: any decent compiler should deal with this OK, but if you suspect your compiler is not doing a good job with this loop then you can just hoist the zero constant out of the loop. – Paul R Nov 10 '12 at 19:15
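
A minimal sketch of the loop from the edit above with the zero constant hoisted out, as these comments suggest (same dst, src, and W as before):

const __m128i zero = _mm_setzero_si128(); // single PXOR, created once outside the loop

for (int row = 0; row < 32; row += 2)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i subMatLo = _mm_unpacklo_epi8(subMat, zero); // low 8 bytes -> 8 x 16-bit
    __m128i subMatHi = _mm_unpackhi_epi8(subMat, zero); // high 8 bytes -> 8 x 16-bit
    dst[row] = _mm_add_epi16(dst[row], subMatLo);
    dst[row + 1] = _mm_add_epi16(dst[row + 1], subMatHi);
    src += W;
}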

1 Answer


You need to unpack your vector of 16 x 8-bit values into two vectors of 8 x 16-bit values and then add both of these vectors to your destination:

for (int row = 0; row < 16; ++row)
{
    __m128i subMat = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));
    __m128i subMatLo = _mm_unpacklo_epi8(subMat, _mm_set1_epi8(0)); // zero-extend low 8 bytes to 16 bits
    __m128i subMatHi = _mm_unpackhi_epi8(subMat, _mm_set1_epi8(0)); // zero-extend high 8 bytes to 16 bits
    dst[row] = _mm_add_epi16(dst[row], subMatLo);
    dst[row + 1] = _mm_add_epi16(dst[row + 1], subMatHi);
    src += W;
}
  • Are you sure it is right? I think the offsets are wrong. I've edited my question with your edited solution. – pompolus Nov 10 '12 at 16:13
  • Well it's not clear from your question what you're trying to achieve exactly - it would be better if you had provided working scalar code as a reference. The general principle is correct however, i.e. you need to unpack to 16 bits. The details for the remaining parts of the code shouldn't be too hard to work out, but if you can clarify what you want the code to do then I can probably help further. – Paul R Nov 10 '12 at 19:10
  • Probably my English is not the best, but your code helped me a lot. By the way, I basically need to add every n-th row of a char matrix to the n-th row of the ushort matrix. To achieve that, I need to iterate up to 32 and add 2 to the iterator on each loop. With your originally posted code, the rows were added in the wrong places. – pompolus Nov 11 '12 at 01:32
  • I had never heard of `_mm_lddqu_si128` until today. What's special about it? Why is it not used more? What's better about it than `_mm_loadu_si128`? – Z boson Jul 14 '16 at 08:34
  • @Zboson: actually I only used it because that was what the OP had started out with. Looking at the docs, apparently it "may perform better than `_mm_loadu_si128` when the data crosses a cache line boundary." The instruction is `lddqu` and it's only available with SSE3 or later. – Paul R Jul 14 '16 at 09:10
  • According to [this comment](https://stackoverflow.com/questions/24816728/fastest-way-to-transpose-4x4-byte-matrix/24819208#comment64138383_24819208) it "will perform better under certain circumstances, but never perform worse." It sounds like people should be using it more. – Z boson Jul 14 '16 at 09:13
  • @Zboson: yes, given that SSE3 is pretty much the baseline these days, it should probably be regarded as the default for unaligned loads. – Paul R Jul 14 '16 at 09:29
  • [I asked a question](https://stackoverflow.com/questions/38370622/a-faster-integer-sse-unalligned-load-thats-rarely-used). – Z boson Jul 14 '16 at 09:34
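
For reference, a minimal sketch contrasting the two unaligned-load intrinsics discussed in these comments (the function and pointer name are just illustrative):

#include <emmintrin.h> // SSE2: _mm_loadu_si128
#include <pmmintrin.h> // SSE3: _mm_lddqu_si128

void load_example(const unsigned char* src)
{
    // SSE2 unaligned load - always available when targeting SSE2
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));

    // SSE3 unaligned load - reportedly never slower, and can be faster when
    // the load crosses a cache-line boundary, but requires SSE3
    __m128i b = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(src));

    (void)a; (void)b; // only illustrating the loads
}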