I'm learning to use SIMD capabilities by rewriting my personal image processing library using vector intrinsics. One basic function is a simple "array +=", i.e.
void arrayAdd(unsigned char* A, unsigned char* B, size_t n) {
for (size_t i = 0; i < n; i++) { B[i] += A[i]; }
}
For arbitrary array lengths, the obvious SIMD code (assuming aligned by 16) is something like:
size_t i = 0;
__m128i xmm0, xmm1;
size_t n16 = n - (n % 16);
for (; i < n16; i+=16) {
xmm0 = _mm_load_si128( (__m128i*) (A + i) );
xmm1 = _mm_load_si128( (__m128i*) (B + i) );
xmm1 = _mm_add_epi8( xmm0, xmm1 );
_mm_store_si128( (__m128i*) (B + i), xmm1 );
}
for (; i < n; i++) { B[i] += A[i]; }
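For reference, the loop above can be wrapped into a self-contained function. This is only a sketch: the name `array_add_sse2` and the alignment assertion are additions for illustration, and it keeps the assumption that both pointers are 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* B[i] += A[i] for i in [0, n). Hypothetical helper name.
   Assumes A and B are both 16-byte aligned, as in the question. */
void array_add_sse2(const unsigned char *A, unsigned char *B, size_t n) {
    assert((uintptr_t)A % 16 == 0 && (uintptr_t)B % 16 == 0);

    size_t i = 0;
    const size_t n16 = n - (n % 16);  /* largest multiple of 16 that is <= n */

    /* Vector body: 16 bytes per iteration, wrapping 8-bit addition. */
    for (; i < n16; i += 16) {
        __m128i a = _mm_load_si128((const __m128i *)(A + i));
        __m128i b = _mm_load_si128((const __m128i *)(B + i));
        _mm_store_si128((__m128i *)(B + i), _mm_add_epi8(a, b));
    }

    /* Scalar tail: the remaining n % 16 bytes. */
    for (; i < n; i++) {
        B[i] = (unsigned char)(B[i] + A[i]);
    }
}
```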
But is it possible to do all the additions with SIMD instructions? I thought of trying this:
__m128i mask = (0x100<<8*(n - n16))-1;
_mm_maskmoveu_si128( xmm1, mask, (char*) (B + i) );
for the extra elements, but will that result in undefined behavior? The mask should guarantee that no access is actually made past the array bounds (I think).
The alternative is to do the extra elements first, but then the array needs to be aligned by n-n16, which doesn't seem right.
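As a note on the masked-store idea: `_mm_maskmoveu_si128` expects a vector mask with one byte per lane and writes only the bytes whose mask byte has its high bit set, so the mask has to be built with vector operations rather than an integer shift. A minimal sketch follows, assuming SSE2 and 0 <= r <= 16; the helper name `store_first_bytes` is hypothetical. It does not settle whether masked-off bytes can still raise a fault near a page boundary, and the instruction carries a non-temporal hint, so it may not be the fastest choice.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: write only the first r bytes of v to dst.
   Assumes 0 <= r <= 16. */
void store_first_bytes(unsigned char *dst, __m128i v, size_t r) {
    assert(r <= 16);

    /* Lane indices 0..15, lowest byte first. */
    const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);

    /* Lanes with index < r become 0xFF (high bit set), the rest 0x00. */
    const __m128i mask = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)r));

    /* Byte-masked store: only lanes whose mask high bit is set are written. */
    _mm_maskmoveu_si128(v, mask, (char *)dst);
}
```

This covers only the store. The two 16-byte loads for the tail would still read past the end of `A` and `B` unless they are handled separately.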
Is there another, more optimal pattern for such vectorized loops?
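One candidate pattern, offered as a sketch rather than a definitive answer, is to stage the leftover bytes in 16-byte temporaries so that every addition goes through `_mm_add_epi8` and nothing is read or written past the array bounds. The name `array_add_padded_tail` is hypothetical, and unaligned loads are used here so the sketch does not depend on the alignment assumption made above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* B[i] += A[i] for i in [0, n). Hypothetical helper name.
   The tail is copied into zero-padded temporaries, added with SIMD,
   and only the valid bytes are copied back. */
void array_add_padded_tail(const unsigned char *A, unsigned char *B, size_t n) {
    assert(n == 0 || (A != NULL && B != NULL));

    size_t i = 0;
    const size_t n16 = n - (n % 16);

    /* Vector body over full 16-byte blocks. */
    for (; i < n16; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + i));
        _mm_storeu_si128((__m128i *)(B + i), _mm_add_epi8(a, b));
    }

    const size_t r = n - n16;  /* leftover bytes, 0..15 */
    if (r != 0) {
        unsigned char ta[16] = {0};
        unsigned char tb[16] = {0};
        memcpy(ta, A + i, r);
        memcpy(tb, B + i, r);

        __m128i a = _mm_loadu_si128((const __m128i *)ta);
        __m128i b = _mm_loadu_si128((const __m128i *)tb);
        _mm_storeu_si128((__m128i *)tb, _mm_add_epi8(a, b));

        memcpy(B + i, tb, r);  /* write back only the r valid bytes */
    }
}
```

Whether this beats the plain scalar tail depends on the compiler and on typical array lengths, so it is worth measuring before adopting it.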