8

I'm learning to use SIMD capabilities by rewriting my personal image processing library using vector intrinsics. One basic function is a simple "array +=", i.e.

void arrayAdd(unsigned char* A, unsigned char* B, size_t n) {
    for (size_t i = 0; i < n; i++) { B[i] += A[i]; }
}

For arbitrary array lengths, the obvious SIMD code (assuming aligned by 16) is something like:

size_t i = 0;
__m128i xmm0, xmm1;
size_t n16 = n - (n % 16);              // largest multiple of 16 <= n
for (; i < n16; i += 16) {
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    xmm1 = _mm_add_epi8( xmm0, xmm1 );
    _mm_store_si128( (__m128i*) (B + i), xmm1 );
}
for (; i < n; i++) { B[i] += A[i]; }    // scalar epilog for the remainder

But is it possible to do all the additions with SIMD instructions? I thought of trying this:

__m128i mask = (0x100<<8*(n - n16))-1;
_mm_maskmoveu_si128( xmm1, mask, (__m128i*) (B + i) );

for the extra elements, but will that result in undefined behavior? The mask should guarantee no access is actually made past the array bounds (I think). The alternative is to do the extra elements first, but then the array needs to be aligned by n-n16, which doesn't seem right.

Is there another, more optimal pattern for such vectorized loops?

reve_etrange
  • You could ensure that in your code the array lengths are always multiples of 16 bytes (though possibly fewer elements are actually used), so this epilog never comes up. But the epilog is really not important in terms of speed. – Walter Apr 16 '12 at 15:34

1 Answer

7

One option is to pad your array to a multiple of 16 bytes. Then you can do the 128-bit load/add/store across the whole array and simply ignore the results beyond the point you care about.
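
For instance, a minimal sketch of that padding approach, assuming you control allocation so both buffers are 16-byte aligned and their capacity is rounded up to a multiple of 16 (the arrayAddPadded name is just for illustration):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

void arrayAddPadded(unsigned char* A, unsigned char* B, size_t n) {
    // Both A and B must be 16-byte aligned and allocated with at least
    // ((n + 15) & ~15) bytes (e.g. via _mm_malloc or aligned_alloc).
    size_t nPadded = (n + 15) & ~(size_t)15;
    for (size_t i = 0; i < nPadded; i += 16) {
        __m128i a = _mm_load_si128( (__m128i*) (A + i) );
        __m128i b = _mm_load_si128( (__m128i*) (B + i) );
        _mm_store_si128( (__m128i*) (B + i), _mm_add_epi8(a, b) );
    }
    // Bytes B[n..nPadded-1] hold sums of the padding bytes; callers just ignore them.
}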

For large arrays, though, the overhead of the byte-by-byte "epilog" is going to be very small. Unrolling the loop may improve performance more; something like:

size_t n32 = n - (n % 32);              // largest multiple of 32 <= n
__m128i xmm2, xmm3;
for (; i < n32; i += 32) {
    xmm0 = _mm_load_si128( (__m128i*) (A + i) );
    xmm1 = _mm_load_si128( (__m128i*) (B + i) );
    xmm2 = _mm_load_si128( (__m128i*) (A + i + 16) );
    xmm3 = _mm_load_si128( (__m128i*) (B + i + 16) );
    xmm1 = _mm_add_epi8( xmm0, xmm1 );
    xmm3 = _mm_add_epi8( xmm2, xmm3 );
    _mm_store_si128( (__m128i*) (B + i), xmm1 );
    _mm_store_si128( (__m128i*) (B + i + 16), xmm3 );
}
// Do another 128 bit load/add/store here if required

But it's hard to say without doing some profiling.

You could also do an unaligned load/store at the end (assuming you have more than 16 bytes), though this will probably not make a big difference. E.g. if you have 20 bytes you do one aligned load/add/store at offset 0 and another unaligned load/add/store (_mm_loadu_si128, _mm_storeu_si128) at offset 4.
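
A sketch of that tail handling, adjusted for the fact that B is updated in place (re-adding the overlapped bytes would double-count them, so the tail vectors are loaded before the main loop runs over them; the comments below discuss this point). It assumes n >= 16 and reuses n16 from the question's code:

__m128i tailA = _mm_loadu_si128( (__m128i*) (A + n - 16) );   // last 16 bytes of A
__m128i tailB = _mm_loadu_si128( (__m128i*) (B + n - 16) );   // last 16 bytes of B, read before they are modified
__m128i tailSum = _mm_add_epi8(tailA, tailB);

for (size_t i = 0; i < n16; i += 16) {                        // aligned main loop as before
    __m128i a = _mm_load_si128( (__m128i*) (A + i) );
    __m128i b = _mm_load_si128( (__m128i*) (B + i) );
    _mm_store_si128( (__m128i*) (B + i), _mm_add_epi8(a, b) );
}

if (n16 != n)                                                 // overlapping, unaligned tail store
    _mm_storeu_si128( (__m128i*) (B + n - 16), tailSum );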

You could use _mm_maskmoveu_si128 but you need to get the mask into an xmm register, and your sample code isn't going to work. You probably want to set the mask register to all FF's and then use a shift to align it. At the end of the day, it will probably be slower than the unaligned load/add/store.

This would be something like:

mask = _mm_cmpeq_epi8(mask, mask); // Set to all FF's
mask = _mm_srli_si128(mask, 16-(n%16)); // Align mask
_mm_maskmoveu_si128(xmm, mask, (char*) (A + i));
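
In practice the shift count for _mm_srli_si128 has to be a compile-time constant, so a runtime-variable mask is easier to get from a lookup table, e.g. an unaligned load from a sliding window of 0xFF/0x00 bytes (this just combines the lookup-table and window suggestions from the comments below; mask_window is an illustrative name):

static const unsigned char mask_window[32] = {
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
};

size_t rem = n % 16;                 // number of tail bytes, here assumed 1..15
// Loading at offset 16 - rem yields a mask whose first rem bytes are 0xFF.
__m128i mask = _mm_loadu_si128( (const __m128i*) (mask_window + 16 - rem) );
// xmm1 holds the summed tail vector; only its first rem bytes are stored to B.
// (The aligned 16-byte loads that produced it may read past n, but being
// 16-byte aligned they cannot cross a page boundary - see the comments.)
_mm_maskmoveu_si128( xmm1, mask, (char*) (B + i) );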
Guy Sirton
  • In practice I would put the masks in a lookup table. Do you think it would still be slower than the "epilog" loop? – reve_etrange Apr 16 '12 at 03:05
  • @reve_etrange: Likely not slower but it is difficult to know without measuring the two solutions. Give it a try. – Guy Sirton Apr 16 '12 at 03:43
  • I'll give it a shot. But is it a legal memory access? Since *some* value of `mask` could cause an array bounds violation. – reve_etrange Apr 16 '12 at 03:46
  • @reve_etrange: The store is mostly OK, it is byte by byte, you'll still need to do a load though so you'll need to be a bit careful there. If the address is illegal you will get an exception on either the load or the store but as long as the address is good only the affected bytes will be touched. – Guy Sirton Apr 16 '12 at 03:50
  • @reve_etrange: there is that risk but it doesn't mean everything has to be padded, if you have vectors in contiguous memory it can be guaranteed safe if the very last one is padded. – Guy Sirton Apr 16 '12 at 03:57
  • @reve_etrange: my reading of Intel's docs here is that MASKMOVDQU will not cause an exception if the address of bytes not written is illegal (unless the mask is zero). So it's safe as long as you check for zero. – Guy Sirton Apr 16 '12 at 04:05
  • OK, but the final array would *need* to be padded to prevent a segfault on the load operation? – reve_etrange Apr 16 '12 at 04:20
  • @reve_etrange: Yes, so B in your example because you are reading and then writing back... – Guy Sirton Apr 16 '12 at 04:51
  • You don't need the padding to avoid illegal access - the "end bit" can never cross a page boundary anyway. (assuming the beginning is 16-aligned, which it should be) – harold Apr 16 '12 at 14:06
  • **`_mm_maskmoveu_si128` is disastrously slow, with an NT hint**. You basically never want to use it for this. Only the AVX version that does masking at 32-bit element-size granularity. See [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) for some ideas on generating the mask with an unaligned load from a window of 0xFF and 0x00 bytes. Note that **`_mm_srli_si128` (`psrldq`) needs the byte-shift count to be a compile-time constant** so that slow way doesn't even work. – Peter Cordes May 10 '22 at 13:50
  • For arrays bigger than 1 vector, ideally you can do an unaligned load that ends at the end of the arrays, so your final store is potentially overlapping. Takes some extra work when you're modifying one array in-place especially in a non-idempotent way, e.g. doing that load before looping through the array, potentially costing an extra TLB and cache miss. – Peter Cordes May 10 '22 at 13:54
  • Or maybe end the vector loop 1.x vectors early, so you do 2 pairs of loads, one normal and one potentially overlapping it, before you do either store. More code size from more peeling, but it can avoid ever needing a scalar cleanup loop. (Except for really short arrays, shorter than 1 full-size vector. But that code can be "cold", hopefully put somewhere that it doesn't normally take space in the I-cache.) – Peter Cordes May 10 '22 at 13:58