
When I process N bytes of data with SIMD instructions (reading at least 16 bytes at once), I normally just add padding to the end of the buffer, so I can safely round up the number of 16-byte blocks to read. This time, however, I need to process data prepared by external code, so it can theoretically happen that the last 16-byte vector of data partially falls outside the allocated memory range.

For example, let's imagine I have stored 22 bytes of data, starting from 1FFF FFE4:

1FFF FFE0: 00 00 00 00 01 02 03 04 05 06 07 08 09 0A 0B 0C
1FFF FFF0: 0D 0E 0F 10 11 12 13 14 15 16 00 00 00 00 00 00

Then I want to process the data above 16 bytes at a time, starting from 1FFFFFE4, like this:

MOV RDX, 1FFFFFE4 
MOV RCX, 2
@MAIN:
  VMOVDQU XMM0, [RDX]
  ... data processing
  ADD RDX, 16
LOOP @MAIN

The last iteration will read 16 bytes from 1FFFFFF4, while I only have 6 valid bytes of data there; the remaining 10 bytes are potentially outside the allocated memory range (in particular the last 4 bytes, starting at 20000000).

Can the above code fail with an access violation in the unlikely but possible situation that the last read partially exceeds the allocated memory range, or is it guaranteed not to fail as long as the first byte of the VMOVDQU operand is valid? Could anyone point me to the exact rule for this in the Intel 64 SDM?

If it can fail, is there any solution other than processing the end of the data in a slower but safer way (byte by byte rather than 16 bytes at a time)? This is what I did before in such cases, but it basically means writing the code twice (a SIMD version and a slow version of the same task), which is extra work and a source of potential bugs.

As the access violation is very unlikely to happen, I'm also thinking about catching the exception, loading the data in a safe way, and jumping back. This could keep the code simple: the algorithm itself would remain unchanged, and only a small piece of code would need to be added to load the data in a safer way, executed only in very, very rare situations. The code would look like the snippet below, but I don't know how to catch the exception in assembly, and I don't know whether the time penalty would be small enough for this to make sense:

VMOVDQU XMM0, [RDX]
@DATALOADED:  
... data processing
ADD RDX, 16
... the rest of the algorithm

@EXCEPTION:  // jumps here if the VMOVDQU fails with access violation, happens rarely anyway
...load data in XMM0 in a safer way
JMP @DATALOADED
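
To make the intent of the @EXCEPTION path clearer, here is a rough C-level sketch (intrinsics rather than ASM, and the names are only illustrative, not my actual code) of what I mean by loading the data "in a safer way": copy only the remaining valid bytes into a zeroed 16-byte local buffer, then load the vector from that buffer.

#include <immintrin.h>
#include <cstring>

// Hypothetical fallback: 'src' points at the start of the last partial block,
// 'remaining' is the number of valid bytes left (1..15).
static __m128i load_tail_safely(const unsigned char* src, std::size_t remaining)
{
    alignas(16) unsigned char buf[16] = {0};  // zero padding for the missing bytes
    std::memcpy(buf, src, remaining);         // touches only the valid bytes
    return _mm_load_si128(reinterpret_cast<const __m128i*>(buf));
}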

I'd welcome any other suggestions that could keep the code simple.

  • Which SIMD instruction set extension are you programming for? – fuz Dec 09 '22 at 18:16
  • But in general: yes, you'll get a #PF, #GP, or #SS if any byte of the operand lies outside a mapped page or exceeds a segment limit. AVX-512 has memory fault suppression, where this does not apply to masked-off elements. One workaround is to first align your data to one vector register so that no fetch crosses a page boundary. Such aligned fetches can safely exceed object boundaries, as access violations realistically only occur once you cross into a different page (barring the esoteric case of a <16M segment limit). – fuz Dec 09 '22 at 18:34
  • [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) – x86-64 only has paging, not segment limits, so the granularity is 4k. You can stop the loop when there are fewer than 16 bytes left, and check whether you're close to the end of a page before doing a full vector (a sketch of that check follows these comments), if that helps. Or load the last 16 bytes of the valid region, possibly overlapping with work you already did, if that works for your algorithm. Note that bytes outside the buffer are *not* guaranteed to be zero, so you need to carefully craft your code to ignore them. – Peter Cordes Dec 09 '22 at 19:25
  • Unfortunately, this is a common situation when dealing with SIMD code. Either you pad the input, process the last bytes separately, make overlapping idempotent, or copy the input into a suitable buffer. Reading past the buffer is considered a vulnerability. While it may not fault on average, it may be made to fault intentionally, or the loaded bytes may leak into the output or a side channel. Catching the exception will keep the loop code simple, but unless the input is long enough it will hinder performance, and the handler must be tailored to skip the right number of bytes. – Margaret Bloom Dec 09 '22 at 19:43
  • I hope you don't plan on doing that with stack- or malloc-allocated memory because that sounds like a sure-fire way to corrupt the heap or stack. At least that is far more likely than getting an access violation. – Homer512 Dec 09 '22 at 19:55
  • fuz: I'm only using AVX2 so far (the software is supposed to run on most computers, while AVX-512 is still only in fancy/expensive ones). What you say about memory fault suppression is interesting, though – can you indicate the exact instruction(s) for how to do this? – Ádám Bíró Dec 11 '22 at 08:06
  • Peter Cordes: I'm going to read the page you linked; it seems to cover the topic I'm interested in. Until then: are you saying that if address ...000 is valid in my code, then it is guaranteed that ...FFF will also be valid, due to the granularity of x86-64 (and that this has nothing to do with Windows, but with the inherent structure of the processor)? – Ádám Bíró Dec 11 '22 at 08:14
  • You need to @ their names if you want fuz and Peter to get notified. Now they will only see your comments if they check the post on their own. https://meta.stackexchange.com/questions/43019/how-do-comment-replies-work – Homer512 Dec 11 '22 at 09:27
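
To make the page-boundary suggestion from the comments concrete, here is a minimal sketch in C++ with intrinsics (the helper names are made up, not from any code above): a 16-byte unaligned load can only fault if it crosses into the next 4 KiB page, so it is safe whenever the page offset of the address is at most 0xFF0.

#include <immintrin.h>
#include <cstdint>

// Sketch only: true if a 16-byte load starting at p cannot touch the next
// 4 KiB page, i.e. bytes p..p+15 all lie within the same page as p.
static inline bool load16_stays_in_page(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0xFFF) <= 0xFF0;
}

// Hypothetical use for the final, partially valid block: if the full 16-byte
// load stays inside the current page it cannot fault, although the bytes past
// the buffer are garbage and must be ignored by the rest of the algorithm.
static __m128i load_final_block(const unsigned char* p)
{
    if (load16_stays_in_page(p))
        return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // Otherwise fall back to a safe partial load (e.g. copy the valid bytes
    // into a zeroed 16-byte buffer first); placeholder only.
    return _mm_setzero_si128();
}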

1 Answer


Here is my take on dealing with this. I'm using a partially overlapped final iteration (plus an optional one for the initial vector loop alignment).

The advantage of this approach is that the last few elements can be dealt with in a single extra loop iteration.

The downsides are:

  • Needs a fallback if the entire array is less than 16 bytes
  • May lead to costly load-store forwarding stalls in read-modify-write loops. Use it for a[i] = b[i] + c[i] but not for a[i] += b[i]. If aliasing is possible, it is easy enough to modify the code to check for the case a == b || a == c and use the fallback
  • May need some hardware-specific tuning when ported to AVX2 or AVX-512. Specifically: should the final iteration use the full 32- or 64-byte vectors, or should it only be used for the final 16-byte vector?
  • Not applicable if the elements are not position-invariant within a vector register, e.g. if you do shuffling, variable shifting, etc.

I'm also tossing in an optional alignment of one of the memory locations; here I chose the output. I don't think that is particularly necessary for AVX but it uses the same technique and might come in handy if you adapt to SSE2 or AVX512.

I'm writing this in C++ with Intel intrinsics but the assembler output is very readable if you want to adapt it into ASM.

#include <immintrin.h>

#include <cstddef>

void vector_add(float* out, std::ptrdiff_t n, const float* left, const float* right)
{
    __m128 left_i, right_i, out_i;
    std::ptrdiff_t i = 0;
    if(n >= 4) {
#     ifdef ALIGN_OUTPUT
        /*
         * Optional: Do one unaligned iteration, then move the counter
         * up to the first 16-byte aligned output element
         */
        left_i = _mm_loadu_ps(left);
        right_i = _mm_loadu_ps(right);
        out_i = _mm_add_ps(left_i, right_i);
        _mm_storeu_ps(out, out_i);
        i = ((reinterpret_cast<std::ptrdiff_t>(out + 4) & ~15)
            - reinterpret_cast<std::ptrdiff_t>(out)) / sizeof(float);
#     endif
        for(; n - i >= 4; i += 4) {
            left_i = _mm_loadu_ps(left + i);
            right_i = _mm_loadu_ps(right + i);
            out_i = _mm_add_ps(left_i, right_i);
#         ifdef ALIGN_OUTPUT
            _mm_store_ps(out + i, out_i);
#         else
            _mm_storeu_ps(out + i, out_i);
#         endif
        }
        if(n - i > 0) {
            /*
             * Since we know we had at least 4 elements, we can just
             * repeat the operation for the last full vector.
             * If we use ALIGN_OUTPUT, have misaligned pointers, and n == 4,
             * then we compute the same 4 elements twice.
             * Probably not worth fixing
             */
            i = n - 4;
            left_i = _mm_loadu_ps(left + i);
            right_i = _mm_loadu_ps(right + i);
            out_i = _mm_add_ps(left_i, right_i);
            _mm_storeu_ps(out + i, out_i);
        }
        return;
    }
    /* Fallback if n <= 3 */
    if(n >= 2) {
        left_i = _mm_loadl_pi(_mm_undefined_ps(), (const __m64*) left);
        right_i = _mm_loadl_pi(_mm_undefined_ps(), (const __m64*) right);
        out_i = _mm_add_ps(left_i, right_i);
        _mm_storel_pi((__m64*) out, out_i);
        i = 2;
    }
    if(n - i >= 1)
        out[i] = left[i] + right[i];
}
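
For reference, a hypothetical call site (the buffer size and contents are made up, only to show how the function is meant to be called; 22 elements force the overlapped final iteration):

#include <vector>

int main()
{
    std::vector<float> a(22, 1.0f), b(22, 2.0f), c(22);
    vector_add(c.data(), static_cast<std::ptrdiff_t>(c.size()), a.data(), b.data());
}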
Homer512
  • Although I believe it won't solve my particular problem, it's quite a useful approach indeed. In my case the problem is that I need to de-interleave data, e.g. RGBRGB... into RR... GG... BB..., so from one 16-byte register (storing 5 RGB pixels) I'll shuffle out three registers of 5 output bytes each, then increment the input by 15 and the 3 outputs by 5 bytes each. The problem is the extra 1 byte at the end, which could theoretically cause a #GP, and aligning the data to the end of the register would need different shuffle codes. But for CMYKCMYK...-type data it would work just fine. – Ádám Bíró Dec 12 '22 at 14:29
  • @ÁdámBíró If you know your hardware has fast `vmaskmov` instructions, you can use this approach instead (a minimal sketch of the idea follows these comments): https://stackoverflow.com/questions/74515863/what-is-the-best-way-to-loop-avx-for-un-even-non-aligned-array/74517619#74517619 – Homer512 Dec 12 '22 at 15:08
  • Interesting idea indeed; I see the Intel manual emphasizes that this will not cause a protection fault as long as the mask is correct. In my particular case it won't work, though, as I deal with 3-byte data units (rather than 4- or 8-byte ones), so I'd need a byte-level mask. – Ádám Bíró Dec 14 '22 at 14:14
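
For readers following the `vmaskmov` suggestion, here is a minimal sketch of the idea assuming 32-bit (dword) elements; the function name and the way the mask is built are illustrative, not taken from the linked answer. As the last comment notes, this works at 4- or 8-byte granularity only; byte-granular masking needs AVX-512BW.

#include <immintrin.h>
#include <cstddef>

// Sketch: masked load/store of a final partial vector of floats.
// 'remaining' is the number of valid elements left (1..7).
static void masked_tail_add(float* out, const float* left, const float* right,
                            std::ptrdiff_t remaining)
{
    // Lane i is enabled (all-ones mask) when i < remaining.
    const __m256i lane_ids = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32((int)remaining), lane_ids);

    // vmaskmovps suppresses faults for disabled lanes, so lanes past the last
    // valid element are neither loaded into the result nor stored.
    __m256 l = _mm256_maskload_ps(left, mask);
    __m256 r = _mm256_maskload_ps(right, mask);
    _mm256_maskstore_ps(out, mask, _mm256_add_ps(l, r));
}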