When I process N bytes of data with SIMD instructions (reading at least 16 bytes at once), I normally just add padding to the end of the buffer, so I can safely round up the number of 16-byte blocks to read. This time, however, I need to process data prepared by external code, so in theory the last 16-byte vector of data could partially fall outside the allocated memory range.
For example, let's imagine I have stored 22 bytes of data, starting from 1FFF FFE4:
1FFF FFE0: 00 00 00 00 01 02 03 04 05 06 07 08 09 0A 0B 0C
1FFF FFF0: 0D 0E 0F 10 11 12 13 14 15 16 00 00 00 00 00 00
Then I want to process the data above 16 bytes at a time, starting from 1FFFFFE4, like this:
MOV RDX, 1FFFFFE4
MOV RCX, 2
@MAIN:
VMOVDQU XMM0, [RDX]
... data processing
ADD RDX, 16
LOOP @MAIN
The last iteration will read 16 bytes from 1FFFFFF4, where I have only 6 valid bytes of data. The remaining 10 bytes are potentially outside the allocated memory range (particularly the last 4 bytes, starting at 20000000).
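To make the arithmetic concrete, here is a small C sketch of how I count the blocks and the overrun. The struct and function names (`tail_info`, `describe_tail`) are just illustrative, not part of my real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Describes how the final 16-byte load relates to the valid data.
   Hypothetical helper type, for illustration only. */
typedef struct {
    size_t    blocks;        /* number of 16-byte loads, rounded up      */
    uintptr_t last_load;     /* address read by the final load           */
    size_t    valid_bytes;   /* bytes of real data in the final load     */
    size_t    overrun_bytes; /* bytes the final load reads past the data */
} tail_info;

/* start: address of the first data byte; n: data length in bytes (n > 0). */
static tail_info describe_tail(uintptr_t start, size_t n)
{
    tail_info t;
    t.blocks        = (n + 15) / 16;                /* round up */
    t.last_load     = start + (t.blocks - 1) * 16;
    t.valid_bytes   = n - (t.blocks - 1) * 16;
    t.overrun_bytes = 16 - t.valid_bytes;
    return t;
}
```

For the example above, `describe_tail(0x1FFFFFE4, 22)` gives 2 blocks, a final load at 1FFFFFF4, 6 valid bytes and 10 overrun bytes.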
Can the above code fail with an access violation in the unlikely but possible situation that the last read partially exceeds the allocated memory range? Or is the read guaranteed to succeed as long as the first byte of the VMOVDQU operand is valid? Could anyone point to the exact rule for this in the Intel 64 and IA-32 Software Developer's Manual?
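My working assumption, which I have not been able to confirm in the manual, is that memory protection is enforced per page. If that is right, a 16-byte load could only fault when it crosses into a different page, and that page is not accessible. The C sketch below tests for that case. The 4 KiB page size and the names `ASSUMED_PAGE_SIZE` and `load16_crosses_page` are my assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Other page sizes would need a different value. */
#define ASSUMED_PAGE_SIZE 4096u

/* Returns true if a 16-byte load starting at addr touches two pages,
   i.e. if fewer than 16 bytes remain before the next page boundary. */
static bool load16_crosses_page(uintptr_t addr)
{
    return (addr & (ASSUMED_PAGE_SIZE - 1)) > ASSUMED_PAGE_SIZE - 16;
}
```

In my example, the load at 1FFFFFE4 stays within one page, while the load at 1FFFFFF4 crosses into the page starting at 20000000.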
If it can fail, is there any solution other than processing the end of the data in a slower but safer way (byte by byte rather than 16 bytes at a time)? This is what I did before in such cases, but it basically means writing the code twice (a SIMD version and a slow version of the same task), which means extra work and more room for bugs.
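For reference, the safe variant I have used so far looks roughly like the C sketch below. It copies only the valid tail bytes into a zeroed 16-byte temporary and loads the vector from there, so nothing past the end of the data is read. The function name `load_tail_padded` is illustrative, and the sketch assumes an x86 target with SSE2.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Loads the final, partial block without reading past the valid data.
   src points at the first byte of the tail; valid_bytes must be <= 16.
   Bytes beyond valid_bytes are returned as zero. */
static __m128i load_tail_padded(const uint8_t *src, size_t valid_bytes)
{
    uint8_t tmp[16] = {0};           /* zero padding for the unused lanes */
    assert(valid_bytes <= 16);
    memcpy(tmp, src, valid_bytes);   /* touches only the valid bytes      */
    return _mm_loadu_si128((const __m128i *)tmp);
}
```

This keeps the main SIMD algorithm unchanged, but it still needs a separate path for the last block, which is what I would like to avoid.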
As the access violation is very unlikely to happen, I'm also thinking about catching the exception, loading the data in a safe way, and jumping back. This could keep the code simple: the algorithm itself would stay the same, and only a small piece of code would be added to load the data more safely, executed only in very rare situations. The code is below, but I don't know how to catch the exception in assembly, and I don't know whether the time penalty would be small enough for this to make sense:
VMOVDQU XMM0, [RDX]
@DATALOADED:
... data processing
ADD RDX, 16
... the rest of the algorithm
@EXCEPTION: // jumps here if the VMOVDQU fails with access violation, happens rarely anyway
...load data in XMM0 in a safer way
JMP @DATALOADED
Any other suggestions that would keep the code simple are welcome.