12

I have a simple test program that loads an xmm register with the movdqu instruction accessing data across a page boundary (OS = Linux).

If the following page is mapped, this works just fine. If it's not mapped then I get a SIGSEGV, which is probably expected.

However, this diminishes the usefulness of unaligned loads quite a bit. Additionally, SSE4.2 instructions (like pcmpistri), which allow unaligned memory references, appear to exhibit this behavior as well.

That's all fine, except that many of the strcmp implementations using pcmpistri that I've found don't seem to address this issue at all. I've been able to contrive trivial test cases that cause these implementations to fail, while the trivial byte-at-a-time strcmp implementation works just fine with the same data layout.

One more note: the GNU C library implementation for 64-bit Linux has a __strcmp_sse42 variant that appears to use the pcmpistri instruction in a safer manner. The implementation of this strcmp is fairly complex, but it appears to be carefully trying to avoid the page boundary issue. I'm not sure if that's due to the issue I describe above, or whether it's just a side effect of trying to get better performance by aligning the data.

Anyway the question I have is primarily -- where can I find out more about this issue? I've typed in "movdqu crossing page boundary" and every variant of that I can think of to Google, but haven't come across anything particularly useful. If anyone can point me to further info on this it would be greatly appreciated.

user3299291
  • The `__strcmp_sse42` implementation is probably doing that to avoid the performance hit of crossing a page boundary. Intel processors (not sure about the latest ones) have had a history of appalling performance on mis-aligned accesses that cross page-boundaries. The page-fault issue should be irrelevant though. – Mysticial Feb 11 '14 at 22:52
  • I'm very curious about the answer to this one. The Intel Optimization Manual (section 10.3.6) only says that "Unaligned 128-bit SIMD memory access can fetch data cross page boundary, since system software manages memory access rights with page granularity.". Maybe try reproducing the same bug on some other OS? – Daniel Kamil Kozar Feb 11 '14 at 22:56
  • Or rather, the OS will respond to the page fault and page it in, invisibly to the application (aside from a huge performance hit). Or crash the app if it isn't allocated. In which case, it's standard UB from accessing unallocated memory. – Mysticial Feb 11 '14 at 22:59
  • What exactly is the issue? `strcmp` will also generate `SIGSEGV` if you pass an unterminated string and let it run into a non-mapped page. That's just what accessing a non-mapped page does. – Damon Feb 11 '14 at 23:11
  • In response to the last comment: I've carefully constructed a test where a string with the value "test" and the '\0' byte is at offset 4090 of a 4K page. The following memory page is unmapped. When I use strcmp with that string as an argument, things work fine. When I try the comparable pcmpistri instruction, it attempts to load the entire 16-byte block, crossing into the next page and triggering the SIGSEGV. This is what is limiting the usefulness of pcmpistri for me, as well as why I'm wondering about some of the strcmp implementations using it that I've found. – user3299291 Feb 11 '14 at 23:20

2 Answers

8

First, any algorithm which tries to access an unmapped address will cause a SegFault. If a non-AVX code flow used a 4 byte load to access the last byte of a page and the first 3 bytes of "the next page" which happened to not be mapped then it would also cause a SegFault. No? I believe that the "issue" is that the AVX(1/2/3) registers are so much bigger than "typical" that algorithms which were unsafe (but got away with it) get caught if they are trivially extended to the larger registers.

Aligned loads (MOVDQA) can never have this problem since they don't cross any boundaries of their own size or greater. Unaligned loads CAN have this problem (as you've noted) and "often" do. The reason for this is that the instruction is defined to load the full size of the target register. You need to look at the operand types in the instruction definitions quite carefully. It doesn't matter how much of the data you are interested in. It matters what the instruction is defined to do.
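To make the "full width of the register" point concrete, here is a minimal sketch of a guard that decides whether a 16-byte unaligned load starting at a given address would touch the following page. It assumes a 4096-byte page size (on a real system you would query it, for example with `sysconf(_SC_PAGESIZE)`), and the helper names `load16_crosses_page` and `load16_string_safe` are hypothetical, not taken from any library.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as in the question's test setup. */
#define PAGE_SIZE_ASSUMED 4096u

/* Nonzero if a 16-byte load starting at p would extend into the next page. */
static int load16_crosses_page(const void *p) {
    return ((uintptr_t)p & (PAGE_SIZE_ASSUMED - 1)) > PAGE_SIZE_ASSUMED - 16;
}

/* Load the first 16 bytes of a NUL-terminated string without touching
 * the page after the one that holds the start of the load, unless the
 * string itself extends into it. */
static __m128i load16_string_safe(const char *s) {
    assert(s != NULL);
    if (!load16_crosses_page(s)) {
        /* Fast path: all 16 bytes lie in the same (mapped) page. */
        return _mm_loadu_si128((const __m128i *)s);
    }
    /* Slow path: copy byte by byte, stopping at the terminator, so no
     * byte beyond the end of the string is ever read. */
    unsigned char buf[16] = {0};
    for (size_t i = 0; i < 16 && s[i] != '\0'; i++) {
        buf[i] = (unsigned char)s[i];
    }
    return _mm_loadu_si128((const __m128i *)buf);
}
```

On the fast path the bytes after the terminator are whatever happens to be in memory, while on the slow path they are zero. That difference should not matter to an instruction such as pcmpistri, which stops at the first NUL byte.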

However...

AVX1 (Sandy Bridge) added a "masked move" capability (VMASKMOVPS/VMASKMOVPD) which is slower than a movdqa or movdqu but will not (architecturally) access the unmapped page so long as the mask is not enabled for the portion of the access which would have fallen in that page. This is meant to address the issue. In general, moving forward, it appears that masked portions of loads/stores (see AVX-512) will not cause access violations on IA either.
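As a rough illustration of the masked move, the sketch below uses the AVX intrinsic `_mm_maskload_ps` to load only the first `n` floats at an address and zero the remaining lanes. The function name `load_first_n_floats` is hypothetical, and the `target("avx")` attribute assumes GCC or Clang. Note that the AVX1 masked moves work on 32-bit or 64-bit elements, so they do not directly give the byte granularity a string routine would want; byte-granular masks only arrive with AVX-512BW.

```c
#include <assert.h>
#include <immintrin.h>  /* AVX intrinsics */
#include <stdint.h>

/* Load the first n (0..4) floats starting at p; the other lanes are
 * zeroed. Lanes whose mask bit is clear are not accessed
 * architecturally, so they may fall in an unmapped page. */
__attribute__((target("avx")))
static __m128 load_first_n_floats(const float *p, int n) {
    assert(n >= 0 && n <= 4);
    const __m128i lane = _mm_setr_epi32(0, 1, 2, 3);
    /* All-ones (sign bit set) in every lane whose index is below n. */
    const __m128i mask = _mm_cmplt_epi32(lane, _mm_set1_epi32(n));
    return _mm_maskload_ps(p, mask);
}
```

A caller should check for AVX support at run time (for example with `__builtin_cpu_supports("avx")` on GCC/Clang) before using this path.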

(It is a bummer about PCMPxSTRx behavior. Perhaps you could add 15 bytes of padding to your "string" objects?)

Mike Julier
3

Facing a similar problem with a library I was writing, I got some information from a very helpful contributor.

The core of the idea is to align the 16-byte reads to the end of the string, then handle the leftover bytes at the beginning. This works because the end of the string must live in an accessible page, and you are guaranteed that the 16-byte truncated starting address must also live in an accessible page.

Since we never read past the string we cannot potentially stray into a protected page.

To handle the initial set of bytes, I chose to use the PCMPxSTRM instructions, which return the bitmask of matching bytes. Then it's simply a matter of shifting the result to ignore any mask bits that occur before the true beginning of the string.
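The align-then-shift idea can be sketched with a simpler operation than strcmp. The example below is not the library code referred to above: it is a strlen-style scan that swaps in plain SSE2 (`pcmpeqb` plus `pmovmskb`) for PCMPxSTRM, and the name `strlen_aligned_sse2` is hypothetical. It rounds the start address down to a 16-byte boundary, so every load is aligned and cannot straddle a page, then shifts the resulting bitmask to discard the bytes that precede the real start of the string. It assumes GCC or Clang for `__builtin_ctz`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Bitmask with bit i set when byte i of the aligned 16-byte block is NUL. */
static unsigned nul_mask16(const __m128i *block) {
    const __m128i zero = _mm_setzero_si128();
    return (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128(block), zero));
}

static size_t strlen_aligned_sse2(const char *s) {
    assert(s != NULL);
    uintptr_t addr = (uintptr_t)s;
    unsigned misalign = (unsigned)(addr & 15);
    /* Truncated start address: same page as s, 16-byte aligned. */
    const __m128i *block = (const __m128i *)(addr - misalign);

    /* First block: drop the mask bits for bytes that lie before s. */
    unsigned mask = nul_mask16(block) >> misalign;
    if (mask != 0) {
        return (size_t)__builtin_ctz(mask);
    }

    /* Later blocks are fully inside the string until a NUL shows up. */
    size_t len = 16 - misalign;
    for (;;) {
        block++;
        mask = nul_mask16(block);
        if (mask != 0) {
            return len + (size_t)__builtin_ctz(mask);
        }
        len += 16;
    }
}
```

The loads can read bytes before the start and after the end of the string object, which is outside what the C standard guarantees, but they stay inside aligned 16-byte blocks that each contain at least one byte of the string, so they never reach a page the string does not occupy. Tools such as AddressSanitizer may still flag these reads.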

Shepmaster