
Basically I am looking to implement the following in x86_64 assembly as fast as possible. (Where foo and bar may be something like glibc's hand-written asm strcpy or strcmp, and we want to start out with wide vectors, but without the safety and/or performance downsides of a page-split load when one isn't needed, or of an AVX-512 masked store: fault suppression works for correctness, but is slow if it actually has to suppress a fault in the destination.)

#include <stdint.h>
#include <immintrin.h>

#define TYPE __m256i

/* The two paths; think glibc's hand-written asm strcpy or strcmp.  */
int foo_handling_page_cross(void *ptr1, void *ptr2);
int bar_with_no_page_cross(void *ptr1, void *ptr2);

int has_page_cross(void *ptr1, void *ptr2) {
   uint64_t ptr1_u64 = (uint64_t)ptr1;
   uint64_t ptr2_u64 = (uint64_t)ptr2;
   /* Keep only the offset within the 4 KiB page.  */
   ptr1_u64 &= 4095;
   ptr2_u64 &= 4095;
   if((ptr1_u64 + sizeof(TYPE)) > 4096
      || (ptr2_u64 + sizeof(TYPE)) > 4096) {
      /* At least one of the loads would cross a page boundary.  */
      return foo_handling_page_cross(ptr1, ptr2);
   }
   return bar_with_no_page_cross(ptr1, ptr2);
}

There are a lot of really efficient ways to do this for one pointer, many of which are discussed in Is it safe to read past the end of a buffer within the same page on x86 and x64?, but there does not seem to be a particularly efficient approach for two pointers that does not sacrifice accuracy.
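
For reference, the single-pointer version from that question boils down to one compare against the page offset; a minimal C sketch (the function name here is mine):

    #include <stdint.h>

    // A 32-byte (__m256i) access starting at ptr stays within one 4 KiB page
    // iff its page offset is at most 4096 - 32.
    static inline int one_ptr_crosses_page(const void *ptr) {
        return ((uintptr_t)ptr & 4095) > 4096 - 32;
    }

The hard part is combining two of these checks without adding latency or extra branches.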

Approaches

From here on out, assume ptr1 starts in rdi and ptr2 starts in rsi. The load size will be represented by the constant LSIZE.

Fast with False Positives

                                        // cycles, bytes
    movl    %edi, %eax                  // 0     , 2   # assuming mov-elimination
    orl     %esi, %eax                  // 0     , 5   #  which Ice Lake disabled
    andl    $4095, %eax                 // 1     , 10
    cmpl    $(4096 - LSIZE), %eax       // 2     , 15
    ja      L(page_cross)              

    /* fewer bytes
    movl    %edi, %eax                  // 0     , 2
    orl     %esi, %eax                  // 1     , 5
    sall    $20, %eax                   // 2     , 8
    cmpl    $(4096 - LSIZE) << 20, %eax // 3     , 13
    ja      L(page_cross)
     */
  • Latency : 3c
  • Throughput: ~1.08c measured (both versions).
  • Bytes : 13b

This approach is nice because it is fast, with 3c latency (assuming movl %edi, %eax is eliminated), has high throughput, and is pretty compact for the front end.

The obvious drawback is that it will have false positives, e.g. rdi = 4000, rsi = 95. I think, though, that its performance should serve as the goal for a fully correct solution.
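
A C model of the check (a sketch for illustration; the name is mine) makes the false positive easy to see:

    // OR the two page offsets together: if the combined offset lands in the
    // last LSIZE - 1 bytes of a page, conservatively report a page cross.
    int maybe_page_cross(void *p1, void *p2) {
        uint32_t combined = ((uint32_t)(uintptr_t)p1
                             | (uint32_t)(uintptr_t)p2) & 4095;
        return combined > 4096 - 32;   // LSIZE = 32
    }

    // False positive: offsets 4000 (0xFA0) and 95 (0x5F) OR to 0xFFF = 4095,
    // which is > 4064 even though neither access actually crosses a page.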

Slower but Correct

This is the best I've been able to come up with:

                                        // cycles, bytes
    leal    (LSIZE - 1)(%rdi), %eax     // 0     , 4
    leal    (LSIZE - 1)(%rsi), %edx     // 0     , 8
    xorl    %edi, %eax                  // 1     , 11
    xorl    %esi, %edx                  // 1     , 14
    orl     %edx, %eax                  // 2     , 17
    testl   $4096, %eax                 // 3     , 22
    jnz     L(page_cross)
  • Latency : 4c
  • Throughput: ~1.75c measured (on Ice Lake, which has higher-throughput lea than older CPUs)
  • Bytes : 21b

It has 4c latency, which isn't so bad, but its throughput is worse and it has a much larger code footprint.
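
For reference, the same trick expressed in C (a sketch; the name is mine):

    // Adding LSIZE - 1 flips bit 12 of the address exactly when the first
    // and last byte of the access land in different 4 KiB pages; ORing the
    // two per-pointer results lets a single test/branch decide.
    int has_page_cross_exact(void *p1, void *p2) {
        uintptr_t a = (uintptr_t)p1, b = (uintptr_t)p2;
        uintptr_t cross = (a ^ (a + 32 - 1)) | (b ^ (b + 32 - 1));  // LSIZE = 32
        return (cross & 4096) != 0;
    }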

Question

  1. Can either of these approaches be improved in terms of latency, throughput, or bytes? Generally I am most interested in latency > throughput > bytes.

My general goal is to get the correct case as fast as the false-positive version.

Edit: Fixed bug in Correct version.

CPUs: Personally I am tuning for CPUs with AVX512, so Skylake Server, Icelake, and Tigerlake, but this question is targeted at the entire Sandybridge family.

Noah

  • If you're tuning for Ice Lake, note that a microcode update disabled mov-elimination for integer registers to work around the ICL065 erratum. https://www.realworldtech.com/forum/?threadid=200635&curpostid=200659. So a dependent OR can't run in the same cycle as MOV – Peter Cordes Apr 23 '21 at 09:58
  • Note that false positives aren't a correctness problem, so it's a tradeoff that depends on how much extra cost `foo_handling_page_cross` has, as well as the false-positive rate. I guess for fault-suppression of an AVX-512 masked store for strcpy, it's pretty significant for small copies, and critically *doesn't* fault in the page so it can happen repeatedly for copies into the same buffer. (Even if it was copy-on-write or otherwise read-only in the page tables, but still logically allocated by malloc.) – Peter Cordes Apr 23 '21 at 10:07
  • I think the correct solution needs to use `LSIZE - 1` for the addition (lea)? – stepan Apr 23 '21 at 11:15
  • @PeterCordes Icelake was just the benchmark CPU, not tuning for it specifically. Added edit. – Noah Apr 23 '21 at 14:42

1 Answer


With a single false positive at a % 4096 == 4096 - size, you could use this:

(~a & (4096 - size)) == 0

translates to assembly:

  not edi
  not esi
  test edi, (4096 - size)
  jz crosses_page_boundary
  test esi, (4096 - size)
  jz crosses_page_boundary
  (2 cycle latency)

Explanation: For size=32, we want to detect when the low 12 bits of the address are larger than 4096 - 32 = 4064 = 0b1111'1110'0000. A 12-bit value can be greater than or equal to 4064 only if all the bits that are set in 4064 (bits 5 through 11) are also set in it; the low 5 bits can be anything. (Testing >= rather than > is what produces the single false positive at exactly 4064.) We can't easily test whether all the specified bits are one, so we invert the bits and test whether they are all zero with test edi, (4096 - size).
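
The same test written out in C (illustrative sketch, size = 32):

    // ~a has bits 5..11 (the bits set in 4064) all clear exactly when a has
    // them all set, i.e. when a % 4096 >= 4064. The single false positive is
    // a % 4096 == 4064, where the access ends exactly on the page boundary
    // without crossing it.
    int crosses_not_variant(uintptr_t a) {
        return (~a & (4096 - 32)) == 0;
    }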


Note you could shift the false positive to a % 4096 == 0 (which I reckon is worse, since it's more likely to happen?) by using neg instead of not. Since -a = ~a + 1: when the low 12 bits of a are all zero, inverting makes them all one and the +1 carries through them back to zero, so the test fires at a % 4096 == 0; meanwhile a % 4096 == 4096 - size now negates to size, which no longer passes the test, hiding that false positive.
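
And the neg variant (again just a sketch, size = 32):

    // -a == ~a + 1. When a % 4096 == 0 the negation leaves the low 12 bits
    // zero, so the test fires (the new false positive); when a % 4096 == 4064
    // the negation yields 32, whose bit 5 makes the test fail, hiding the
    // old false positive.
    int crosses_neg_variant(uintptr_t a) {
        return ((0u - (uint32_t)a) & (4096 - 32)) == 0;
    }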

stepan

  • How efficiently could this be done with only one branch? I feel like it must be possible more efficiently than what ICC is doing when using `|` on two booleans (https://godbolt.org/z/z15a85qh3) (gcc and clang just branch twice), maybe with an `or` instruction somewhere, or somehow an `andn`? I guess just ORing the two pointers and checking if the resulting offset is near the end of a page would work, but that introduces a huge possibility of false positives. – Peter Cordes Apr 23 '21 at 14:09
  • @PeterCordes Maybe they can be combined with a shift? I.e something along the lines of (but more efficient than) [this](https://godbolt.org/z/Es9zzPvMn). The fact than shift will zero lower bits seems ripe for an `andn` but can't seem to make it work. – Noah Apr 23 '21 at 14:25
  • @stepan nice! that is really fast and faster / less code than the stuff in [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/questions/37800739/is-it-safe-to-read-past-the-end-of-a-buffer-within-the-same-page-on-x86-and-x64/37801845#37801845) for one page cross. A bit concerned about 2 branches though. – Noah Apr 23 '21 at 14:35
  • Thinking about this a bit more, I guess this solution trades the 1 false positive for either 1b of code or 2 more ports for the first instructions, as compared with `not` -> `andl` / `sall`. Guess it's pretty data-dependent whether that's a win. – Noah Apr 23 '21 at 14:53
  • @Noah test on both of them (after a shift) makes it "a crosses page boundary" *and* "b crosses page boundary". That seems to be the culprit for one branch :p – stepan Apr 23 '21 at 15:13