Basically I am looking to implement the following in x86_64 assembly as fast as possible. (Where foo and bar may be something like glibc's hand-written asm strcpy or strcmp, and we want to start out with wide vectors but without the safety and/or performance downsides of a page-split load when one isn't needed. Or of an AVX-512 masked store: fault suppression works for correctness but is slow if it has to actually suppress a fault in the destination.)
#include <immintrin.h>
#include <stdint.h>

#define TYPE __m256i

int foo_handling_page_cross(void * ptr1, void * ptr2);
int bar_with_no_page_cross(void * ptr1, void * ptr2);

int has_page_cross(void * ptr1, void * ptr2) {
    uint64_t ptr1_u64 = (uint64_t)ptr1;
    uint64_t ptr2_u64 = (uint64_t)ptr2;
    // Page offset of each pointer (4 KiB pages).
    ptr1_u64 &= 4095;
    ptr2_u64 &= 4095;
    // A sizeof(TYPE)-byte load crosses a page iff offset + size > 4096.
    if ((ptr1_u64 + sizeof(TYPE)) > 4096
        || (ptr2_u64 + sizeof(TYPE)) > 4096) {
        // There will be a page cross
        return foo_handling_page_cross(ptr1, ptr2);
    }
    return bar_with_no_page_cross(ptr1, ptr2);
}
There are a lot of really efficient ways to do this for one pointer, many of which are discussed in Is it safe to read past the end of a buffer within the same page on x86 and x64?, but there does not seem to be a particularly efficient approach for two pointers that doesn't sacrifice accuracy.
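For reference, the usual single-pointer version of the check from that question looks something like this (a minimal C sketch assuming 4 KiB pages; the function name is made up):

#include <stdint.h>

/* True if a size-byte load starting at ptr would cross a 4 KiB page. */
static inline int one_ptr_page_cross(const void * ptr, uint64_t size) {
    return (((uint64_t)ptr & 4095) + size) > 4096;
}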
Approaches

From here on out, assume ptr1 starts in rdi and ptr2 starts in rsi. Load size will be represented by the constant LSIZE.
Fast with False Positives
// cycles, bytes
movl %edi, %eax // 0 , 2 # assuming mov-elimination
orl %esi, %eax // 0 , 5 # which Ice Lake disabled
andl $4095, %eax // 1 , 10
cmpl $(4096 - LSIZE), %eax // 2 , 15
ja L(page_cross)
/* fewer bytes
movl %edi, %eax // 0 , 2
orl %esi, %eax // 1 , 5
sall $20, %eax // 2 , 8
cmpl $(4096 - LSIZE) << 20, %eax // 3 , 13
ja L(page_cross)
*/
- Latency : 3c
- Throughput: ~1.08c measured (both versions).
- Bytes : 13b
This approach is nice because it is fast with 3c latency (assuming the movl %edi, %eax is eliminated), has high throughput, and is pretty compact for the front end.

The obvious drawback is that it has false positives, e.g. rdi = 4000, rsi = 95: the OR of the two page offsets is 4095, which fails the check even though neither load actually crosses a page. I think, though, that its performance should serve as the goal for a fully correct solution.
Slower but Correct
This is the best I've been able to come up with:
// cycles, bytes
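// Trick: bit 12 of (ptr ^ (ptr + LSIZE - 1)) is set iff the
// LSIZE-byte load at ptr crosses into the next 4 KiB page
// (the add carries into bit 12 exactly on a cross).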
leal (LSIZE - 1)(%rdi), %eax // 0 , 4
leal (LSIZE - 1)(%rsi), %edx // 0 , 8
xorl %edi, %eax // 1 , 11
xorl %esi, %edx // 1 , 14
orl %edx, %eax // 2 , 17
testl $4096, %eax // 3 , 22
jnz L(page_cross)
- Latency : 4c
- Throughput: ~1.75c measured (note: measured on Ice Lake, which has higher lea throughput than older CPUs).
- Bytes : 21b
It has a 4c latency which isn't so bad, but its throughput is worse, and it has a much larger code footprint.
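For comparison with the straightforward version, here is the same carry-into-bit-12 idea in C (a sketch; page_cross_exact is a made-up name):

#include <stdint.h>

#define LSIZE 32  /* sizeof(__m256i) */

/* Exact: bit 12 of (p ^ (p + LSIZE - 1)) is set iff the LSIZE-byte
   load at p crosses a 4 KiB boundary (valid for LSIZE <= 4096). */
static inline int page_cross_exact(uint64_t p1, uint64_t p2) {
    uint64_t c1 = p1 ^ (p1 + LSIZE - 1);
    uint64_t c2 = p2 ^ (p2 + LSIZE - 1);
    return ((c1 | c2) & 4096) != 0;
}

This maps one-to-one onto the asm above: the two lea/xor pairs compute c1 and c2, orl merges them, and testl $4096 tests bit 12. Only bit 12 matters, which is why 32-bit registers suffice.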
Question
- Can either of these approaches be improved at all in terms of latency, throughput, or bytes? Generally I am most interested in latency > throughput > bytes.

My general goal is to get the correct version as fast as the false-positive one.
Edit: Fixed bug in Correct version.
CPUs: Personally I am tuning for CPUs with AVX512, so Skylake Server, Ice Lake, and Tiger Lake, but this question is targeted at the entire Sandy Bridge family.