Assembly strlen AVX512BW: optimize and speed up

Question

This is my assembly source code for strlen using AVX512BW:

strlen_avx512:

        mov     rax, rdi

        test    al, 63          ; aligned ?
        jz      .aligned_str

        vmovdqu64       zmm0, zword [rax]       ; unaligned load
        vptestnmb       k0, zmm0, zmm0
        kortestq        k0, k0
        jz              .do_align_64

        kmovq   rcx, k0
        tzcnt   rax, rcx
        vzeroupper
        ret

        .do_align_64:
        add     rax, 63
        and     rax, -64

        .aligned_str:
        vmovdqa64       zmm0, ZWORD [rax]
        vmovdqa64       zmm1, ZWORD [rax+64]
        vmovdqa64       zmm2, ZWORD [rax+128]
        vmovdqa64       zmm3, ZWORD [rax+192]

        vpminub zmm4, zmm0, zmm1
        vpminub zmm5, zmm2, zmm3
        vpminub zmm5, zmm5, zmm4

        vptestnmb       k0, zmm5, zmm5  ; 0x00 ?
        kortestq        k0, k0
        jnz             .done

        add     rax, 256
        jmp     .aligned_str

        .done:
        sub     rax, rdi        ; rax = offset of this 256-byte block from the start (operand sizes must match)

        vptestnmb       k0, zmm0, zmm0
        kortestq        k0, k0
        jnz             .end

        vptestnmb       k0, zmm1, zmm1
        kortestq        k0, k0
        jnz             .end1

        vptestnmb       k0, zmm2, zmm2
        kortestq        k0, k0
        jnz             .end2

        vptestnmb       k0, zmm3, zmm3  ; zmm0..zmm2 had no zero, so it is in the 4th vector (zmm4 is only min(zmm0, zmm1))

        add     rax, 192

        .end:
        kmovq   rcx, k0
        tzcnt   rcx, rcx
        add     rax, rcx
        vzeroupper
        ret
        .end1:
        add     rax, 64
        jmp     .end
        .end2:
        add     rax, 128
        jmp     .end

This function works without any problems, but it doesn't give me the speed I expected. I wrote an AVX2 version of this function with ymm registers (also using vpminub, same as this function) and it was very fast: calling it 1000000 times took 4 s. When I call this AVX-512 version 1000000 times, the execution time is 3 s (2.9 s). I expected something like 2 s, but it is only about 1.5 times faster, not 2x faster.

1 - I think this function needs some optimization. Is there anything else I can do to speed it up?

2 - Another question: why vzeroupper? I generated some AVX-512 code with gcc using the '-march=skylake-avx512' flag, and gcc put vzeroupper in the code, so I added it to my source code too. But why is it needed?

3 - One more question about this function: I saw some functions that do a 'cross_page check'. Is there anything else I have to check in this function (anything about page checks and so on)?
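
For context on the cross-page question: one guard seen in published strlen implementations is to test whether a full-width load starting at the pointer would run past the end of the current page, and only then take a slower, careful path. The following is a minimal sketch in C rather than assembly, assuming 4 KiB pages and a 64-byte (ZMM) load. The names `PAGE_SIZE`, `VEC_SIZE`, and `load_crosses_page` are hypothetical and do not appear in the code above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumption: 4 KiB pages */
#define VEC_SIZE  64u    /* width of one ZMM load in bytes */

/* Returns true if a VEC_SIZE-byte load starting at p would touch
 * bytes in the next page, which may be unmapped. */
static bool load_crosses_page(const void *p)
{
    uintptr_t offset_in_page = (uintptr_t)p & (PAGE_SIZE - 1);
    return offset_in_page > PAGE_SIZE - VEC_SIZE;
}
```

In the listing above, the equivalent test would sit before the unaligned `vmovdqu64` load; the aligned loads in the main loop cannot cross a page because 4096 is a multiple of 64.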

_"why vzeroupper ?"_ See section 6.3 in Agner Fog's [calling conventions document](https://www.agner.org/optimize/calling_conventions.pdf). — Michael, Apr 09 '20 at 12:50
`vmovdqu64 zmm0, [rax]` - that will fault if you pass it a pointer to a short string right before the end of a page. You could handle that with AVX512 masking (which suppresses faults) or by doing an aligned load and then shuffling to discard bytes from before the start of the string. Or if you never need to use this on strings that might end within 64 bytes of the end of a page, you can skip that overhead. And return a pointer to the end instead of length, if you want. — Peter Cordes, Apr 09 '20 at 12:52
What size strings did you test with? Obviously if you bottleneck on memory or L3-cache bandwidth, doubling the vector width won't help much. Or if your strings are short (like most strings in most programs are), the terminator will be in the first 32 bytes, or the loop will only run 1 iteration whether it does 4x32 or 4x64 bytes. If you need a `strlen` optimized for long strings, if possible use explicit-length strings that track their own length and don't need scanning. — Peter Cordes, Apr 09 '20 at 13:00
Also, 512-bit uops reduce max turbo, and shut down port 1 on Skylake-avx512. [SIMD instructions lowering CPU frequency](https://stackoverflow.com/a/56861355) and [Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?](https://stackoverflow.com/q/58568514) - some of that also explains vzeroupper interaction. — Peter Cordes, Apr 09 '20 at 13:01
I created a 1000000-byte string and filled it in a loop like this: `str[strlen_avx512(str)] = 'h';` — ELHASKSERVERS, Apr 09 '20 at 13:02
About the second comment: can you give me an example based on my code (resolving that page fault problem)? — ELHASKSERVERS, Apr 09 '20 at 13:20
Didn't see your reply because you didn't @ me. Re: handling that initial startup. Look at any optimized strlen, like glibc's, Agner Fog's, presumably MacOS / FreeBSD's, or a few other variants people have published that have source floating around. Or since it's asm, you can even disassemble closed source ones for ideas. They all have to solve this problem (which is part of why implicit-length strings suck for SIMD when you have to be compatible with code that can't guarantee padding after the terminator. Optimizing this can be a good learning exercise, but really try to use strlen less.) — Peter Cordes, Apr 10 '20 at 00:16
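
The "aligned load, then discard bytes from before the start of the string" idea from the comments can be modelled in scalar C. This is only a sketch of the control flow, not an AVX-512 implementation: `zero_byte_mask` is a hypothetical stand-in for `vptestnmb` + `kmovq`, the right shift plays the role of discarding the leading bytes, and `ctz64` stands in for `tzcnt`. It assumes the string lives in a buffer whose 64-byte-aligned blocks are fully readable, which is what an aligned vector load guarantees in hardware but plain C does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_SIZE 64u  /* bytes per ZMM register */

/* Scalar stand-in for vptestnmb + kmovq: bit i is set iff block[i] == 0. */
static uint64_t zero_byte_mask(const unsigned char *block)
{
    uint64_t mask = 0;
    for (unsigned i = 0; i < VEC_SIZE; i++)
        if (block[i] == 0)
            mask |= (uint64_t)1 << i;
    return mask;
}

/* Scalar stand-in for tzcnt on a non-zero value. */
static unsigned ctz64(uint64_t x)
{
    assert(x != 0);
    unsigned n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Model of a strlen whose first load is aligned, so it never crosses a page. */
size_t strlen_model(const char *s)
{
    uintptr_t addr = (uintptr_t)s;
    /* Round down to the 64-byte block that contains the first byte. */
    const unsigned char *block =
        (const unsigned char *)(addr & ~(uintptr_t)(VEC_SIZE - 1));
    unsigned skip = (unsigned)(addr & (VEC_SIZE - 1));

    /* Shift out mask bits for bytes that lie before the string. */
    uint64_t mask = zero_byte_mask(block) >> skip;
    if (mask != 0)
        return ctz64(mask);

    size_t len = VEC_SIZE - skip;  /* bytes of the string already checked */
    for (;;) {
        block += VEC_SIZE;
        mask = zero_byte_mask(block);
        if (mask != 0)
            return len + ctz64(mask);
        len += VEC_SIZE;
    }
}
```

In assembly, the shift would typically be a `shrx` (or `shr` by `cl`) on the mask after `kmovq`, with the shift count taken from the low 6 bits of the pointer. The other option mentioned in the comments, a masked load that suppresses faults, is not modelled here.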
