
The asm function strlen receives a pointer to a string (a char array). To do so, the function may use SWAR on general-purpose registers, but without using xmm registers or SSE instructions.

Using the bit manipulation (v - 0x01010101) & (~v & 0x80808080), the function checks in 8-byte steps whether the word contains a zero byte, which marks the end of the string. If it does, it iterates byte by byte until the zero to avoid a page fault.
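
A minimal C sketch of that 8-byte check (the helper name is purely illustrative; the two magic constants are simply repeated across all eight bytes of the qword):

#include <stdint.h>
#include <stdio.h>

/* Returns non-zero iff at least one byte of v is 0x00. */
static int qword_has_zero_byte (uint64_t v)
{
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    /* A byte of v is zero exactly when its 0x80 bit is set in the result. */
    return ((v - ones) & (~v & highs)) != 0;
}

int main (void)
{
    printf ("%d\n", qword_has_zero_byte (0x1122334455667788ULL)); /* 0: no zero byte    */
    printf ("%d\n", qword_has_zero_byte (0x1122334455007788ULL)); /* 1: contains a zero */
    return 0;
}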

The alignment is handled like in this GNU libc implementation:

for (char_ptr = str; ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0; ++char_ptr){
    if (*char_ptr == '\0'){
        return char_ptr - str;
    }
}

Is there any way to make it faster?

; rdi <- const char * (pointer to the string)
; rax <- counter + return value
; r10 <- current qword under examination
; rcx, r8 <- bit masks
; rsi <- scratch register for the computation
            
strlen:
    PUSH rbp
    SUB rsp,8
 
    XOR rax,rax    
    MOV r8,31
alignment:
    CMP byte [rdi+rax],0
    JE end
    
    MOV rsi,rdi
    ADD rsi,rax
    AND rsi,r8
    CMP rsi,0
    JE while_parallel
    INC rax
    JMP alignment       
          
while_parallel:  
    MOV rcx,0x01010101
    MOV r8,0x80808080
while_parallel_loop:  
    MOV r10,[rdi+rax]
    MOV rsi,r10
    
    NOT r10
    AND r10,r8
    SUB rsi,rcx
    AND rsi,r10
    
    CMP rsi,0
    JNE while_single
    ADD rax,8
    JMP while_parallel_loop
    
while_single:
    CMP byte [rdi+rax],0
    JE  end
    INC rax
    JMP while_single    
end:
    ADD rsp,8
    POP rbp
    RET

Note that I do not intend to use any SSE instructions nor xmm registers.

Peter Cordes
HeapUnderStop
  • Since you're asking for a mere speed improvement, why don't you post this working code on codereview.stackexchange.com/questions/tagged/assembly ? – Sep Roland Jun 04 '23 at 17:28
  • @SepRoland: micro-optimization and optimization questions are on-topic here. Code-review can work to discuss all the missed optimizations (and bugs!) all over the place in this code, though. e.g. `0x01010101` is only a 4-byte constant, so it's going to ignore the high 4 bytes of each 8-byte qword it checks. – Peter Cordes Jun 04 '23 at 18:24
  • What use-cases are most important for this? Like short strings vs. long strings? Are unaligned start addresses common? (If so, a better strategy for avoiding page-crossings would be much better). I guess you can't assume support for BMI1 `andn`? I assume this is kernel code, since that's the usual reason to avoid XMM registers (x86-64 is guaranteed to have SSE2). Kernel code often runs infrequently, so the cold-cache case might be more important? I assume you want to mostly tune for modern AMD and Intel like Zen and Skylake and later, and the E-cores in Alder Lake? – Peter Cordes Jun 04 '23 at 18:39
  • Following from Peter's comment, make sure you understand why what you are attempting is needed. For strings less than several thousand characters there is little to gain. Whether you unroll to 4-bytes or 8-bytes per-chunk is likewise de minimis on all but very long strings (hundreds of thousands of characters [basically files]). If you are dealing with strings of 1000 chars or less, you are not going to buy yourself much regardless of how micro-optimized you make it. (but... there is value in making it the best it can be -- to say you did it `:)`) – David C. Rankin Jun 05 '23 at 03:13
  • @SepRoland: It's now been cross-posted to codereview: [Speed up strlen using SWAR in x86-64 assembly](https://codereview.stackexchange.com/q/285337) – Peter Cordes Jun 05 '23 at 07:09
  • @DavidC.Rankin: Lots of short `strlen` calls can add up, too, and be more likely than occasional huge strlen calls. There's no point going as slow as 1 byte at a time. Same for 4 vs. 8: if you're going to write a bithack version at all, the tiny extra cost from using wider constants (10-byte `mov r64, imm64` vs. 5-byte `mov r32, imm32`) will easily pay for itself on strings of length 16 or so, probably even 8 since it makes the branching more predictable, especially since njuffa points out you can use `bsf` on the bithack result, since the non-zero bit appears in the corresponding byte. – Peter Cordes Jun 05 '23 at 07:17
  • (If you're optimizing for Sapphire Rapids, you might consider `repne scasb` since [SPR has fast-short-rep-cmpsb-scasb](https://stackoverflow.com/questions/75309389/which-processors-support-fast-short-rep-cmpsb-and-scasb). But that would lead to garbage performance on older CPUs. But I don't know how fast `repne scasb` is on Alder Lake or Sapphire Rapids.) – Peter Cordes Jun 05 '23 at 07:18

1 Answer


The code from the question seems to have a defect: Mycroft's magic constants have not been extended to 64 bits. As for efficiency: the various x86-64 calling conventions are primarily register-based, so it is not necessary to maintain a stack frame for simple functions. The main loop in the posted code appears a bit too complex; note that most x86 instructions set flags that can be used for branching. Instead of processing byte-wise until reaching the next QWORD boundary from the start of the string, simply read the entire QWORD it falls in, and overwrite the bytes preceding the start of the string with 0xff. The resulting "masked" data can then be processed by the regular QWORD processing loop.
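
A rough C sketch of that start-of-string handling (illustrative only; it assumes a little-endian target and that reading the whole aligned qword containing the first byte is acceptable, as it is in hand-written assembly, even though strict ISO C does not sanction the out-of-object read):

#include <stdint.h>
#include <limits.h>

/* Illustrative sketch: load the aligned qword that contains the first byte of
   the string and force the bytes *before* the string to 0xff, so they cannot
   be mistaken for a terminating null byte. */
static uint64_t first_masked_qword (const char *src)
{
    uintptr_t ofs  = (uintptr_t)src & 7;                    /* offset into the qword  */
    uint64_t  word = *(const uint64_t *)(src - ofs);        /* aligned 8-byte load    */
    uint64_t  mask = ~(~(uint64_t)0 << (ofs * CHAR_BIT));   /* ones in the low bytes  */
    return word | mask;                                     /* preceding bytes = 0xff */
}

The masked qword can then be fed to the same zero-byte test as every other aligned qword, which is exactly what the alternate loop entry in the listing below does.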

Below is x86-64 code written for Windows that incorporates these suggestions. Adjusting it to the System V ABI used by Linux should require nothing more than a cyclical exchange of registers. Note that on CPUs with the BMI instruction set extension the 64-bit processing loop can be shortened by one instruction by merging the not rcx with the following and rcx, r9 into andn rcx, rcx, r9.

PUBLIC  strglen

_TEXT   SEGMENT

        ALIGN 16

;; Alan Mycroft's null-byte detection algorithm (newsgroup comp.lang.c, 1987/04/08,
;; https://groups.google.com/forum/#!original/comp.lang.c/2HtQXvg7iKc/xOJeipH6KLMJ):
;; null_byte(x) = ((x - 0x01010101) & (~x & 0x80808080))
 
mycroft_magic1 EQU 0101010101010101h
mycroft_magic2 EQU 8080808080808080h 

;; size_t strglen (const char *src);

;; Windows x86-64 calling convention:
;; function arguments: rcx, rdx, r8, r9
;; function return value: rax
;; scratch registers: rax, rcx, rdx, r8, r9, r10, r11

strglen PROC
        mov    rax, rcx             ; src pointer
        mov    r8, rcx              ; src pointer
        mov    r9, mycroft_magic2   ; 0x8080808080808080
        mov    r10, -mycroft_magic1 ; -0x0101010101010101
        and    rcx, 7               ; ofs = src & 7; src qword aligned ?
        jz     process_qwords       ; yes, ofs is zero, process qword wise

        sub    rax, rcx             ; src -= ofs
        shl    rcx, 3               ; ofs * CHAR_BIT
        mov    rdx, -1              ; ~0
        shl    rdx, cl              ; ~0 << (ofs * CHAR_BIT)
        mov    rcx, qword ptr [rax] ; load aligned qword
        not    rdx                  ; mask = ~(~0 << (ofs * CHAR_BIT))
        or     rcx, rdx             ; set unused bytes to 0xff
        jmp    process_qwords_alt   ; use alternate loop entry

        ALIGN 16

process_qwords:
        mov    rcx, qword ptr [rax] ; load aligned qword
process_qwords_alt:
        add    rax, 8               ; src += 8
        lea    rdx, [rcx + r10]     ; qword - 0x0101010101010101
        not    rcx                  ; ~qword
        and    rcx, r9              ; ~qword & 0x8080808080808080
        and    rcx, rdx             ; (qword - 0x0101010101010101) & (~qword & 0x8080808080808080)
        jz     process_qwords       ; if zero, no null byte found, continue

        bsf    rcx, rcx             ; find first set bit (null byte)
        shr    rcx, 3               ; convert bit position to byte position
        lea    rax, [rax + rcx - 8] ; reverse pre-increment to compute byte count
        sub    rax, r8              ; src - original src = length
        ret

strglen ENDP

        ALIGN 16

_TEXT   ENDS

        END
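
For readers who prefer to see the tail computation in C: once the bithack result is non-zero, the bit index of its lowest set bit divided by eight identifies the terminating null byte, which is what the bsf/shr pair above exploits. A sketch using the GCC/Clang builtin __builtin_ctzll (purely illustrative, not part of the assembly above; all names are made up):

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: 'hit' is the non-zero bithack result for the qword that
   was loaded at 'qword_addr'; only 0x80 bits can be set in it, and the lowest
   one sits in the byte holding the terminating NUL. */
static size_t length_from_hit (const char *src, const char *qword_addr, uint64_t hit)
{
    size_t byte_in_qword = (size_t)__builtin_ctzll (hit) >> 3;  /* bsf, then shr by 3  */
    return (size_t)(qword_addr - src) + byte_in_qword;          /* total string length */
}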

I used the ISO-C99 test scaffolding below to check the correctness of the strglen implementation above. I built with the Microsoft Macro Assembler and the Intel C/C++ compiler, as follows: ml64 -c strglen.asm, icl /Qstd=c99 /Ox /W4 strlen.c strglen.obj. I also inspected the generated machine code with dumpbin /disasm strglen.obj, mostly to make sure the alignment directives work as desired and that loop unrolling via the REPT macro works correctly, as I had not used that in the past twenty years.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern size_t strglen (const char* str);

int main (void)
{
    const char a[] = "0123456789 the quick brown fox jumps over the lazy dog";
    char* src =  malloc (sizeof(a));
    size_t ref, res;

    printf ("src=%p\n", a);

    for (int srcofs = 0; srcofs < 8; srcofs++) {
        for (size_t len = 0; len < sizeof(a); len++) {
            memcpy (src, a, sizeof(a));
            src [len] = 0;
            ref = strlen (src + srcofs);
            res = strglen (src + srcofs);
            if (res != ref) {
                printf ("error @ srcofs=%d  len=%zu  ref=%zu  res=%zu\n", 
                        srcofs, len, ref, res);
                return EXIT_FAILURE;
            }
        }
    }
    printf ("Test passed\n");
    return EXIT_SUCCESS;
}
njuffa
  • The byte-at-a-time startup can be avoided for inputs that aren't within 7 bytes of the end of a page. As in glibc's strlen (https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/strlen-avx2.S.html#66), `if (p & (PAGE_SIZE - 1) <= (PAGE_SIZE-VEC_SIZE)){ load and check VEC_SIZE bytes; p &= -VEC_SIZE; }` then start the aligned loop. i.e. check the offset-within-page bits for being not near the end of a page. If they are, you can load the last 8 bytes of the page and bit-shift to get the bits you wanted. `p - (p&7)` or `(p|7) - 7` or just `p & -8`. Shift count = `8*(p&7)` I think. – Peter Cordes Jun 05 '23 at 01:09
  • I think my comment translated that (well known?) idea from their asm into a tiny amount of C, so you can get the idea without having to look at the original. Otherwise I leave it as an exercise for the reader. Or I'd guess you might find similar code in most good strlen implementations, including *BSDs and MacOS libc and/or kernel, with non-GPL open-source licenses. If I do end up writing an answer showing the idea of an unaligned first "vector" that can overlap with the start of the aligned loop, I'll let you know. – Peter Cordes Jun 05 '23 at 01:25
  • (I was trying to think of a way to optimize setting up the constants, since 10-byte `movabs reg, imm16` is slower to fetch from the uop-cache, and takes a lot of code-fetch bandwidth if not pre-decoded. `8080808080808080h` rotated left once is `0101010101010101h`, so `rorx r10, r9, 63` / `neg r10? But that's more uops and nearly 10 bytes. If we didn't need one of them negated, BMI2 `rorx` could be worth it. Or `lea r10, [r9+r9+1]` to get `01...h` from `80...h` in fewer bytes without BMI2, but still not negated. – Peter Cordes Jun 05 '23 at 01:32
  • @PeterCordes perhaps distribute the `~x & 0x80` into `~(x | 0x7f)` to get the other constant negated first, then compute the first from that with LEA? This also shortens the critical path. so win-win! – amonakov Jun 05 '23 at 06:12
  • Yeah, looks good to me. Would upvote again if I could. Thanks for taking the time to show what this would look like. Cool that you don't even need separate cleanup since the pointer math in the length calc just works even if you've gone a negative distance for the first load. Starting from the aligned qword containing the first byte of the string and forcing the extra bytes to non-zero makes more sense than my initial idea of shifting them to the bottom of a register (which would shift in zeros that you'd have to ignore, or require SHRD to shift in ones). – Peter Cordes Jun 06 '23 at 13:39
  • There's some room for future readers to play with micro-optimizing this for different common cases or CPUs. e.g. if aligned buffers are common, the aligned-load + shift block could be out of line, after the `ret`, so the aligned case can just fall through into the loop with no taken branches. And there are other ways to make a mask like xor-zero / `bts`/`neg` that might be faster on Intel CPUs than shifting a `-1`, since modern Intel runs `bts reg,reg` as 1 uop (vs. 2 for AMD), and `shr reg, cl` as 3 uops (vs. 1 for AMD). BMI2 `shrx` is 1 uop on Intel, and there's `bzhi`... – Peter Cordes Jun 06 '23 at 13:45