
I decided to write a string-length function in assembly (using FASM). The function takes a string (whether or not it is 8-byte aligned) and checks its alignment. If it is aligned to 8 bytes, the main loop starts right away. Otherwise the first up-to-8 characters are checked one by one, then the pointer is rounded up to the next 8-byte boundary and the loop takes over. There is no "end of the memory page" problem, because once the pointer is 8-byte aligned, an 8-byte load can never cross a page boundary (pages are 4 KiB, a multiple of 8).

The problem: I also implemented a C version, compiled it, and now I have two assembly listings, the one I wrote by hand and the one GCC generated from the C source. The C version is up to 1.5x faster than my handwritten assembly! As far as I can tell everything in my code is fine; I even aligned the jump targets to 16 bytes, and no NOP padding is ever executed (except one block outside the loop, between .align8 and .loop, which should be negligible). I can't figure out why my pure assembly code is 1.5x slower than the GCC version.

My assembly source code:

 align 16
slen:
        mov     r8, rcx
        test    cl, 7
        jz      .loop
        xor     eax, eax
        cmp     BYTE [rcx], al
        je      SHORT .ret
        cmp     BYTE [rcx+1], al
        je      SHORT .ret1
        cmp     BYTE [rcx+2], al
        je      SHORT .ret2
        cmp     BYTE [rcx+3], al
        je      SHORT .ret3
        cmp     BYTE [rcx+4], al
        je      SHORT .ret4
        cmp     BYTE [rcx+5], al
        je      SHORT .ret5
        cmp     BYTE [rcx+6], al
        je      SHORT .ret6
        cmp     BYTE [rcx+7], al
        jne     SHORT .align8
        mov     al, 7
        ret
 align 16
 .ret:  ret
 align 16
 .ret1: mov     al, 1
        ret
 align 16
 .ret2: mov     al, 2
        ret
 align 16
 .ret3: mov     al, 3
        ret
 align 16
 .ret4: mov     al, 4
        ret
 align 16
 .ret5: mov     al, 5
        ret
 align 16
 .ret6: mov     al, 6
        ret
 align 16
 .align8:
        lea     rcx, [rcx+7]
        and     rcx, (-8)
 align 16
 .loop: mov     rax, QWORD [rcx]
        test    al, al
        jz      SHORT .end
        test    ah, ah
        jz      SHORT .end.1
        test    eax, 0x00ff0000
        jz      SHORT .end.2
        test    eax, 0xff000000
        jz      SHORT .end.3
        shr     rax, 32
        test    al, al
        jz      SHORT .end.4
        test    ah, ah
        jz      SHORT .end.5
        test    eax, 0x00ff0000
        jz      SHORT .end.6
        test    eax, 0xff000000
        jz      SHORT .end.7
        add     rcx, 8
        jmp     SHORT .loop
 align 16
 .end: mov      rax, rcx
        sub     rax, r8
        ret
 align 16
 .end.1:
        lea     rax, [rcx+1]
        sub     rax, r8
        ret
 .end.2:
        lea     rax, [rcx+2]
        sub     rax, r8
        ret
 .end.3:
        lea     rax, [rcx+3]
        sub     rax, r8
        ret
 .end.4:
        lea     rax, [rcx+4]
        sub     rax, r8
        ret
 .end.5:
        lea     rax, [rcx+5]
        sub     rax, r8
        ret
 .end.6:
        lea     rax, [rcx+6]
        sub     rax, r8
        ret
 .end.7:
        lea     rax, [rcx+7]
        sub     rax, r8
        ret       

The GCC version:

 align 16
slen:
        test    cl, 7
        je      .L18
        xor     eax, eax
        cmp     BYTE [rcx], 0
        je      .L1
        cmp     BYTE [rcx+1], 0
        mov     eax, 1
        je      .L1
        cmp     BYTE [rcx+2], 0
        mov     eax, 2
        je      .L1
        cmp     BYTE [rcx+3], 0
        mov     eax, 3
        je      .L1
        cmp     BYTE [rcx+4], 0
        mov     eax, 4
        je      .L1
        cmp     BYTE [rcx+5], 0
        mov     eax, 5
        je      .L1
        cmp     BYTE [rcx+6], 0
        mov     eax, 6
        je      .L1
        cmp     BYTE [rcx+7], 0
        mov     eax, 7
        je      .L1
        lea     rax, [rcx+7]
        and     rax, -8
        jmp     .L47
 align 16
.L18:
        mov     rax, rcx
        jmp     .L47
 align 16
.L40:
        test    dh, dh
        je      .L49
        test    edx, 16711680
        je      .L50
        test    edx, 4278190080
        je      .L51
        shr     rdx, 32
        test    dl, dl
        je      .L52
        test    dh, dh
        je      .L53
        test    edx, 16711680
        je      .L54
        test    edx, 4278190080
        je      .L55
        add     rax, 8
.L47:
        mov     rdx, QWORD [rax]
        test    dl, dl
        jne     .L40
        sub     eax, ecx
.L1:
        ret
 align 16
.L49:
        sub     rax, rcx
        add     eax, 1
        ret
 align 16
.L50:
        sub     rax, rcx
        add     eax, 2
        ret
 align 16
.L51:
        sub     rax, rcx
        add     eax, 3
        ret
 align 16
.L52:
        sub     rax, rcx
        add     eax, 4
        ret
 align 16
.L53:
        sub     rax, rcx
        add     eax, 5
        ret
 align 16
.L54:
        sub     rax, rcx
        add     eax, 6
        ret
 align 16
.L55:
        sub     rax, rcx
        add     eax, 7
        ret   

My function's test result:

string length => 336
loop execution times => 10000000
total execution time => 0.772015

GCC function's test result:

string length => 336
loop execution times => 10000000
total execution time => 0.522015

What is the problem? Why is my function 1.5x slower when everything looks fine? My test string is 8-byte aligned, so the initial byte-by-byte check and the alignment step are skipped anyway.

Is there a problem with my label alignment, or does the problem come from somewhere else?

ABI: x64 (Windows)

Test CPU: Intel Core i7-7800X (Skylake-X)

My C test application source code:

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

unsigned int
slen_by_me(const char *);

unsigned int
slen_gcc(const char *);

int main() {
    static const char *str="WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW";
    LARGE_INTEGER frequency;
    LARGE_INTEGER start;
    LARGE_INTEGER end;
    double interval;
    unsigned int l = 0;

    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&start);

    for (int i = 0; i < 10000000; i++) {
        l += slen_gcc(str);
    }

    QueryPerformanceCounter(&end);
    interval = (double) (end.QuadPart - start.QuadPart) / frequency.QuadPart;

    printf("%f\n%u\n", interval, l);
    return 0;
}

The FASM source for the object file (containing both slen functions) that gets linked into the C tester:

format MS64 COFF

public slen_gcc
public slen_by_me

section '.text' code readable executable align 64

 align 16
slen_gcc:
        test    cl, 7
        je      .L18
        xor     eax, eax
        cmp     BYTE [rcx], 0
        je      .L1
        cmp     BYTE [rcx+1], 0
        mov     eax, 1
        je      .L1
        cmp     BYTE [rcx+2], 0
        mov     eax, 2
        je      .L1
        cmp     BYTE [rcx+3], 0
        mov     eax, 3
        je      .L1
        cmp     BYTE [rcx+4], 0
        mov     eax, 4
        je      .L1
        cmp     BYTE [rcx+5], 0
        mov     eax, 5
        je      .L1
        cmp     BYTE [rcx+6], 0
        mov     eax, 6
        je      .L1
        cmp     BYTE [rcx+7], 0
        mov     eax, 7
        je      .L1
        lea     rax, [rcx+7]
        and     rax, -8
        jmp     .L47
 align 16
.L18:
        mov     rax, rcx
        jmp     .L47
 align 16
.L40:
        test    dh, dh
        je      .L49
        test    edx, 16711680
        je      .L50
        test    edx, 4278190080
        je      .L51
        shr     rdx, 32
        test    dl, dl
        je      .L52
        test    dh, dh
        je      .L53
        test    edx, 16711680
        je      .L54
        test    edx, 4278190080
        je      .L55
        add     rax, 8
.L47:
        mov     rdx, QWORD [rax]
        test    dl, dl
        jne     .L40
        sub     eax, ecx
.L1:
        ret
 align 16
.L49:
        sub     rax, rcx
        add     eax, 1
        ret
 align 16
.L50:
        sub     rax, rcx
        add     eax, 2
        ret
 align 16
.L51:
        sub     rax, rcx
        add     eax, 3
        ret
 align 16
.L52:
        sub     rax, rcx
        add     eax, 4
        ret
 align 16
.L53:
        sub     rax, rcx
        add     eax, 5
        ret
 align 16
.L54:
        sub     rax, rcx
        add     eax, 6
        ret
 align 16
.L55:
        sub     rax, rcx
        add     eax, 7
        ret

 align 16
slen_by_me:
        mov     r8, rcx
        test    cl, 7
        jz      .loop
        xor     eax, eax
        cmp     BYTE [rcx], al
        je      SHORT .ret
        cmp     BYTE [rcx+1], al
        je      SHORT .ret1
        cmp     BYTE [rcx+2], al
        je      SHORT .ret2
        cmp     BYTE [rcx+3], al
        je      SHORT .ret3
        cmp     BYTE [rcx+4], al
        je      SHORT .ret4
        cmp     BYTE [rcx+5], al
        je      SHORT .ret5
        cmp     BYTE [rcx+6], al
        je      SHORT .ret6
        cmp     BYTE [rcx+7], al
        jne     SHORT .align8
        mov     al, 7
        ret
 align 16
 .ret:  ret
 align 16
 .ret1: mov     al, 1
        ret
 align 16
 .ret2: mov     al, 2
        ret
 align 16
 .ret3: mov     al, 3
        ret
 align 16
 .ret4: mov     al, 4
        ret
 align 16
 .ret5: mov     al, 5
        ret
 align 16
 .ret6: mov     al, 6
        ret
 align 16
 .align8:
        lea     rcx, [rcx+7]
        and     rcx, (-8)
 align 16
 .loop: mov     rax, QWORD [rcx]
        test    al, al
        jz      SHORT .end
        test    ah, ah
        jz      SHORT .end.1
        test    eax, 0x00ff0000
        jz      SHORT .end.2
        test    eax, 0xff000000
        jz      SHORT .end.3
        shr     rax, 32
        test    al, al
        jz      SHORT .end.4
        test    ah, ah
        jz      SHORT .end.5
        test    eax, 0x00ff0000
        jz      SHORT .end.6
        test    eax, 0xff000000
        jz      SHORT .end.7
        add     rcx, 8
        jmp     SHORT .loop
 align 16
 .end: mov      rax, rcx
        sub     rax, r8
        ret
 align 16
 .end.1:
        lea     rax, [rcx+1]
        sub     rax, r8
        ret
 .end.2:
        lea     rax, [rcx+2]
        sub     rax, r8
        ret
 .end.3:
        lea     rax, [rcx+3]
        sub     rax, r8
        ret
 .end.4:
        lea     rax, [rcx+4]
        sub     rax, r8
        ret
 .end.5:
        lea     rax, [rcx+5]
        sub     rax, r8
        ret
 .end.6:
        lea     rax, [rcx+6]
        sub     rax, r8
        ret
 .end.7:
        lea     rax, [rcx+7]
        sub     rax, r8
        ret

And the C version of slen:

int
slen(const char *str) {
    const char *start=str;
    if(((unsigned long long)str & 7) != 0) {
        if(str[0] == 0x00)
            return 0;
        if(str[1] == 0x00)
            return 1;
        if(str[2] == 0x00)
            return 2;
        if(str[3] == 0x00)
            return 3;
        if(str[4] == 0x00)
            return 4;
        if(str[5] == 0x00)
            return 5;
        if(str[6] == 0x00)
            return 6;
        if(str[7] == 0x00)
            return 7;
        str=(const char *)(((unsigned long long)str + 7) & (-8));
    }
    do {
        unsigned long long bytes=(*(unsigned long long*)(str));
        if((unsigned char)bytes==0x00)
            return (int)(str-start);
        if((bytes & 0x0000ff00)==0)
            return (int)(str-start+1);
        if((bytes & 0x00ff0000)==0)
            return (int)(str-start+2);
        if((bytes & 0xff000000)==0)
            return (int)(str-start+3);
        bytes >>= 32;
        if((unsigned char)bytes==0x00)
            return (int)(str-start+4);
        if((bytes & 0x0000ff00)==0)
            return (int)(str-start+5);
        if((bytes & 0x00ff0000)==0)
            return (int)(str-start+6);
        if((bytes & 0xff000000)==0)
            return (int)(str-start+7);
        str+=8;
    } while (1);
}
                
  • What CPU did you run this on? If it's a version of Skylake, [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) is certainly possible. And/or it could be from branches predicting worse with a higher density of branch instructions. But no point in thinking too hard about possible microarchitectural effects until you tell us what microarchitecture! – Peter Cordes May 24 '23 at 15:39
  • @PeterCordes Intel 7800X – HelloGUI May 24 '23 at 15:41
  • Ok, that's a Skylake. Try compiling with `gcc -O3 -Wa,-mbranches-within-32B-boundaries`. And check with `perf` for `idq.mite_uops` vs. `idq.dsb_uops` against front-end uops (`uops_issued.any`). Also high counts for `resource_stalls.any` will tell you if the bottleneck was the back-end. Low counts but less than 4 uops per clock means the bottleneck was the front-end. – Peter Cordes May 24 '23 at 15:47
  • Can you post the caller so you have a [mcve] of the benchmark? – Peter Cordes May 24 '23 at 15:52
  • @PeterCordes Compiling my test application or the `slen` in C version? I created an `.obj` in assembly (with my `slen` version and GCC `slen` version), then I wrote a C program for testing and I linked that object to it to access those 2 functions. – HelloGUI May 24 '23 at 15:52
  • If I wanted to try this on my own Skylake system, the C caller would help. So would the C source you compiled; I guess the "gcc asm" in your question is FASM disassembly of an object file? Or you ported the `.s` by hand and that's why there's GCC's `.L1` etc. numbered labels? Anyway, an MCVE would be something I could compile + assemble and run on my machine, after minor tweaks for calling convention differences (or I guess I could declare it with `__attribute__((ms_abi))` for Linux GCC. – Peter Cordes May 24 '23 at 15:56
  • @PeterCordes I updated the question with full sources. – HelloGUI May 24 '23 at 15:58
  • BTW, for any future readers, this is an interesting experiment in branch layout, but *not* a useful high-performance `strlen`. ([Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/q/57650895)) An efficient implementation would use SSE2 to check 16 bytes at a time, or would use some version of a bithack (https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord) to check for any zero bytes within a 32-bit or 64-bit word. – Peter Cordes May 24 '23 at 16:02
  • @PeterCordes Yes, but it's about how I'm coding in Assembly ... I know everything about SIMD, but I'm fixing my code style .... I shocked when I see these results !!!!!! – HelloGUI May 24 '23 at 16:07
  • Note that I addressed my comment to future readers, not you. Have you tried avoiding JCC erratum penalties in your asm, or checking if GCC happens to? – Peter Cordes May 24 '23 at 16:47
  • @PeterCordes I don't know what is `JCC erratum penalities` ... I'm searching ... – HelloGUI May 24 '23 at 17:12
  • I linked [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) in my first comment. It has links to more details, specifically https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf . IDK if FASM has any options for automatically padding instructions to avoid it, so you might want to port your code to GNU assembler instead of porting GCC's asm to FASM. – Peter Cordes May 24 '23 at 17:16
  • @PeterCordes How GNU assembler handles this situation ? Is there a special directive ? – HelloGUI May 24 '23 at 17:50
  • Did you read my answer on [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646)? Yes, there is, `-mbranches-within-32B-boundaries`. I linked it for a reason, you know. – Peter Cordes May 24 '23 at 17:54
  • @PeterCordes Yes I read but since I know just a little about `GNU assembler`, I thought it might be for `C` purposes .... Thank you, I'm working on it .... – HelloGUI May 24 '23 at 17:56
  • @PeterCordes Wow ! In my code, from middle of loop after `shr rax, 32`, `test` and its `jz` are not in same 32 bytes boundaries. I pushed `test` forward by defining 2 bytes (0x2e) before `test` and now the result is `0.642015` instead of `0.772015` xDxDxDxD. `shr rax, 32 db 0x2e,0x2e test al, al jz SHORT .end.4` – HelloGUI May 24 '23 at 20:33

2 Answers

3

Allow me to refer you to one of my pure-assembly library functions (coming soon). Your question is about strlen, which is named "str_length" in my library and is implemented for both the Microsoft x64 ABI and the System V AMD64 ABI.

I remember that a few years ago a C/C++ function of this kind was making the rounds as a string-length calculator:

size_t my_strlen(const char *s) {
    size_t len = 0;
    for(;;) {
        unsigned x = *(unsigned*)s;
        if((x & 0xFF) == 0) return len;
        if((x & 0xFF00) == 0) return len + 1;
        if((x & 0xFF0000) == 0) return len + 2;
        if((x & 0xFF000000) == 0) return len + 3;
        s += 4, len += 4;
    }
}

Even named "FAST strlen" which it's really not that fast. So, i decided to write my own "FAST strlen" in Assembly.

In x86-64 it's possible to load an 8-byte chunk into a 64-bit register, so why load only 4 bytes at a time, as my_strlen above does?
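
As a side note, the same widening can be expressed portably in C: load an 8-byte chunk with memcpy into a uint64_t, which GCC and Clang typically compile to a single 64-bit mov. A minimal sketch, with a made-up helper name:

#include <stdint.h>
#include <string.h>

/* Load 8 bytes from p without a strict-aliasing cast; optimizing
   compilers turn the fixed-size memcpy into one 64-bit load. */
static inline uint64_t load8(const char *p) {
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

The loop body would then test the eight bytes of the loaded word, exactly as the slen() in the question does, instead of four bytes per iteration.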

JCC Erratum

Regarding the JCC erratum: there are still plenty of Skylake-family CPUs in the world (and by world I mostly mean data centers; look at Hetzner's fleet and you will find lots of Skylake and older CPUs). Handling it is not optional; you must take care of it. But it's also important to handle it without adding NOPs or extra prefixes, because those create new costs on other CPUs. You can handle it by creating small extra branches and moving some code into a fresh 32-byte chunk (but don't overdo it).
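
For reference, the condition described in Intel's mitigation paper is that a jump instruction (or a cmp/test macro-fused with its jcc, treated as one instruction) is affected when it crosses a 32-byte boundary or when its last byte ends on one; the microcode update then keeps such code out of the uop cache. A tiny illustrative helper, assuming you know each instruction's offset and encoded length:

#include <stdbool.h>
#include <stddef.h>

/* offset: start of the (possibly macro-fused) jump within the code section
   len:    total encoded length in bytes (cmp/test + jcc when fused)
   Returns true when the instruction crosses a 32-byte boundary or ends on one. */
static bool touches_32B_boundary(size_t offset, size_t len) {
    return (offset / 32) != ((offset + len) / 32);
}

/* Example: a 3-byte test at offset 0x1e fused with a 2-byte jcc occupies
   bytes 0x1e..0x22, so touches_32B_boundary(0x1e, 5) returns true. */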

TAKE CARE OF LOOP TAIL JUMP

Another subject is the loop's tail jump. Give your loop a tail and enter it by jumping to that tail (the load-and-test at the bottom), instead of executing an unconditional jmp at the end of every iteration; this keeps the branches cheap and predictable (read Agner Fog's optimization documents on this; I love that guy for no reason xD). As Mr. Peter Cordes mentioned, if you look at how GCC arranges its jumps (it enters the loop at .L47, where the load and first test live), you can see this approach to loop structure; a rough C equivalent is sketched below.
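
To make that layout concrete, here is a rough C equivalent; it is only a sketch (the function name is made up, the load is the same type-punned 8-byte read used elsewhere on this page, and it assumes the pointer is already 8-byte aligned). The load and the first byte test sit at the bottom of the loop, so each iteration ends with a single conditional branch instead of an extra unconditional jmp:

#include <stddef.h>
#include <stdint.h>

size_t slen_rotated(const char *s) {        /* sketch; s assumed 8-byte aligned */
    const char *p = s;
    uint64_t w;
    goto tail;                              /* enter the loop at its tail */
    for (;;) {
        /* byte 0 of w was non-zero; check the remaining bytes 1..7 */
        if ((w & 0x000000000000ff00ull) == 0) return (size_t)(p - s) + 1;
        if ((w & 0x0000000000ff0000ull) == 0) return (size_t)(p - s) + 2;
        if ((w & 0x00000000ff000000ull) == 0) return (size_t)(p - s) + 3;
        if ((w & 0x000000ff00000000ull) == 0) return (size_t)(p - s) + 4;
        if ((w & 0x0000ff0000000000ull) == 0) return (size_t)(p - s) + 5;
        if ((w & 0x00ff000000000000ull) == 0) return (size_t)(p - s) + 6;
        if ((w & 0xff00000000000000ull) == 0) return (size_t)(p - s) + 7;
        p += 8;
tail:
        w = *(const uint64_t *)p;           /* load + first test at the bottom */
        if ((w & 0xffu) == 0)
            return (size_t)(p - s);
    }
}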

TAKE CARE OF BRANCH ALIGNMENT

Yes, take care of branch alignment (16-byte boundaries), especially for targets that are jumped to frequently (which, to be fair, you already handled).

GCC, REALLY?! WHY NOT MACRO-FUSED?

Here you (the asker) did the right thing: you used a register for the unaligned-prologue cmp, so those compares get the benefit of macro-fusion. In the code generated by GCC you can see cmp BYTE PTR [rcx], 0 instead, which removes the benefit of macro-fusion (a cmp with both a memory operand and an immediate cannot fuse with the following jcc). GCC presumably did this deliberately, but it's still not ideal.

An example of this situation in the uiCA analysis tool:

0000000000000000 <.text>:
   0:   80 39 00                cmp    BYTE PTR [rcx],0x0
   3:   0f 84 00 00 00 00       je     0x9
   9:   38 01                   cmp    BYTE PTR [rcx],al
   b:   0f 84 00 00 00 00       je     0x11

The second cmp gets the M flag, which stands for 'Macro-fused with previous instruction'.

On the first CPUs that supported it (Core 2), macro-fusion was restricted to 16-bit and 32-bit mode (including the 32-bit compatibility sub-mode of x86-64); Nehalem and later fuse in 64-bit mode as well. CMP and TEST can fuse when comparing:

REG-REG (e.g., CMP EAX,ECX; JZ label)
REG-IMM (e.g., CMP EAX,0x80; JZ label)
REG-MEM (e.g., CMP EAX,[ECX]; JZ label)
MEM-REG (e.g., CMP [EAX],ECX; JZ label)

CMP and TEST cannot be fused when comparing MEM-IMM (e.g., CMP [EAX],0x80; JZ label).

And finally, here is the function itself and its performance test, according to your requirements. This is the Microsoft x64 ABI version.

; libASM, independent standard libraries in Assembly (programming-language).
; For more information, please visit the libASM website (www.libasm.com).
; Copyright (C) 2023 Mr. Alireza Saeidipour. All rights reserved.

; Published by SOURCEBRING, under its international legal terms and conditions.
; For more information, please visit the SOURCEBRING website (www.sourcebring.com).

; “FAILURE GUARANTEES SUCCESS”
; — Alireza Saeidipour

    align.function
str_length:
    mov r8, rcx
    test    cl, 7
    jz  @f
    xor eax, eax
    cmp BYTE [rcx], al
    je  SHORT .len0
    cmp BYTE [rcx+1], al
    jne SHORT .unaligned_continue
    mov al, 1
    ret
    align.branch32
 .unaligned_continue:
    cmp BYTE [rcx+2], al
    je  SHORT .len2
    cmp BYTE [rcx+3], al
    je  SHORT .len3
    cmp BYTE [rcx+4], al
    je  SHORT .len4
    cmp BYTE [rcx+5], al
    je  SHORT .len5
    cmp BYTE [rcx+6], al
    je  .len6
    cmp BYTE [rcx+7], al
    je  .len7
    lea r8, [rcx+7]
    and r8, (-8)
    jmp @f
    align.branch
 .len0: ret
    align.branch
 .len2: mov eax, 2
    ret
    align.branch
 .len3: mov eax, 3
    ret
    align.branch
 .len4: mov eax, 4
    ret
    align.branch
 .len5: mov eax, 5
    ret
    align.branch
 .len6: mov eax, 6
    ret
    align.branch
 .len7: mov eax, 7
    ret
    align.branch
 .return_add7:
    lea rax, [r8+7]
    sub rax, r9
    ret
    align.branch
 @@:    mov r9, rcx
    mov ecx, 0x00ff0000
    mov edx, 0xff000000
    jmp SHORT @f
    align.branch32
 .loop: test    eax, ecx
    jz  SHORT .return_add2
    test    eax, edx
    jz  SHORT .return_add3
    shr rax, 32
    test    al, al
    jz  SHORT .return_add4
    test    ah, ah
    jz  SHORT .return_add5
    test    eax, ecx
    jz  SHORT .return_add6
    test    eax, edx
    jz  SHORT .return_add7
    add r8, 8
 @@:    mov rax, QWORD [r8]
    test    al, al
    jz  SHORT .return
    test    ah, ah
    jnz SHORT .loop
    lea rax, [r8+1]
    sub rax, r9
    ret
    align.branch
 .return:
    mov rax, r8
    sub rax, r9
    ret
    align.branch
 .return_add2:
    lea rax, [r8+2]
    sub rax, r9
    ret
    align.branch
 .return_add3:
    lea rax, [r8+3]
    sub rax, r9
    ret
    align.branch
 .return_add4:
    lea rax, [r8+4]
    sub rax, r9
    ret
    align.branch
 .return_add5:
    lea rax, [r8+5]
    sub rax, r9
    ret
    align.branch
 .return_add6:
    lea rax, [r8+6]
    sub rax, r9
    ret
 .size = $ - str_length

And the macros used in this source code:

macro align.function { align 32 }
macro align.branch { align 16 }
macro align.branch32 { align 32 }

This function is meant as a high-end non-SIMD solution. You will find SIMD versions of it (9 functions are dedicated to the string-length operation alone) in my library soon (the library will be released by the end of June 2023):

str_length
str_length_sse2
str_length_avx
str_length_avx2
str_length_avx512bw
str_length_long_sse2
str_length_long_avx
str_length_long_avx2
str_length_long_avx512bw

Test results (based on your parameters and your C test harness):

string length => 336
loop execution times => 10000000
total execution time => 0.430173

Yes, even faster than the one generated by GCC (0.522015). You will get the same result for an unaligned string too.

Also, there is no JCC erratum problem in my code (here is the hex dump of my function so you can check):

 49 89 c8 f6 c1 07 0f 84 c4 00 00 00 31 c0 38 01
 74 3e 38 41 01 75 09 b0 01 c3 90 90 90 90 90 90
 38 41 02 74 3b 38 41 03 74 46 38 41 04 74 51 38
 41 05 74 5c 38 41 06 74 67 38 41 07 74 72 4c 8d
 41 07 49 83 e0 f8 e9 85 00 00 00 90 90 90 90 90
 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
 b8 02 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
 b8 03 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
 b8 04 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
 b8 05 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
 b8 06 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
 b8 07 00 00 00 c3 90 90 90 90 90 90 90 90 90 90
 49 8d 40 07 4c 29 c8 c3 90 90 90 90 90 90 90 90
 49 89 c9 b9 00 00 ff 00 ba 00 00 00 ff eb 21 90
 85 c8 74 4c 85 d0 74 58 48 c1 e8 20 84 c0 74 60
 84 e4 74 6c 85 c8 74 78 85 d0 74 c4 49 83 c0 08
 49 8b 00 84 c0 74 19 84 e4 75 d5 49 8d 40 01 4c
 29 c8 c3 90 90 90 90 90 90 90 90 90 90 90 90 90
 4c 89 c0 4c 29 c8 c3 90 90 90 90 90 90 90 90 90
 49 8d 40 02 4c 29 c8 c3 90 90 90 90 90 90 90 90
 49 8d 40 03 4c 29 c8 c3 90 90 90 90 90 90 90 90
 49 8d 40 04 4c 29 c8 c3 90 90 90 90 90 90 90 90
 49 8d 40 05 4c 29 c8 c3 90 90 90 90 90 90 90 90
 49 8d 40 06 4c 29 c8 c3

Warning: note that FASM pads the 'align' directive with many single-byte NOPs (instead of one long NOP), so don't use this directive where execution falls through into it (i.e., where the padding would be executed directly); only use it after an unconditional jmp or ret.

Warning: for older CPUs' sake, keep your jump bodies short and use registers instead of immediates where that shrinks the code. And always handle the JCC erratum (you can lose around 1.3x performance to it).

With best regards.

2

I changed my code from

.loop: mov     rax, QWORD [rcx]
        test    al, al
        jz      SHORT .end
        test    ah, ah
        jz      SHORT .end.1
        test    eax, 0x00ff0000
        jz      SHORT .end.2
        test    eax, 0xff000000
        jz      SHORT .end.3
        shr     rax, 32
        test    al, al
        jz      SHORT .end.4
        test    ah, ah
        jz      SHORT .end.5
        test    eax, 0x00ff0000
        jz      SHORT .end.6
        test    eax, 0xff000000
        jz      SHORT .end.7
        add     rcx, 8
        jmp     SHORT .loop 

To this (on entry, we first jump to the '.loop' label):

.loop.continue:
        test    ah, ah
        jz      SHORT .end1
        test    eax, 0x00ff0000
        jz      SHORT .end2
        test    eax, 0xff000000
        jz      SHORT .end3
        shr     rax, 32
        test    al, al
        jz      SHORT .end4
        test    ah, ah
        jz      SHORT .end5
        test    eax, 0x00ff0000
        jz      SHORT .end6
        test    eax, 0xff000000
        jz      .end7
        lea     rcx, [rcx+8]
 .loop: mov     rax, QWORD [rcx]
        test    al, al
        jnz     SHORT .loop.continue
        mov     rax, rcx
        sub     rax, rdx
        ret

And even with the JCC erratum problem still present, I get a great result (0.532015). There was something wrong with my loop. In the first version we jumped to the top of the loop, loaded a QWORD, searched it for 0x00, added 8 to rcx (the string pointer) at the end, and then had to take an unconditional jump back to the top again.

But in the new version we jump to the end of the loop, handle the load and the first check there, and only jump back up to handle the remaining checks. By doing this, the speed problem is fixed!

UPDATED

I just tried to make the loop body smaller (in code size, starting from my first version), and the result was amazing!

strl:
        push    rdi
        push    rsi
        mov     rdi, rcx
        mov     rsi, rcx
        mov     ecx, 0x00ff0000
        mov     edx, 0xff000000
        mov     r8, 0x000000ff00000000
        mov     r9, 0x0000ff0000000000
        mov     r10, 0x00ff000000000000
        mov     r11, 0xff00000000000000
        test    dil, 7
        jz      @f
        ; handle unaligned
  align 32
 @@:    mov     rax, QWORD [rdi]
        test    al, al
        jz      SHORT .end
        test    ah, ah
        jz      SHORT .end1
        test    eax, ecx
        jz      SHORT .end2
        test    eax, edx
        jz      SHORT .end3
        test    rax, r8
        jz      SHORT .end4
        test    rax, r9
        jz      SHORT .end5
        test    rax, r10
        jz      SHORT .end6
        test    rax, r11
        jz      SHORT .end7
        add     rdi, 8
        jmp     @b
  align 16
 .end:  mov     rax, rdi
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end1: lea     rax, [rdi+1]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end2: lea     rax, [rdi+2]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end3: lea     rax, [rdi+3]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end4: lea     rax, [rdi+4]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end5: lea     rax, [rdi+5]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end6: lea     rax, [rdi+6]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret
  align 16
 .end7: lea     rax, [rdi+7]
        sub     rax, rsi
        pop     rsi
        pop     rdi
        ret      
  • One fewer `jmp` in the loop is a good thing; IDK maybe that was part of the problem. Running mostly not-taken macro-fused test+JCC, that's about 2 uops per clock throughput, and the average instruction length (counting test+JCC as one) is less than 8 bytes. So maybe legacy decode can keep up ok in this case. https://uica.uops.info/ predicts a decoder bottleneck of 8 cycles, but it seems to be assuming the jumps within the loop will all be taken (since it also thinks they can all only run on port 6, not p06.) Other analyzers like IACA don't do a useful job since they don't model the front-end – Peter Cordes May 26 '23 at 13:44
  • You might also be seeing a difference in branch predictor success rate, just due to different positioning of the branches and thus which ones alias the same BHT entry for the relevant history that leads to them. – Peter Cordes May 26 '23 at 13:46
  • Also, why are you using `lea rcx, [rcx+8]`? Is that an attempt to only let it run on ports 1 or 5 on Skylake, so it can't steal a port from macro-fused-JCC (port 0 / 6)? Uops get scheduled to ports by looking at the length of the queues, and the queues for p0 and p6 will pretty much always be longer than for ports 1/5 in this loop, so a normal `add rcx, 8` will be fine. Both are the same machine-code length, though, so it's not a disaster. – Peter Cordes May 26 '23 at 13:49
  • @PeterCordes `jmp` has no cost since I removed it and before the last `test`, I used `add rcx, 8` and the last `test` converted to `jnz .loop` so the `jmp` is removed ... but still nothing changed. Also converting `lea` to `add` changed nothing. – HelloGUI May 26 '23 at 15:01
  • @PeterCordes Something weird happened ! In my first code, I just tried to have fewer codes (in size) so I decided to use registers for those 0x00ff0000, 0xff000000, and even I removed shr rax, 32 and used r8-r11 registers for 0x000000ff0000000 to 0xff0000000000000, and it worked !!!!! It seems by having less code size in loop, we reach to our target !!! Do you have any idea what is the problem ?? Please check the UPDATE. – HelloGUI May 26 '23 at 18:55
  • It doesn't avoid the JCC erratum, but with the loop being only short instructions, legacy decode can decode it fast (at up to 16 bytes per cycle or 4 (macro-fused) instructions per cycle, producing up to 5 uops but they're all single-uop so 4). And/or you got lucky with branches aliasing each other for branch prediction. IDK why you'd reintroduce the `jmp` at the bottom, though, making one more jmp/jcc inside the loop that Skylake has to run on ports 0 or 6. [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) – Peter Cordes May 27 '23 at 02:08
  • You could use fewer registers if you put the masks for the top 4 bytes into the same register that looks for a byte in the low 4. e.g. if `RDX=0xff000000ff000000`, then `test eax, edx`/`jz .end3` will still check for a zero in byte 3, and `test rax, rdx`/`jz .end7` will test for a zero in byte 7 (and byte 3, but if that was zero we would have already branched to .end3 and not reached this test/jz.) – Peter Cordes May 27 '23 at 02:12
  • @PeterCordes about your last comment, it's not possible. Yes Checking `test eax, edx` and `jz .end` is correct but if the 0x00 be at 0xff00000000000000 the `jz .end7` never happen since the test condition become incorrect (0xff000000ff000000) the fourth character is not a 0x00 so `jz .end7` will not happen. – HelloGUI May 27 '23 at 08:33
  • Oh right, of course, all bits selected by the mask have to be zero for `jz` to be taken, so a non-zero in the lower byte will spoil it. `shr rax, 32` is probably doesn't cost any cycles (not creating a bottleneck, running in parallel with test/jz uops, and probably not a problem for the front-end). So you could do that to save registers and only need 32-bit constants, making the setup code before the loop cheaper while still using short instructions that legacy-decode can handle fast enough. – Peter Cordes May 27 '23 at 08:39
  • The fact that reading AH has extra latency is also probably not a problem, as long as branch prediction succeeds. ([How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139). Reading AL is always fine after writing a larger register, but a uop that reads AH introduces an extra cycle of latency to forward data to it, even when the AH input isn't on the critical path, interestingly.) I mention that since shifting would let you do another set of AH and AL tests instead of a mask – Peter Cordes May 27 '23 at 08:42