I was trying to compare the performance of my custom strlen() implementation against the standard C function.

inline ui64 Sl(const char *cs)
{
    const char *sbeg = cs--;
    while(*(++cs) != 0)
    {
        ;
    }
    
    return ui64(cs - sbeg);
}

I called both functions from separate wrapper functions, each storing the result in a volatile local variable, to force the compiler to actually generate the code even with all optimizations enabled:

void __declspec(noinline) testR()
{
    volatile ui64 len = strlen(str12);
}

void __declspec(noinline) testG()
{
    volatile ui64 len = Sl(str12);
}

The wrappers were called from an unoptimised loop a couple of million times:

#pragma optimize( "", off )
void test()
{
    SWSET
    for(ui64 i = 0; i < TST_CYCLES; ++i)
    {
        testR();
    }
    SWSTOP
    SWSET
    for(ui64 i = 0; i < TST_CYCLES; ++i)
    {
        testG();
        //len = Sl(str);
    }
    SWSTOP
    
    SWIDRESET
}
#pragma optimize( "", on )
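The SWSET/SWSTOP stopwatch macros are not shown here. As a rough, portable sketch of the same kind of measurement (assuming std::chrono::steady_clock is an acceptable timer; ns_per_call, compare_alternating and BenchResult are hypothetical names, not part of the original test), the harness below runs an untimed warm-up pass first and alternates which function is timed first in each round, so neither candidate always runs while the CPU is still ramping up:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for SWSET/SWSTOP: average nanoseconds per call of fn.
template <typename Fn>
double ns_per_call(Fn &fn, std::size_t cycles)
{
    if (cycles == 0)
        return 0.0;                       // nothing to measure
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < cycles; ++i)
        fn();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / static_cast<double>(cycles);
}

// Best (lowest) per-call time seen for each candidate.
struct BenchResult
{
    double a_ns;
    double b_ns;
};

template <typename FnA, typename FnB>
BenchResult compare_alternating(FnA a, FnB b, std::size_t cycles, int rounds)
{
    // Untimed warm-up pass: touches the code and data once before measuring.
    for (std::size_t i = 0; i < cycles; ++i)
    {
        a();
        b();
    }

    BenchResult best{0.0, 0.0};
    for (int r = 0; r < rounds; ++r)
    {
        double ta, tb;
        if (r % 2 == 0)                   // even rounds: a first
        {
            ta = ns_per_call(a, cycles);
            tb = ns_per_call(b, cycles);
        }
        else                              // odd rounds: b first
        {
            tb = ns_per_call(b, cycles);
            ta = ns_per_call(a, cycles);
        }
        if (r == 0 || ta < best.a_ns) best.a_ns = ta;
        if (r == 0 || tb < best.b_ns) best.b_ns = tb;
    }
    return best;
}
```

With the wrappers from this post, a call might look like compare_alternating(testR, testG, TST_CYCLES, 10). The callables still need an observable side effect (such as the volatile store in the wrappers), otherwise an optimizing compiler may remove the timed work entirely.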

The results were intriguing. My function turned out to be a bit faster (15-20%) on smaller strings, and identical on larger strings (1024+). So I decided to examine the ASM output to see what was going on. Now, as you might know, strlen() is a standard C function and its declaration is located in the string.h file:

_Check_return_
size_t __cdecl strlen(
    _In_z_ char const* _Str
    );

Of course, there is no implementation anywhere to be found (at least for MSVC). But you might assume (as I did before this) that the actual code lives in some DLL or static library somewhere. Now imagine my surprise when I saw this:

; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
;   COMDAT ?testR@@YAXXZ
_TEXT   SEGMENT
len$ = 8
?testR@@YAXXZ PROC                  ; testR, COMDAT

; 44   :    volatile ui64 len = strlen(str);

  00000 48 8d 0d 00 00
    00 00        lea     rcx, OFFSET FLAT:?str@@3PADA ; str
  00007 48 c7 c0 ff ff
    ff ff        mov     rax, -1
  0000e 66 90        npad    2  ; >>> xchg  ax,ax 
$LL3@testR:
  00010 48 ff c0     inc     rax
  00013 80 3c 01 00  cmp     BYTE PTR [rcx+rax], 0
  00017 75 f7        jne     SHORT $LL3@testR
  00019 48 89 44 24 08   mov     QWORD PTR len$[rsp], rax

; 45   : }

  0001e c3       ret     0
?testR@@YAXXZ ENDP                  ; testR
_TEXT   ENDS

The compiler inlined a "library" function! How is this even possible? Am I misunderstanding how the compiler works?

As far as I know, it can't inline functions from a DLL or static library... If that is right, the only explanation is that the function is actually generated by the compiler itself! So that string.h declaration is just a dummy! Is this correct?

One more piece of evidence for this is that the compiler actually cheats if you feed a const C-string literal into strlen()! The previous ASM was generated with a global char[] array that was memset() in main(), so the compiler couldn't possibly know its length at compile time. When I fed it a string literal, look what I got:

; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
;   COMDAT ?testR@@YAXXZ
_TEXT   SEGMENT
len$ = 8
?testR@@YAXXZ PROC                  ; testR, COMDAT

; 60   :    volatile ui64 len = strlen(str200);

  00000 48 c7 44 24 08
    c8 00 00 00  mov     QWORD PTR len$[rsp], 200 ; 000000c8H

; 61   : }

  00009 c3       ret     0
?testR@@YAXXZ ENDP                  ; testR
_TEXT   ENDS

Yep. It just pastes the length of the cstring literal directly!
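That the length of a literal is available at compile time can be illustrated without strlen() at all. A minimal sketch, assuming a C++11 compiler; kLiteral, kLiteralLen and literal_length are hypothetical names used only for this illustration, and the count is only a string length when the literal has no embedded '\0':

```cpp
#include <cstddef>

// Hypothetical example string: 12 visible characters plus the terminating '\0'.
constexpr char kLiteral[] = "twelve chars";

// sizeof counts the terminator too, so subtract 1 to get the string length.
constexpr std::size_t kLiteralLen = sizeof(kLiteral) - 1;

// The array bound N is part of the argument's type, so the length is a
// compile-time constant and no loop over the characters is needed.
template <std::size_t N>
constexpr std::size_t literal_length(const char (&)[N])
{
    return N - 1;
}

static_assert(kLiteralLen == 12, "length is known at compile time");
static_assert(literal_length("abc") == 3, "deduced from the array type");
```

Since the compiler already has this information for any array initialised from a literal, folding strlen() of such an array into a constant is a natural optimization once strlen() is treated as a built-in.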

P.S.

One more side question that I can't figure out... Why is my function faster on small strings?

; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
;   COMDAT ?testG@@YAXXZ
_TEXT   SEGMENT
len$ = 8
?testG@@YAXXZ PROC                  ; testG, COMDAT

; 17   :    const char *sbeg = cs--;

  00000 48 8d 0d 00 00
    00 00        lea     rcx, OFFSET FLAT:?str@@3PADA ; str
  00007 48 8d 41 ff  lea     rax, QWORD PTR [rcx-1]
  0000b 0f 1f 44 00 00   npad    5  ; >>> nop   DWORD PTR [rax+rax*1+0x0] 
$LL4@testG:

; 18   :    while(*(++cs) != 0)

  00010 48 ff c0     inc     rax
  00013 80 38 00     cmp     BYTE PTR [rax], 0
  00016 75 f8        jne     SHORT $LL4@testG

; 19   :    {
; 20   :        ;
; 21   :    }
; 22   :    
; 23   :    return ui64(cs - sbeg);

  00018 48 2b c1     sub     rax, rcx

; 49   :    volatile ui64 len = Sl(str);

  0001b 48 89 44 24 08   mov     QWORD PTR len$[rsp], rax

; 50   : }

  00020 c3       ret     0
?testG@@YAXXZ ENDP                  ; testG
_TEXT   ENDS

The loop part is identical to strlen()'s but uses one fewer addition (or does cmp BYTE PTR [rcx+rax], 0 take the same number of CPU cycles as cmp BYTE PTR [rax], 0?)...

I initially assumed that it would be slower on short strings because of the additional setup before the loop and the subtraction after it, and faster on longer strings because of fewer operations in the loop itself.

However, profiling shows that it is actually faster on small strings (1-10 chars) and identical in speed on bigger strings (1024+)... which greatly confuses me!

Sep Roland
ScienceDiscoverer
  • It's normal that compilers treat `strlen` and other important functions as "builtin" / "intrinsic" functions so they can constant-propagate through them or inline special-case versions (e.g. 4B `memcpy`). GCC does the same thing, except it will never expand it to a naive byte-at-a-time loop that performs like total garbage for non-tiny strings. (At `-O1` it used to expand it to `repne scasb` which was equally bad or worse, but fixed after [Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?](https://stackoverflow.com/a/55589634) revealed how bad that could be.) – Peter Cordes Aug 13 '23 at 17:53
  • *actually faster on small strings (1-10 chars)* - That might be an artifact of sloppy benchmarking. With small strings, warm-up (page faults, CPU frequency) is more of a factor, and you test `strlen` before `Sl`. If you test in the other order, it'll be your `Sl` that does many of its first runs with the CPU not yet at max turbo. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987). (The string you're scanning was just written recently, so the page fault on that memory happened outside the timed region.) – Peter Cordes Aug 13 '23 at 17:58
  • Just to mention: There is an implementation of strlen in msvcrt.dll. As Peter describes, the compiler is unlikely to choose that one if optimizations are enabled. If you want to try to force it, you might try LoadLibrary/GetProcAddress. – David Wohlferd Aug 14 '23 at 08:15
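Besides LoadLibrary/GetProcAddress, there are lighter ways to try to get a real out-of-line call for comparison. A hedged sketch of two of them: MSVC's `#pragma function(strlen)`, which asks the compiler to emit calls instead of the intrinsic expansion, and a portable substitute technique, calling through a volatile function pointer, which the compiler cannot see through. The helper name library_strlen is hypothetical, and which strlen the call finally reaches (static CRT or a DLL) depends on how the program is linked:

```cpp
#include <cstddef>
#include <string.h>

#ifdef _MSC_VER
// MSVC only: generate real calls to strlen instead of its intrinsic form.
#pragma function(strlen)
#endif

// Hypothetical helper (not from the original post). The pointer is volatile,
// so it must be loaded at run time and the call cannot be inlined or
// constant-folded, even for string literals.
// Note: taking the address of a standard library function works on mainstream
// toolchains, although C++20 does not formally guarantee it.
inline std::size_t library_strlen(const char *s)
{
    std::size_t (*volatile fn)(const char *) = strlen;
    return fn(s);
}
```

Swapping library_strlen(str12) into testR() would then time an actual library routine rather than the byte-at-a-time loop the compiler expanded inline, at the cost of an indirect call on every iteration.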

0 Answers