I was trying to assess the effectiveness of my custom strlen() implementation versus the standard C library function.
inline ui64 Sl(const char *cs)
{
    const char *sbeg = cs--;
    while(*(++cs) != 0)
    {
        ;
    }
    return ui64(cs - sbeg);
}
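A minimal sanity check that Sl() agrees with strlen(), assuming ui64 is a typedef for an unsigned 64-bit integer (the typedef itself isn't shown above):
#include <cassert>
#include <cstdint>
#include <cstring>

typedef std::uint64_t ui64; // assumption: the actual typedef isn't shown in the post

int main()
{
    // Compare Sl() (as defined above) against the library strlen() on a few inputs.
    const char *samples[] = { "", "a", "hello", "a somewhat longer test string" };
    for(const char *s : samples)
    {
        assert(Sl(s) == std::strlen(s));
    }
    return 0;
}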
I called both functions from separate wrapper functions with a volatile local variable, to force the compiler to actually generate the code, with all optimizations enabled:
void __declspec(noinline) testR()
{
    volatile ui64 len = strlen(str12);
}

void __declspec(noinline) testG()
{
    volatile ui64 len = Sl(str12);
}
These wrappers were called from an unoptimized loop a couple of million times:
#pragma optimize( "", off )
void test()
{
SWSET
for(ui64 i = 0; i < TST_CYCLES; ++i)
{
testR();
}
SWSTOP
SWSET
for(ui64 i = 0; i < TST_CYCLES; ++i)
{
testG();
//len = Sl(str);
}
SWSTOP
SWIDRESET
}
#pragma optimize( "", on )
The results were intriguing... My function turned out to be a bit faster (15-20%) on smaller strings, and identical on larger strings (1024+ characters). So I decided to examine the ASM output to see what is going on there. Now, as you might know, strlen() is a standard C library function, and its declaration is located in string.h:
_Check_return_
size_t __cdecl strlen(
_In_z_ char const* _Str
);
Of course, there is no implementation to be found anywhere (at least for MSVC). But you might assume (as I did, before this) that the actual code lives in some DLL or static library somewhere. Now imagine my surprise when I saw this:
; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
; COMDAT ?testR@@YAXXZ
_TEXT SEGMENT
len$ = 8
?testR@@YAXXZ PROC ; testR, COMDAT
; 44 : volatile ui64 len = strlen(str);
00000 48 8d 0d 00 00
00 00 lea rcx, OFFSET FLAT:?str@@3PADA ; str
00007 48 c7 c0 ff ff
ff ff mov rax, -1
0000e 66 90 npad 2 ; >>> xchg ax,ax
$LL3@testR:
00010 48 ff c0 inc rax
00013 80 3c 01 00 cmp BYTE PTR [rcx+rax], 0
00017 75 f7 jne SHORT $LL3@testR
00019 48 89 44 24 08 mov QWORD PTR len$[rsp], rax
; 45 : }
0001e c3 ret 0
?testR@@YAXXZ ENDP ; testR
_TEXT ENDS
Compiler inlined "library" function! How is this even possible? Am I not understanding how compiler works?
As far as I know, it can't inline DLL/Static Lib's functions...
If not, the only explanation is that the function is actually generated by the compiler itself! So that string.h
declaration is just a dummy! Is this correct?
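One experiment that should settle it (I have not re-run the full benchmark this way): MSVC has #pragma function, which is documented to force a real call to certain CRT functions instead of the intrinsic expansion. Something along these lines, with a hypothetical wrapper name, should show whether the loop above is really compiler-generated:
#include <string.h>

// Ask MSVC to emit an actual call to the CRT strlen() instead of
// expanding it as an intrinsic (MSVC-specific pragma).
#pragma function(strlen)

void __declspec(noinline) testR_call()   // hypothetical name, not from the original test
{
    volatile size_t len = strlen(str12); // str12: the same global array as above
}
If the guess is right, the disassembly of this version should contain a call instruction to the CRT's strlen rather than the inlined cmp/jne loop.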
One more piece of evidence for this is that the compiler actually cheats if you feed a const string literal into strlen()! The previous ASM was generated with a global char[] array that was filled with memset() in main(), so the compiler couldn't possibly know its length at compile time. When I fed it a string literal, look what I got:
; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
; COMDAT ?testR@@YAXXZ
_TEXT SEGMENT
len$ = 8
?testR@@YAXXZ PROC ; testR, COMDAT
; 60 : volatile ui64 len = strlen(str200);
00000 48 c7 44 24 08
c8 00 00 00 mov QWORD PTR len$[rsp], 200 ; 000000c8H
; 61 : }
00009 c3 ret 0
?testR@@YAXXZ ENDP ; testR
_TEXT ENDS
Yep. It just pastes the length of the string literal directly into the code!
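(As an aside, the usual trick to stop this kind of compile-time folding in a benchmark is to hide the pointer behind a volatile, so the compiler can no longer assume which string it points to; just a sketch, I haven't checked the resulting ASM for it:)
#include <string.h>

// Sketch: loading the pointer through a volatile should keep the compiler
// from folding strlen() of a literal into a constant.
void __declspec(noinline) testR_literal()   // hypothetical name
{
    const char * volatile p = "some string literal";
    volatile size_t len = strlen(p);
}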
P.S.
One more side question that I can't figure out... Why is my function faster on small strings?
; Function compile flags: /Ogtpy
; File D:\P\MT\prftst.cpp
; COMDAT ?testG@@YAXXZ
_TEXT SEGMENT
len$ = 8
?testG@@YAXXZ PROC ; testG, COMDAT
; 17 : const char *sbeg = cs--;
00000 48 8d 0d 00 00
00 00 lea rcx, OFFSET FLAT:?str@@3PADA ; str
00007 48 8d 41 ff lea rax, QWORD PTR [rcx-1]
0000b 0f 1f 44 00 00 npad 5 ; >>> nop DWORD PTR [rax+rax*1+0x0]
$LL4@testG:
; 18 : while(*(++cs) != 0)
00010 48 ff c0 inc rax
00013 80 38 00 cmp BYTE PTR [rax], 0
00016 75 f8 jne SHORT $LL4@testG
; 19 : {
; 20 : ;
; 21 : }
; 22 :
; 23 : return ui64(cs - sbeg);
00018 48 2b c1 sub rax, rcx
; 49 : volatile ui64 len = Sl(str);
0001b 48 89 44 24 08 mov QWORD PTR len$[rsp], rax
; 50 : }
00020 c3 ret 0
?testG@@YAXXZ ENDP ; testG
_TEXT ENDS
The loop part is identical to strlen()'s, but uses one less addition (or does cmp BYTE PTR [rcx+rax], 0 cost the same number of CPU cycles as cmp BYTE PTR [rax], 0?)...
I initially assumed that it would be slower on short strings because of the additional preprocessing before the loop and the subtraction after it, and faster on longer strings because of the fewer operations in the loop itself.
However, profiling shows that it is actually faster on small strings (1-10 chars) and identical in speed on bigger strings (1024+ chars)... which greatly confuses me!