The code from the question seems to have a defect: Mycroft's magic constants have not been extended to 64 bits. As for efficiency: the various x86-64 calling conventions are primarily register-based, so it is not necessary to maintain a stack frame for simple functions. The main loop in the posted code appears a bit too complex; note that most x86 instructions set flags that can be used for branching. Instead of processing byte-wise until reaching the next QWORD
boundary from the start of the string, simply read the entire QWORD
it falls in, and overwrite the bytes preceding start of the string with 0xff
. The resulting "masked' data can then be processed by the regular QWORD
processing loop.
Below is x86-64 code written for Windows that incorporates these suggestions. Adjusting it to the System V ABI used by Linux should require nothing more than a cyclical exchange of registers. Note that on CPUs with BMI instruction set extension the 64-bit processing loop can be reduced by one instructions by merging the not rcx
with the following and rcx, r9
into andn rcx, rcx, r9
.
PUBLIC strglen
_TEXT SEGMENT
ALIGN 16
;; Alan Mycroft's null-byte detection algorithm (newsgroup comp.lang.c, 1987/04/08,
;; https://groups.google.com/forum/#!original/comp.lang.c/2HtQXvg7iKc/xOJeipH6KLMJ):
;; null_byte(x) = ((x - 0x01010101) & (~x & 0x80808080))
mycroft_magic1 EQU 0101010101010101h
mycroft_magic2 EQU 8080808080808080h
;; size_t strglen (const char *src);
;; Windows x86-64 calling convention:
;; function arguments: rcx, rdx, r8, r9
;; function return value: rax
;; scratch registers: rax, rcx, rdx, r8, r9, r10, r11
strglen PROC
mov rax, rcx ; src pointer
mov r8, rcx ; src pointer
mov r9, mycroft_magic2 ; 0x8080808080808080
mov r10, -mycroft_magic1 ; -0x0101010101010101
and rcx, 7 ; ofs = src & 7; src qword aligned ?
jz process_qwords ; yes, ofs is zero, process qword wise
sub rax, rcx ; src -= ofs
shl rcx, 3 ; ofs * CHAR_BIT
mov rdx, -1 ; ~0
shl rdx, cl ; ~0 << (ofs * CHAR_BIT)
mov rcx, qword ptr [rax] ; load aligned qword
not rdx ; mask = ~(~0 << (ofs * CHAR_BIT))
or rcx, rdx ; set unused bytes to 0xff
jmp process_qwords_alt ; use alternate loop entry
ALIGN 16
process_qwords:
mov rcx, qword ptr [rax] ; load aligned qword
process_qwords_alt:
add rax, 8 ; src += 8
lea rdx, [rcx + r10] ; qword - 0x0101010101010101
not rcx ; ~qword
and rcx, r9 ; ~qword & 0x8080808080808080
and rcx, rdx ; (qword - 0x0101010101010101) & (~qword & 0x8080808080808080)
jz process_qwords ; if zero, no null byte found, continue
bsf rcx, rcx ; find first set bit (null byte)
shr rcx, 3 ; convert bit position to byte position
lea rax, [rax + rcx - 8] ; reverse pre-increment to compute byte count
sub rax, r8 ; src - original src = length
ret
strglen ENDP
ALIGN 16
_TEXT ENDS
END
I used the ISO-C99 test scaffolding below to check the correctness of the strglen
implementation above. I built with the Microsoft Macro Assembler and the Intel C/C++ compiler, as follows: ml64 -c strglen.asm
, icl /Qstd=c99 /Ox /W4 strlen.c strglen.obj
. I also inspected the generated machine code with dumpbin /disasm strglen.obj
, mostly to make sure the alignment directives work as desired and that loop unrolling via the REPT
macro works correctly, as I had not used that in the past twenty years.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern size_t strglen (const char* str);
int main (void)
{
const char a[] = "0123456789 the quick brown fox jumps over the lazy dog";
char* src = malloc (sizeof(a));
size_t ref, res;
printf ("src=%p\n", a);
for (int srcofs = 0; srcofs < 8; srcofs++) {
for (size_t len = 0; len < sizeof(a); len++) {
memcpy (src, a, sizeof(a));
src [len] = 0;
ref = strlen (src + srcofs);
res = strglen (src + srcofs);
if (res != ref) {
printf ("error @ srcofs=%d len=%zu ref=%zu res=%zu\n",
srcofs, len, ref, res);
return EXIT_FAILURE;
}
}
}
printf ("Test passed\n");
return EXIT_SUCCESS;
}