But the program looks ugly
Congratulations for noticing :P
I'm looking for a way to make this as fast as possible
SSE2 is baseline for x86-64, so you should use it. You can do this in a couple instructions, using pcmpeqb / pmovmskb to get a bitmap of byte-compare results, then use a bit-scan instruction like bsr (scan reverse gives you the index of the highest set bit).
default rel                  ; don't forget this: RIP-relative addressing is best for loading/storing global data
SYS_exit equ 60              ; x86-64 Linux system call number for exit

section .text
global _start
_start:
    movq     xmm0, [dt]               ; movq xmm0, rdx or whatever works too.
    pcmpeqb  xmm0, [newline_mask]     ; -1 for match, 0 for no-match
    pmovmskb edi, xmm0
    bsr      edi, edi                 ; index of highest set bit
    mov      eax, SYS_exit            ; mov doesn't touch flags, so ZF from bsr survives
    jz       .not_found               ; BSR sets ZF if the *input* was zero
    ; [dt+rdi] == 0xA
    syscall                           ; exit(0..7)
.not_found:
    mov      edi, -1                  ; exit only cares about the low byte of its arg; a 64-bit -1 is pointless.
    syscall

section .rodata
align 16
newline_mask: times 16 db 0x0a

section .data
dt: dq 0xAB97450A8733AA1F
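For readers who prefer intrinsics, here is a minimal C sketch of the same pcmpeqb / pmovmskb / bsr sequence. It assumes x86-64 with GCC or Clang (for `_mm_cvtsi64_si128` and `__builtin_clz`), and the function name `last_newline_index8` is made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: index (0..7) of the highest byte equal to 0x0a
   in an 8-byte value, or -1 if no byte matches. */
static int last_newline_index8(uint64_t value)
{
    /* movq xmm0, reg: the upper 8 bytes of the vector are zero,
       so they can never match 0x0a. */
    __m128i v = _mm_cvtsi64_si128((long long)value);

    /* pcmpeqb: each byte becomes 0xFF on a match, 0x00 otherwise. */
    __m128i hits = _mm_cmpeq_epi8(v, _mm_set1_epi8(0x0a));

    /* pmovmskb: one bit per byte, bit i corresponds to byte i. */
    unsigned mask = (unsigned)_mm_movemask_epi8(hits);

    if (mask == 0)
        return -1;                       /* the bsr-with-ZF case */

    /* Equivalent of bsr: index of the highest set bit (GCC/Clang builtin). */
    return 31 - __builtin_clz(mask);
}
```

With the value from the asm example, `last_newline_index8(0xAB97450A8733AA1F)` finds the 0x0A at byte index 4, the same as the exit status of the asm version.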
Obviously in a loop you'd keep newline_mask in a register (and then you can broadcast-load it with AVX vbroadcastss, or SSE3 movddup, instead of needing a whole 16-byte constant in memory).
And of course you can do this for 16 bytes at a time with a movdqu load, or 32 bytes at a time with AVX2. If you have a large buffer, you're basically implementing a backwards memchr (i.e. memrchr) and should look at optimized library implementations. They might combine pcmpeqb results for a whole cache line with por, so they save 3/4 of the pmovmskb work until the end when they sort out which part of the cache line had the hit.
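As a rough sketch of the 16-bytes-at-a-time version, the following C function searches a buffer backwards with unaligned loads. It is not how any particular library does it (there is no cache-line por trick and no alignment handling); `find_last_byte` is a hypothetical name, and `__builtin_clz` assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper, not a library function: index of the last
   occurrence of byte c in buf[0..len), or -1 if it does not occur. */
static ptrdiff_t find_last_byte(const unsigned char *buf, size_t len,
                                unsigned char c)
{
    /* Broadcast once outside the loop so the compare mask stays in a register. */
    const __m128i needle = _mm_set1_epi8((char)c);
    size_t i = len;

    /* Walk backwards one 16-byte chunk at a time. */
    while (i >= 16) {
        i -= 16;
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i)); /* movdqu */
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            /* Highest set bit = last matching byte within this chunk. */
            return (ptrdiff_t)(i + (size_t)(31 - __builtin_clz(mask)));
        }
    }

    /* Fewer than 16 bytes remain at the front: finish with a scalar loop. */
    while (i > 0) {
        --i;
        if (buf[i] == c)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

The scalar tail keeps every load inside the buffer; an optimized implementation would more likely use aligned loads and mask off the bytes outside the range.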
If you care about AMD CPUs (where bsr is slow), maybe separately test for input=0 with test edi,edi / jz before using lzcnt. (lzcnt(x) gives you 31-bsr(x), or 32 if the input was all-zero, so the bit index is 31-lzcnt(x).) If you can depend on lzcnt being available...
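The relationship between a leading-zero count and the bsr result can be checked with a few lines of C. This is only an illustration: `bsr32` and `lzcnt32` are made-up names that model the instructions' results with the GCC/Clang builtin `__builtin_clz` (which is itself undefined for a zero input, hence the explicit check).

```c
#include <stdint.h>

/* Models bsr: index of the highest set bit. x must be non-zero,
   just as the bsr destination is not meaningful for a zero input. */
static int bsr32(uint32_t x)
{
    return 31 - __builtin_clz(x);
}

/* Models lzcnt: number of leading zero bits, defined as 32 for x == 0. */
static int lzcnt32(uint32_t x)
{
    return x ? __builtin_clz(x) : 32;
}
```

For any non-zero x, `lzcnt32(x) == 31 - bsr32(x)`, so a pmovmskb mask can be turned into a byte index with either one.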
If you wanted to do it with a scalar loop, you could use byte compares on the low byte of a register instead of copying and masking the value.
    ; we test byte 7 first, so start the counter there.
    mov   edi, 7             ; no idea why you were using a 64-bit counter register
    ; loop body runs with edi=7..0
.loop:                       ; do{
    rol   rbx, 8             ; high byte becomes low
    cmp   bl, 0xa            ; check the low byte
    je    .found
    dec   edi
    jge   .loop              ; } while(--edi>=0)  signed compare
    ; not found falls through with edi=-1
.found:
    mov   eax, SYS_exit
    syscall                  ; exit(7..0) for found, or exit(-1) for not-found
Depending on what you're doing with the result, you might arrange your loop counter differently.
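For comparison, the rotate-and-compare loop can be written in C roughly like this. It is a sketch of the same logic, not a faster alternative; the function name `last_newline_index8_scalar` is made up, and the explicit shift/or pair is just the portable way to spell an 8-bit left rotate.

```c
#include <stdint.h>

/* Hypothetical helper: index (7..0) of the highest byte equal to 0x0a,
   or -1 if no byte matches. Mirrors the asm loop's counter direction. */
static int last_newline_index8_scalar(uint64_t x)
{
    for (int i = 7; i >= 0; --i) {
        x = (x << 8) | (x >> 56);   /* rol x, 8: high byte becomes low */
        if ((uint8_t)x == 0x0a)     /* cmp bl, 0xa */
            return i;
    }
    return -1;                      /* fell out of the loop: not found */
}
```

Counting down from 7 means the loop counter is already the byte index when a match is found; if you wanted something else from the search (a pointer, say, or a count of remaining bytes), you would pick the counter to match.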