0

I think this is a pretty common task and there should be some fast and concise solution. I have a quadword and I want to get the lowest byte position which is equal to 0x0A (line break in Linux). I wrote the following simple program:

SYS_exit equ 0x3C

section .text
    global _start

_start:
    mov rax, 0x0A
    mov rbx, [dt]
    mov rcx, 0x07
    loop:
        mov r13, rbx
        and r13, 0xFF
        cmp r13, 0x0A
        jz ok
        shr rbx, 8
        dec rcx
        jnz loop
        jmp fail
    ok:
        mov rax, SYS_exit
        mov rdi, 8
        sub rdi, rcx
        syscall
    fail:
        mov rax, SYS_exit
        mov rdi, -1
        syscall


section .data
    dt: dq 0xAB97450A8733AA1F

And it works pretty much fine. strace ./bin prints

execve("./bin", ["./bin"], [/* 69 vars */]) = 0
exit(5)                                 = ?
+++ exited with 5 +++

But the program looks ugly and actually I'm looking for a way to make this as fast as possible. Can you give some optimization advice?

markgz
  • 6,054
  • 1
  • 19
  • 41
St.Antario
  • 26,175
  • 41
  • 130
  • 318

1 Answers1

3

But the program looks ugly

Congratulations for noticing :P

I'm looking for a way to make this as fast as possible

SSE2 is baseline for x86-64, so you should use it. You can do this in a couple instructions, using pcmpeqb / pmovmskb to get a bitmap of byte-compare results, then use a bit-scan instruction like bsr (scan reverse gives you the index of the highest set bit).

default  rel      ; don't forget this: RIP-relative addressing is best for loading/storing global data

_start:
    movq    xmm0, [dt]             ; movq xmm0, rdx  or whatever works too.
    pcmpeqb xmm0, [newline_mask]   ; -1 for match, 0 for no-match
    pmovmskb edi, xmm0
    bsr     edi, edi                  ; index of highest set bit

    mov  eax, SYS_exit
    jz  .not_found                 ; BSR sets ZF if the *input* was zero
    ; [dt+rdi] == 0xA
    syscall                        ; exit(0..7)

.not_found:
    mov  edi, -1                   ; exit only cares about the low byte of its arg; a 64-bit -1 is pointless.
    syscall

section .rodata
    align 16
    newline_mask: times 16 db 0x0a

section .data
    dt: dq 0xAB97450A8733AA1F

Obviously in a loop you'd keep newline_mask in a register (and then you can broadcast-load it with AVX vbroadcastss, or SSE3 movddup, instead of needing a whole 16 byte constant in memory).

And of course you can do this for 16 bytes at a time with a movdqu load, or 32 bytes at a time with AVX2. If you have a large buffer, you're basically implementing a backwards memcmp and should look at optimized library implementations. They might combine pcmpeqb results for a whole cache line with por, so they save 3/4 of the pmovmskb work until the end when they sort out which part of the cache line had the hit.

If you care about AMD CPUs (where bsr is slow), maybe separately test for input=0 with test edi,edi / jz before using tzcnt. (tzcnt(x) gives you 31-bsr(x), or 32 if the input was all-zero.) If you can depend on BMI2 being available...


If you wanted to do it with a scalar loop, you could use byte compares on the low byte of a register instead of copying and masking the value.

    ; we test byte 7 first, so start the counter there.
    mov  edi, 7         ; no idea why you were using a 64-bit counter register
   ; loop body runs with edi=7..0
.loop:                ; do{
    rol  rbx, 8         ; high byte becomes low

    cmp  bl, 0xa        ; check the low byte
    je   .found

    dec  edi
    jge  .loop        ; } while(--edi>=0) signed compare
   ; not found falls through with edi=-1

 .found:
    mov  eax, SYS_exit
    syscall           ; exit(7..0) for found, or exit(-1) for not-found

Depending on what you're doing with the result, you might arrange your loop counter differently.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • BTW, this gets the *last* line separator, like your question title asks for. I assumed it was a bug in your code. Use `bsf` if you want the first. (Or in the scalar loop, use `shr rbx,8` *after* checking the low byte). – Peter Cordes May 23 '18 at 20:34
  • @fuz: you can use it, but I'm not sure it would be a *good* use even though it has the BSF functionality built-in. On Skylake it's 3 uops for p0, with 12c latency. But `pcmpeqb` is 1c for p01, `pmovmskb` is 2 to 3c for p0, and `bsf` is 1c for p1. So this has better throughput and latency than `pcmpistri`. `pcmpistri` could possibly be useful if you don't know the length and have to check for a terminating byte as well, i.e. to implement `strchr` instead of `memchr`. But probably still no; probably only useful for more complicated searches like character ranges. – Peter Cordes May 24 '18 at 09:50
  • 2
    And if you have AVX2, SSE4.2 string instructions aren't worth considering for this; 32 bytes at a time blows them away. – Peter Cordes May 24 '18 at 09:53