The call to printf completely dominates the run-time of even that highly inefficient loop. (Did you notice that you push/pop rcx even though you never use it anywhere? Perhaps that's left over from using the slow LOOP instruction).
To learn more about writing efficient x86 asm, see Agner Fog's Optimizing Assembly guide. (And his microarchitecture guide, too, if you want to really get into the details of specific CPUs and how they differ: what's optimal on one uarch might not be on another. e.g. IMUL r64 has much better throughput and latency on Intel CPUs than on AMD, but CMOV and ADC are 2 uops with 2-cycle latency on Intel pre-Broadwell, vs. 1 uop on AMD, since 3-input ALU m-ops (FLAGS + both registers) aren't a problem for AMD.) Also see other links in the x86 tag wiki.
Purely optimizing the loop without changing the 5M calls to printf is useful only as an example of how to write a loop properly, not to actually speed up this code. But let's start with that:
; trivial fixes to loop efficiently while calling the same slow function
global main
extern printf

main:
    push    rbx
    mov     ebx, 5000000          ; don't waste a REX prefix on constants that fit in 32 bits

.print:
    ;; removed the push/pops from inside the loop:
    ;; use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
    mov     edi, format           ; static data / code always has a 32-bit address
    mov     esi, ebx
    xor     eax, eax              ; the x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
    call    printf
    dec     ebx
    jnz     .print

    pop     rbx                   ; restore rbx, the one call-preserved reg we actually used
    xor     eax, eax              ; successful exit status
    ret

section .rodata                   ; it's usually best to put constant data in a separate section of the text segment, not right next to code
format: db "%ld", 10, 0
To speed this up, we should take advantage of the redundancy in converting consecutive integers to strings. Since "5000000\n" is only 8 bytes long (including the newline), the string representation fits in a 64-bit register.
We can store that string into a buffer and increment a pointer by the string length. (Since it will get shorter for smaller numbers, just keep the current string length in a register, which you can update in the special-case branch where it changes.)
We can decrement the string representation in-place to avoid ever (re)doing the process of dividing by 10 to turn an integer into a decimal string.
Since carry/borrow doesn't naturally propagate inside a register, and the AAS instruction isn't available in 64-bit mode (and only worked on AX, not even EAX, and is slow), we have to do it ourselves. We're decrementing by 1 every time, so we know what's going to happen. We can handle the least-significant digit by unrolling 10 times, so there's no branching to handle it.
Also note that since we want the digits in printing order, carry propagates in the wrong direction anyway, because x86 is little-endian. If there were a good way to take advantage of having our string in the other byte order, we could maybe use BSWAP or MOVBE. (But note that MOVBE r64 is 3 fused-domain uops on Skylake, 2 of them ALU uops; BSWAP r64 is also 2 uops.)
Perhaps we should be doing odd/even counters in parallel, in two halves of an XMM vector register. But that stops working well once the string is shorter than 8B. When storing one number-string at a time, overlapping stores are easy. Still, we could do the carry-propagation stuff in a vector reg and store the two halves separately with MOVQ and MOVHPS. Or since 4/5ths of the numbers from 0 to 5M are 7 digits, it might be worth having code for the special case where we can store a whole 16B vector of two numbers.
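The two-halves store might look something like this (a sketch only, assuming the even counter's string sits in the low half of xmm0 and the odd counter's string in the high half, with rax = length and rdi = output pointer as in the scalar version below):

    movq    [rdi], xmm0           ; store the low 8B: the first number-string
    movhps  [rdi + rax], xmm0     ; store the high 8B at p+len: the second number-string
    lea     edi, [rdi + rax*2]    ; p += 2*len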
A better way to handle shorter strings: SSSE3 PSHUFB to shuffle the two strings to left-packed in a vector register, then a single MOVUPS to store two at once. The shuffle mask only needs to be updated when the string length (number of digits) changes, so the infrequently-executed carry-handling special case code can do that, too.
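A sketch of that store path (hypothetical register assignments: both strings in the halves of xmm0, and a left-packing shuffle control in xmm5, rebuilt by the carry-handling code whenever the length in rax changes):

    pshufb  xmm0, xmm5            ; shift the high string's bytes down so they start right after the low string
    movups  [rdi], xmm0           ; one unaligned 16B store covers both numbers
    lea     edi, [rdi + rax*2]    ; p += 2*len; bytes past 2*len are garbage, but the next store overwrites them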
Vectorization of the hot part of the loop should be very straightforward and cheap, and should just about double performance.
;;; Optimized version: keep the string data in a register and modify it
;;; instead of redoing the whole int->string conversion every time.

section .bss
printbuf: resb 1024*128 + 4096    ; buffer size ~= half the L2 cache on Intel SnB-family.
    ;; Or use a giant buffer that we write() just once. Or maybe vmsplice to give the
    ;; pages away to the kernel, since we only run once.

global main
extern printf
main:
    push    rbx
    ;; use some REX-only regs for values that we're always going to use a REX prefix
    ;; with anyway, for 64-bit operand size.
    mov     rdx, `5000000\n`      ; NASM string constants work like little-endian integers,
                                  ; so AL = '5' = 0x35 and the high byte holds '\n' = 10.
                                  ; (YASM doesn't support back-ticks for C-style backslash escapes.)
    mov     r9, 1<<48             ; a 1 in the 2nd-last byte: the ones digit of the decimal string
    ;xor    r9d, r9d
    ;bts    r9, 48                ; IDK if this code-size optimization outside the loop would help or not.
    mov     eax, 8                ; string length.
    mov     edi, printbuf
.storeloop:
    ;; rdx = "??????9\n": the string for the current counter value, ones digit = 9.
    mov     r8, rdx
    ;; r8 = copy of rdx. We cycle its last digit from 9 down to 0 and store those strings
    ;; into the buffer; rdx is left alone so the bottom of the loop can compute the next
    ;; iteration's start value from it (counter -= 10).
    ;; The string could be any length, always with the first ASCII digit in the low byte;
    ;; our other constants are kept adjusted to match.
    ;; Narrower than 8B just means our stores overlap, and that's fine.
    ;; Starting each iteration from a fresh copy of rdx, instead of continuing from r8 at
    ;; the bottom of the loop, takes the chain of `sub r8, r9` instructions off the
    ;; critical path and gives out-of-order execution more to play with: each iteration's
    ;; subs and stores are a separate dependency chain (except for the store addresses,
    ;; but OOO exec can get ahead on those because we only pointer-increment once per 2 stores).
    mov     [rdi], r8             ; store "xxx9\n"
    sub     r8, r9                ; r8 = "xxx8\n"
    mov     [rdi + rax], r8       ; defer p += len by using a 2-register addressing mode
    sub     r8, r9                ; r8 = "xxx7\n"
    lea     edi, [rdi + rax*2]    ; if we had len*3 in another reg, we could defer this longer
    ;; Our static buffer is guaranteed to be in the low 31 bits of address space, so the LEA
    ;; can safely drop the REX prefix. Normally you shouldn't truncate pointers to 32 bits,
    ;; but you asked for the fastest possible: this won't hurt, and might help on some CPUs,
    ;; especially if there are any decode bottlenecks.
    ;; ... repeat that store/sub block 3 more times (digits 7/6, 5/4, 3/2) ...
    ;; A short inner loop over the 9..0 last digit might be a win on some CPUs (like maybe
    ;; Core2), depending on their loop-buffer capabilities, if the front-end is a
    ;; bottleneck here at all.
    ;; Anyway, for the last pair:
    mov     [rdi], r8             ; r8 = "xxx1\n"
    sub     r8, r9
    mov     [rdi + rax], r8       ; r8 = "xxx0\n"
    lea     edi, [rdi + rax*2]
    ;; Compute next iteration's RDX: subtract 1 from the tens digit, i.e. counter -= 10.
    ;; It's probably a win to interleave some of this into the loop body, but out-of-order
    ;; execution should do a reasonably good job here anyway.
    mov     rcx, r9
    shr     rcx, 8                ; rcx = 1 in the tens-digit byte (maybe hoist this constant out, too)
    sub     rdx, rcx
    ;; Detect carry, i.e. the tens digit wrapping from '0' (0x30) down to 0x2F, by testing
    ;; the low bit of the high nibble in that byte: set for every ASCII digit, clear for 0x2F.
    shl     rcx, 4
    test    rdx, rcx
    jz      .carry_second_digit
    ;; .carry_second_digit is some complicated code to propagate carry as far as it needs
    ;; to go, up to the most-significant digit. When it's done, it re-enters the loop at
    ;; the top, with eax and r9 set appropriately. It only runs once per 100 numbers, so
    ;; it doesn't have to be super-fast.
    ;; Maybe only do buffer-length checks in the carry-handling branch, in which case the
    ;; jz .carry_second_digit can become a jnz .storeloop that falls into the carry code.
    cmp     edi, esi              ; esi = end of the buffer region (setup not shown)
    jbe     .storeloop            ; } while (p <= endp)
    ;; write() system call on the buffer. Probably want a loop around refilling and
    ;; flushing it, instead of sizing the buffer for all 5M integer-strings at once.
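    ;; A hedged sketch of that flush (Linux x86-64; assumes rdi points one past the last
    ;; byte stored, and ignores short writes, which a robust version would loop on):
    mov     edx, edi
    sub     edx, printbuf         ; rdx = length = p - buffer start
    mov     esi, printbuf         ; buf
    mov     edi, 1                ; fd = stdout
    mov     eax, 1                ; __NR_write
    syscall                       ; clobbers rcx and r11, which we're done with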
    pop     rbx
    xor     eax, eax              ; successful exit status.
    ret
This is not fully fleshed-out, but should give an idea of what might work well.
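For example, the carry branch might look something like this (a rough, untested sketch: it relies on rcx still holding the bit-4 test mask from the jz, uses rbx, which is saved at function entry and otherwise unused, as scratch, and only notes the string-shortening and buffer checks in comments):

.carry_second_digit:
    shr     rcx, 4                ; rcx = 1 in the byte that wrapped from '0' down to 0x2F
.propagate:
    lea     rbx, [rcx + rcx*4]    ; rbx = 5 in that byte
    lea     rdx, [rdx + rbx*2]    ; add 10 to it: 0x2F becomes 0x39 = '9'
    shr     rcx, 8                ; the next-more-significant digit is one byte *lower* (little-endian)
    sub     rdx, rcx              ; borrow 1 from it
    mov     rbx, rcx
    shl     rbx, 4                ; bit 4 of that byte: clear iff it also wrapped to 0x2F
    test    rdx, rbx
    jz      .propagate
    ;; TODO: if the leading digit just became '0', shorten the string: shr rdx, 8 and
    ;; shr r9, 8, and dec rax. Also the natural place for buffer-end checks / flushing.
    jmp     .storeloop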
If vectorizing with SSE2, probably use a scalar integer register to keep track of when you need to break out and handle carry: i.e. a down-counter from 10.
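Something like this bookkeeping, perhaps (a sketch with a hypothetical label; the vector build/store work is elided):

    mov     ecx, 10               ; iterations until the tens digit has to change
.vec_store:
    ;; ... build and store the vector of digit-strings here ...
    dec     ecx
    jnz     .vec_store
    ;; once per 10 iterations: fall through to scalar carry handling, then reset ecx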
Even this scalar version probably comes close to sustaining one store per clock, which saturates the store port. They're only 8B stores (and when the string gets shorter, the useful part is shorter still), so we're definitely leaving performance on the table if we don't bottleneck on cache misses. But 8B per clock at 3GHz is 24GB/s, which is about the ~25.6GB/s theoretical peak of dual-channel DDR3-1600: roughly enough to saturate main memory with a single core.
We could parallelize this, breaking the 5M..1 range up into chunks. With some clever math, we can figure out which byte the first character of "2500000\n" should land at, and then each thread could either call write() itself in the correct order, or use that same math to call pwrite(2) independently with different file offsets, so the kernel takes care of all the synchronization for multiple writers to the same file.
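To make that math concrete (just a sanity check of the idea): every line from 5,000,000 down to 1,000,000 is 7 digits plus a newline, exactly 8 bytes, so the line for n in that range starts at byte (5,000,000 - n) * 8; "2500000\n" therefore lands at offset 2,500,000 * 8 = 20,000,000. Below 1,000,000, account for the shorter lines one digit-count range at a time (900,000 six-digit numbers at 7 bytes each, and so on down).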