The call to printf completely dominates the run-time of even that highly inefficient loop. (Did you notice that you push/pop rcx even though you never use it anywhere? Perhaps that's left over from using the slow LOOP instruction).
To learn more about writing efficient x86 asm, see Agner Fog's Optimizing Assembly guide. (And his microarchitecture guide, too, if you want to really get into the details of specific CPUs and how they differ: what's optimal on one uarch might not be on another. e.g. IMUL r64 has much better throughput and latency on Intel CPUs than on AMD, but CMOV and ADC are 2 uops with 2-cycle latency on Intel pre-Broadwell, vs. 1 uop on AMD, since 3-input ALU m-ops (FLAGS + both registers) aren't a problem for AMD.) Also see other links in the x86 tag wiki.
Purely optimizing the loop without changing the 5M calls to printf is useful only as an example of how to write a loop properly, not to actually speed up this code. But let's start with that:
; trivial fixes to loop efficiently while calling the same slow function
global main
extern printf

main:
    push    rbx
    mov     ebx, 5000000          ; don't waste a REX prefix on constants that fit in 32 bits

.print:
    ;; removed the push/pops from inside the loop:
    ;; use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
    mov     edi, format           ; static data / code always has a 32-bit address
    mov     esi, ebx
    xor     eax, eax              ; the x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
    call    printf
    dec     ebx
    jnz     .print

    pop     rbx                   ; restore rbx, the one call-preserved reg we actually used
    xor     eax, eax              ; successful exit status
    ret

section .rodata                   ; it's usually best to put constant data in a separate section of the text segment, not right next to code
format: db "%ld", 10, 0
To speed this up, we should take advantage of the redundancy in converting consecutive integers to strings. Since "5000000\n" is only 8 bytes long (including the newline), the string representation fits in a 64-bit register.
We can store that string into a buffer and increment a pointer by the string length. (Since it will get shorter for smaller numbers, just keep the current string length in a register, which you can update in the special-case branch where it changes.)
We can decrement the string representation in-place to avoid ever (re)doing the process of dividing by 10 to turn an integer into a decimal string.
Since carry/borrow doesn't naturally propagate inside a register, and the AAS instruction isn't available in 64-bit mode (and only worked on AX, not even EAX, and is slow), we have to do it ourselves. We're decrementing by 1 every time, so we know what's going to happen. We can handle the least-significant digit by unrolling 10 times, so there's no branching to handle it.
Also note that since we want the digits in printing order, carry propagates in the wrong direction anyway, because x86 is little-endian. If there were a good way to take advantage of having our string in the other byte order, we could maybe use BSWAP or MOVBE. (But note that MOVBE r64 is 3 fused-domain uops on Skylake, 2 of them ALU uops; BSWAP r64 is also 2 uops.)
Perhaps we should be doing odd/even counters in parallel, in two halves of an XMM vector register. But that stops working well once the string is shorter than 8B. When storing one number-string at a time, overlapping stores are easy. Still, we could do the carry-propagation stuff in a vector reg and store the two halves separately with MOVQ and MOVHPS. Or since 4/5ths of the numbers from 0 to 5M are 7 digits, it might be worth having code for the special case where we can store a whole 16B vector of two numbers.
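The two-halves store might look something like this (a sketch only, assuming the even counter's string sits in the low half of xmm0 and the odd counter's string in the high half, with rax = length and rdi = output pointer as in the scalar version below):

    movq    [rdi], xmm0           ; store the low 8B: the first number-string
    movhps  [rdi + rax], xmm0     ; store the high 8B at p+len: the second number-string
    lea     edi, [rdi + rax*2]    ; p += 2*len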
A better way to handle shorter strings: SSSE3 PSHUFB to shuffle the two strings to left-packed in a vector register, then a single MOVUPS to store two at once. The shuffle mask only needs to be updated when the string length (number of digits) changes, so the infrequently-executed carry-handling special case code can do that, too.
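A sketch of that store path (hypothetical register assignments: both strings in the halves of xmm0, and a left-packing shuffle control in xmm5, rebuilt by the carry-handling code whenever the length in rax changes):

    pshufb  xmm0, xmm5            ; shift the high string's bytes down so they start right after the low string
    movups  [rdi], xmm0           ; one unaligned 16B store covers both numbers
    lea     edi, [rdi + rax*2]    ; p += 2*len; bytes past 2*len are garbage, but the next store overwrites them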
Vectorization of the hot part of the loop should be very straightforward and cheap, and should just about double performance.
;;; Optimized version: keep the string data in a register and modify it
;;; instead of redoing the whole int->string conversion every time.

section .bss
printbuf: resb 1024*128 + 4096    ; buffer size ~= half the L2 cache on Intel SnB-family.
    ;; Or use a giant buffer that we write() just once. Or maybe vmsplice to give the
    ;; pages away to the kernel, since we only run once.

global main
extern printf
main:
    push    rbx
    ;; use some REX-only regs for values that we're always going to use a REX prefix
    ;; with anyway, for 64-bit operand size.
    mov     rdx, `5000000\n`      ; NASM string constants work like little-endian integers,
                                  ; so AL = '5' = 0x35 and the high byte holds '\n' = 10.
                                  ; (YASM doesn't support back-ticks for C-style backslash escapes.)
    mov     r9, 1<<48             ; a 1 in the 2nd-last byte: the ones digit of the decimal string
    ;xor    r9d, r9d
    ;bts    r9, 48                ; IDK if this code-size optimization outside the loop would help or not.
    mov     eax, 8                ; string length.
    mov     edi, printbuf
.storeloop:
    ;; rdx = "??????9\n": the string for the current counter value, ones digit = 9.
    mov     r8, rdx
    ;; r8 = copy of rdx. We cycle its last digit from 9 down to 0 and store those strings
    ;; into the buffer; rdx is left alone so the bottom of the loop can compute the next
    ;; iteration's start value from it (counter -= 10).
    ;; The string could be any length, always with the first ASCII digit in the low byte;
    ;; our other constants are kept adjusted to match.
    ;; Narrower than 8B just means our stores overlap, and that's fine.
    ;; Starting each iteration from a fresh copy of rdx, instead of continuing from r8 at
    ;; the bottom of the loop, takes the chain of `sub r8, r9` instructions off the
    ;; critical path and gives out-of-order execution more to play with: each iteration's
    ;; subs and stores are a separate dependency chain (except for the store addresses,
    ;; but OOO exec can get ahead on those because we only pointer-increment once per 2 stores).
    mov     [rdi], r8             ; store "xxx9\n"
    sub     r8, r9                ; r8 = "xxx8\n"
    mov     [rdi + rax], r8       ; defer p += len by using a 2-register addressing mode
    sub     r8, r9                ; r8 = "xxx7\n"
    lea     edi, [rdi + rax*2]    ; if we had len*3 in another reg, we could defer this longer
    ;; Our static buffer is guaranteed to be in the low 31 bits of address space, so the LEA
    ;; can safely drop the REX prefix. Normally you shouldn't truncate pointers to 32 bits,
    ;; but you asked for the fastest possible: this won't hurt, and might help on some CPUs,
    ;; especially if there are any decode bottlenecks.
    ;; ... repeat that store/sub block 3 more times (digits 7/6, 5/4, 3/2) ...
    ;; A short inner loop over the 9..0 last digit might be a win on some CPUs (like maybe
    ;; Core2), depending on their loop-buffer capabilities, if the front-end is a
    ;; bottleneck here at all.
    ;; Anyway, for the last pair:
    mov     [rdi], r8             ; r8 = "xxx1\n"
    sub     r8, r9
    mov     [rdi + rax], r8       ; r8 = "xxx0\n"
    lea     edi, [rdi + rax*2]
    ;; Compute next iteration's RDX: subtract 1 from the tens digit, i.e. counter -= 10.
    ;; It's probably a win to interleave some of this into the loop body, but out-of-order
    ;; execution should do a reasonably good job here anyway.
    mov     rcx, r9
    shr     rcx, 8                ; rcx = 1 in the tens-digit byte (maybe hoist this constant out, too)
    sub     rdx, rcx
    ;; Detect carry, i.e. the tens digit wrapping from '0' (0x30) down to 0x2F, by testing
    ;; the low bit of the high nibble in that byte: set for every ASCII digit, clear for 0x2F.
    shl     rcx, 4
    test    rdx, rcx
    jz      .carry_second_digit
    ;; .carry_second_digit is some complicated code to propagate carry as far as it needs
    ;; to go, up to the most-significant digit. When it's done, it re-enters the loop at
    ;; the top, with eax and r9 set appropriately. It only runs once per 100 numbers, so
    ;; it doesn't have to be super-fast.
    ;; Maybe only do buffer-length checks in the carry-handling branch, in which case the
    ;; jz .carry_second_digit can become a jnz .storeloop that falls into the carry code.
    cmp     edi, esi              ; esi = end of the buffer region (setup not shown)
    jbe     .storeloop            ; } while (p <= endp)
    ;; write() system call on the buffer. Probably want a loop around refilling and
    ;; flushing it, instead of sizing the buffer for all 5M integer-strings at once.
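    ;; A hedged sketch of that flush (Linux x86-64; assumes rdi points one past the last
    ;; byte stored, and ignores short writes, which a robust version would loop on):
    mov     edx, edi
    sub     edx, printbuf         ; rdx = length = p - buffer start
    mov     esi, printbuf         ; buf
    mov     edi, 1                ; fd = stdout
    mov     eax, 1                ; __NR_write
    syscall                       ; clobbers rcx and r11, which we're done with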
    pop     rbx
    xor     eax, eax              ; successful exit status.
    ret
This is not fully fleshed-out, but should give an idea of what might work well.
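For example, the carry branch might look something like this (a rough, untested sketch: it relies on rcx still holding the bit-4 test mask from the jz, uses rbx, which is saved at function entry and otherwise unused, as scratch, and only notes the string-shortening and buffer checks in comments):

.carry_second_digit:
    shr     rcx, 4                ; rcx = 1 in the byte that wrapped from '0' down to 0x2F
.propagate:
    lea     rbx, [rcx + rcx*4]    ; rbx = 5 in that byte
    lea     rdx, [rdx + rbx*2]    ; add 10 to it: 0x2F becomes 0x39 = '9'
    shr     rcx, 8                ; the next-more-significant digit is one byte *lower* (little-endian)
    sub     rdx, rcx              ; borrow 1 from it
    mov     rbx, rcx
    shl     rbx, 4                ; bit 4 of that byte: clear iff it also wrapped to 0x2F
    test    rdx, rbx
    jz      .propagate
    ;; TODO: if the leading digit just became '0', shorten the string: shr rdx, 8 and
    ;; shr r9, 8, and dec rax. Also the natural place for buffer-end checks / flushing.
    jmp     .storeloop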
If vectorizing with SSE2, probably use a scalar integer register to keep track of when you need to break out and handle carry: i.e. a down-counter from 10.
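Something like this bookkeeping, perhaps (a sketch with a hypothetical label; the vector build/store work is elided):

    mov     ecx, 10               ; iterations until the tens digit has to change
.vec_store:
    ;; ... build and store the vector of digit-strings here ...
    dec     ecx
    jnz     .vec_store
    ;; once per 10 iterations: fall through to scalar carry handling, then reset ecx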
Even this scalar version probably comes close to sustaining one store per clock, which saturates the store port. They're only 8B stores (and when the string gets shorter, the useful part is shorter still), so we're definitely leaving performance on the table if we don't bottleneck on cache misses. But 8B per clock at 3GHz is 24GB/s, which is about the ~25.6GB/s theoretical peak of dual-channel DDR3-1600: roughly enough to saturate main memory with a single core.
We could parallelize this, breaking the 5M..1 range up into chunks. With some clever math, we can figure out which byte the first character of "2500000\n" should land at, and then each thread could either call write() itself in the correct order, or use that same math to call pwrite(2) independently with different file offsets, so the kernel takes care of all the synchronization for multiple writers to the same file.
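To make that math concrete (just a sanity check of the idea): every line from 5,000,000 down to 1,000,000 is 7 digits plus a newline, exactly 8 bytes, so the line for n in that range starts at byte (5,000,000 - n) * 8; "2500000\n" therefore lands at offset 2,500,000 * 8 = 20,000,000. Below 1,000,000, account for the shorter lines one digit-count range at a time (900,000 six-digit numbers at 7 bytes each, and so on down).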