I wrote a printint function in 64-bit NASM that prints an integer to STDOUT. It's really slow, though, and after doing some benchmarks I determined that converting the integer to a string is by far the slowest part.

My current strategy for converting ints to strings goes like this:

  1. If the number is 0, skip to a special case.
  2. If the number is negative, print a negative sign, turn it into a positive integer, and continue.
  3. Create a 10-byte buffer (enough to hold any 32-bit integer) and a pointer to the back of the buffer.
  4. Check if the number is 0; if it is, we're done.
  5. Divide the number by 10 and convert the remainder into ASCII.
  6. Put the remainder into the buffer (filling it from back to front).
  7. Decrement the buffer pointer.
  8. Loop back to step 4.

I've tried Googling how other people do it, and the approaches are more or less the same as mine: divide by 10 until the number is 0.

Here's the relevant code:

printint:                       ; num in edi
    push rbp                    ; save base pointer
    mov rbp, rsp                ; set up new base pointer
    sub rsp, 20                 ; reserve 20 bytes of stack for the buffer
    cmp edi, 0                  ; compare num to 0
    je _printint_zero           ; 0 is special case
    cmp edi, 0
    jg _printint_pos            ; don't print negative sign if positive

    ; print a negative sign (code not relevant)
    xor edi, -1                 ; flip all bits...
    add edi, 1                  ; ...then add 1: two's complement negation
_printint_pos:
    mov rbx, rsp                ; set rbx to point to the end of the buffer
    add rbx, 17
    mov qword [rsp+8], 0        ; clear the buffer
    mov word [rsp+16], 0        ; 10 bytes from [8,18)
_printint_loop:
    cmp edi, 0                  ; compare edi to 0
    je _printint_done           ; if edi == 0 then we are done
    xor edx, edx                ; prepare eax and edx for division
    mov eax, edi
    mov ecx, 10
    div ecx                     ; divide by 10: quotient in eax, remainder in edx
    mov edi, eax                ; move quotient back to edi
    add dl, 48                  ; convert remainder to ascii
    mov byte [rbx], dl          ; move remainder to buffer
    dec rbx                     ; shift 1 position to the left in buffer
    jmp _printint_loop
_printint_done:
    ; print the buffer (code not relevant)
    mov rsp, rbp                ; restore stack and base pointers
    pop rbp
    ret

How can I optimize it so that it can run much faster? Alternatively, is there a significantly better method to convert an integer to a string?

I do not want to use printf or any other function in the C standard library.

  • This site has [a vast number of questions about decimal conversion in assembly](https://stackoverflow.com/search?q=decimal+%5Bassembly%5D). You might in particular like to look at https://stackoverflow.com/questions/13166064/how-do-i-print-an-integer-in-assembly-level-programming-without-printf-from-the and https://codereview.stackexchange.com/questions/142842/integer-to-ascii-algorithm-x86-assembly – Nate Eldredge Feb 19 '21 at 00:32
  • Probably the biggest change is to avoid the expensive divide by using ["magic number" reciprocal multiplication](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi) instead. – Nate Eldredge Feb 19 '21 at 00:35
  • @Nate Eldredge Thanks for the links, the magic number looks very promising. I actually stumbled across the first linked post, but the tricky bit was implementing divide by 10 quickly... I will try the magic number and see how much performance gain I can get. – jdabtieu Feb 19 '21 at 01:09
  • My answer on [How do I print an integer in Assembly Level Programming without printf from the c library?](https://stackoverflow.com/a/46301894) (Nate's first link) actually does mention performance and the multiplicative inverse trick, and links [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935). – Peter Cordes Feb 19 '21 at 02:15
  • That answer of mine only handles the unsigned part of the problem, but avoids some of the loop inefficiencies you have. ([Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926)). Also note that `neg edi` is more efficient than `xor edi, -1` / `add edi,1`. Just use `neg` to subtract from 0 like a normal person; 2's complement identities are not something you normally want to actually use, except to optimize `-x - 1` into `not edi` or whatever. (https://godbolt.org/z/x6r1aE) Not to use *more* instructions to do the same work :P – Peter Cordes Feb 19 '21 at 02:21
  • Although a masked xor/sub can let you do a branchless `abs`, with either `0` or `-1` from `x>>31`. https://godbolt.org/z/6eqEqa – Peter Cordes Feb 19 '21 at 02:23
  • Instead of dividing by 10, just divide by a larger base like 100 or 1000 and use a big lookup table. – phuclv Feb 19 '21 at 02:38
  • @phuclv: Have you tried that? I've wondered whether it was worth dividing by 100 and breaking up that remainder with another multiply magic-constant, creating more ILP. (Especially for 64-bit integers on older / low-power CPUs with slower `mul r64`. Or especially on ARM where `mls` multiply-subtract is available to make the remainder part more efficient. Or using a more efficient inverse with less shifting that's known to be accurate for 0..99 but not the full range, so just an `imul r32, r32, imm` and maybe one shift.) I hadn't considered a table lookup, but could work if doing a lot. – Peter Cordes Feb 19 '21 at 02:51
  • *that prints an integer to STDOUT* - Are you *sure* the system call itself isn't dominating your run times? Especially if you `write` for each number separately instead of using `fwrite` or manually building up a large buffer. `div` is slow but not *that* slow, compared to I/O, unless you're already using buffering. Also BTW, `rbx` is call-preserved; pick a different register for your pointer or save/restore `rbx`. – Peter Cordes Feb 19 '21 at 02:54
  • @PeterCordes I haven't tried it, but many people have done that: https://tia.mat.br/posts/2014/06/23/integer_to_string_conversion.html https://pvk.ca/Blog/2017/12/22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/ – phuclv Feb 19 '21 at 03:02
  • @PeterCordes It's roughly 50/50 between the integer conversion and the syscall. I will use a buffer but haven't gotten around to actually implementing it yet, the purpose of this post was to determine faster ways to convert an integer to a string. I'm able to save a few clock cycles per loop iteration using the magic number, and with the loop/misc optimizations I think I'll get close to my performance target. – jdabtieu Feb 19 '21 at 15:18
  • @phuclv Based on the pvk site, it doesn't look like using a larger base would help much compared to the naive approach with a magic number. – jdabtieu Feb 19 '21 at 15:18
  • @jdabtieu It's because those functions were written in C without hand-optimized assembly or vectorization. I believe it'll be much faster after optimization. – phuclv Feb 19 '21 at 15:38
  • @phuclv Ah, that makes sense. I'll definitely look into it then. – jdabtieu Feb 19 '21 at 17:06
  • Compilers always use a multiplicative inverse for compile-time-constant divisors because that's the fastest way to implement `x /= 10`. Optimizing compilers can make loops similar to the hand-written asm in the linked answers if you give them source that uses a pointer decrement (like the simple C in [How do I print an integer in Assembly Level Programming without printf from the c library?](https://stackoverflow.com/a/46301894)). I think @phuclv is being overly pessimistic about what compilers can do, although there may be some further tricks that compilers won't find. – Peter Cordes Feb 19 '21 at 21:40
  • Also note that making a system call serializes out-of-order execution, so buffering your output will probably help more than you expect and let the CPU do what it's good at: finding instruction-level parallelism across iterations. (Although some of that relies on correct branch prediction.) – Peter Cordes Feb 19 '21 at 21:42
  • @PeterCordes Yep, I changed my readint function (not featured in this post) to use a 1024-byte buffer instead of reading one character at a time and observed a 7-8x speedup. My goal for this entire project is to create functions to read and write integers on stdin/stdout, matching or beating the performance of scanf and printf. Given that buffering sped up reading by a lot, I'd expect it to speed up writing significantly too. – jdabtieu Feb 19 '21 at 22:21
  • Thanks everyone for the suggestions. Even though the syscalls turned out to be the problem, I appreciate the feedback a lot, and the suggestions did end up being useful. – jdabtieu Feb 20 '21 at 00:45

1 Answer

Turns out I was wrong about the source of the bottleneck. My benchmark was flawed. Although micro-optimizations such as magic number multiplication and better loops did help, the biggest bottleneck was the syscalls.

By using buffered reading & writing (buffer size of 16 kB), I was able to achieve my goal of reading and printing integers faster than scanf and printf.

Creating an output buffer sped up one particular benchmark by over 4x, whereas the micro-optimizations sped it up by about 25%.

For anyone who stumbles across this post in the future, here are the optimizations I made:

  1. Replaced all sys_write calls with calls to a write_buf function, which writes the output to a buffer and only flushes the buffer to STDOUT when it fills up. The implementation of such a write_buf function is left as an exercise to the reader.
  2. Replaced the division with magic number multiplication, which saves a few clock cycles per loop iteration (see the sketch after this list).
  3. Changed while loops into do...while loops, saving one jmp instruction per loop iteration (this adds up over many iterations).
  4. Optimized individual instructions (e.g. using neg instead of xor + add) and removed redundant instructions.
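
For illustration, here's a rough sketch of what points 2-4 can look like combined. This is not my exact code: the label and the use of rsi as the buffer pointer are arbitrary choices for this example, and it assumes the zero special case has already been handled, so the loop body always runs at least once. The constant 0xCCCCCCCD with a total right shift of 35 is the standard multiplicative inverse for unsigned division by 10 on 32-bit values:

_printint_loop:                 ; do...while: the only branch is at the bottom
    mov eax, edi
    mov edx, 0xCCCCCCCD         ; magic number = ceil(2^35 / 10)
    mul edx                     ; edx = high 32 bits of (num * magic)
    shr edx, 3                  ; edx = num / 10 (total shift of 35)
    lea eax, [edx + edx*4]      ; eax = quotient * 5
    add eax, eax                ; eax = quotient * 10
    sub edi, eax                ; edi = num - quotient*10 = remainder
    add edi, '0'                ; convert remainder to ASCII
    mov byte [rsi], dil         ; store digit, filling the buffer back to front
    dec rsi
    mov edi, edx                ; quotient becomes the new num
    test edi, edi
    jnz _printint_loop          ; loop while the quotient is nonzero

Compared to the original loop, this replaces a slow div with a mul plus a few cheap ALU instructions to recover the remainder, and it drops both the top-of-loop compare and the unconditional jmp.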

Another potential improvement I could make (but didn't) is dividing by a larger base and using a lookup table, as mentioned by phuclv in the comments.
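
For anyone curious what that might look like, here is a purely illustrative sketch (digit_pairs is a name I made up for this example, and a real version must also avoid emitting a leading zero when the most significant group is a single digit, which this sketch does not handle). The idea is to peel off two digits per iteration and copy a two-character pair from a 200-byte table:

section .rodata
digit_pairs:                    ; the 200 bytes "00", "01", ..., "99"
%assign n 0
%rep 100
    db '0' + n / 10, '0' + n % 10
%assign n n + 1
%endrep

section .text
                                ; edi = nonzero number, rsi = last byte of buffer
_printint_pair_loop:
    xor edx, edx
    mov eax, edi
    mov ecx, 100
    div ecx                     ; eax = num / 100, edx = num % 100
    mov edi, eax                ; (a magic-number multiply works here too)
    movzx eax, word [digit_pairs + rdx*2]
    mov [rsi-1], ax             ; little-endian: tens digit at rsi-1, ones at rsi
    sub rsi, 2
    test edi, edi
    jnz _printint_pair_loop

This halves the number of divisions (or inverse multiplications) and stores per number; the articles phuclv linked in the comments benchmark variations on this theme.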

  • Yup, those numbers are closer to what I was expecting; there's much more to gain from minimizing system calls (which take hundreds of cycles, or thousands or more with Meltdown + Spectre mitigations enabled)! Note that older CPUs had even slower `div`, so avoiding it may have helped even more in the past. And/or some of the cost of the loop is in branch mispredicts from numbers with more or fewer digits, not just the div or mul dep chain itself, limiting the possible speedup. – Peter Cordes Feb 20 '21 at 02:32