6

I'm trying to optimize the following subroutine for a specific Kaby Lake CPU (i5-7300HQ), ideally to make the code at least 10 times faster compared to its original form. The code runs as a floppy-style bootloader in 16-bit real mode. It displays a ten-digit decimal counter on screen, counting from 0 to 9999999999 and then halting.

I have taken a look at Agner Fog's optimization guides for microarchitecture and assembly, his instruction tables, and Intel's Optimization Reference Manual.

The only sensible optimization I've been able to do so far is swapping the loop instruction for dec + jnz (explanation here).

Another possible optimization might be replacing lodsb with mov + dec, but the information I've found about that has been conflicting, with some saying it helps slightly and others that it might actually hurt performance on modern CPUs.
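
For reference, the mov + dec replacement I mean would look something like this (DF is set with std, so SI has to move downward):

    mov al,[si]         ;same load as lodsb
    dec si              ;SI steps backwards because DF is set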

I also tried switching to 32-bit mode and keeping the entire counter in an unused register pair to eliminate memory access, but after reading into it a bit I realized that those ten bytes will get cached immediately, and the difference in latency between L1 cache and registers is only about a factor of three, so it's definitely not worth the added overhead of working with the counter in that format.

(editor's note: add reg latency is 1 cycle, add [mem] latency is about 6 cycles, including the 5 cycle store-forwarding latency. Or much worse if [mem] is uncacheable like video RAM.)

org 7c00h

pos equ 2*(2*80-2)  ;address on screen

;init
cli
mov ax,3            ;set 80x25 text mode
int 10h
mov ax,0b800h
mov es,ax           ;ES -> text-mode video memory
jmp 0:start         ;far jump to set CS = 0

start:
    push cs
    pop ds
    std

    mov ah, 4Eh         ;colour attribute (yellow on red)
    xor cx, cx
    mov bl,'9'          ;for comparing digits against '9'

countloop:
    mov cl,10           ;number of digits to add to
    mov si,counter+9    ;start of counter
    mov di,pos          ;screen position

    stc                 ;set carry for first adc
next_digit:
    lodsb               ;load digit
    adc al,0
    cmp bl, al
    jnc print
    add al,-10          ;propagate carry if resulting digit > 9
print:
    mov [si+1],al       ;save new digit
    stosw               ;print

    ;replaced loop with a faster equivalent
    ;loop next_digit
    dec cl
    jnz next_digit

    jnc countloop

    jmp $

counter:
    times 10 db '0'

    times 510-($-$$) db 0
    dw 0aa55h

My question is: what can I do to achieve the desired increase in speed? What other materials can I study to gain a better understanding of the underlying concepts?

Note: this is a school assignment. While a straight answer would definitely help, I'd much more appreciate explanations or pointers to relevant study material, as we have been given none.

EDIT: Changed code to a minimal reproducible example

Dex
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/212707/discussion-on-question-by-eldan-optimizing-an-incrementing-ascii-decimal-counter). – Samuel Liew Apr 28 '20 at 14:34
  • I think it would be better to post your answer as an *answer*, not as part of the question. This change to the question maybe sort of turns it into a code-review request. (But there's a separate site for that: https://codereview.stackexchange.com/) – Peter Cordes May 01 '20 at 10:48
  • I hadn't noticed you were going `cli` before. That could be why setting VRAM to WC never flushed the buffer: no interrupts, not even keyboard. (The `iret` in any interrupt return is serializing). – Peter Cordes May 01 '20 at 10:55

4 Answers

3

Here's my take on it. The following optimisations have been applied:

  • the least significant digit has been unrolled completely for best performance
  • the remaining digits have been unrolled to one section per digit
  • BCD arithmetic has been used to reduce the code to one conditional branch per digit
  • segment usage has been shuffled around to reduce the number of prefixes used
  • instruction order has been optimised to move long-latency instructions out of the critical path

Additionally, I have altered the code to be a COM binary for easier testing. Turning it back into a boot loader is left as an exercise for the reader. One thing you can do once it's a boot loader is to fix the code such that CS and SS have a segment base of 0000; this avoids a penalty on loads and stores on some microarchitectures.
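
For example, a boot-sector version could establish zero segment bases with something like this (a sketch only, not part of the COM program below):

        jmp     0:start                 ; far jump so CS has segment base 0000
start:
        xor     ax, ax
        mov     ds, ax
        mov     ss, ax                  ; SS base 0000 as well
        mov     sp, 7c00h               ; stack just below the boot sector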

        org     100h

pos     equ     2*(2*80-12)             ; address on screen

        mov     ax, 3                   ; set up video mode
        int     10h
        mov     ax, 0b800h
        mov     ds, ax
        mov     es, ax

        mov     di, pos
        mov     ax, 4e30h               ; '0' + attribute byte 4e
        mov     cx, 10
        cld
        rep     stosw                   ; set up initial display

        xor     ax, ax
        sub     sp, 10
        push    ax
        push    ax
        push    ax
        push    ax
        push    ax
        mov     bp, sp                  ; set up counter

        dec     di
        dec     di                      ; di points to the last digit on screen
        mov     bx, digits              ; translation table

        jmp     countloop

%macro  docarry 1                       ; digits other than the last one
        mov     al, [bp+%1]             ; load digit %1 of the counter
        inc     ax                      ; add carry to al
        aaa                             ; generate BCD carry
        mov     [bp+%1], al             ; deposit to counter
        cs xlat                         ; generate ASCII digit
        mov     [di-2*9+2*%1], al       ; display digit
        jnc     countloop               ; exit when carry dies
%endm

docarry2:                               ; place this here so jumps are in range
        docarry 2
        docarry 1
        docarry 0
        int     20h                     ; counter rolled over past the last digit: exit

        align   16                      ; for performance
countloop:
        mov     [di], byte '0'          ; treat last digit separately
        mov     [di], byte '1'
        mov     [di], byte '2'
        mov     [di], byte '3'
        mov     [di], byte '4'
        mov     [di], byte '5'
        mov     [di], byte '6'
        mov     [di], byte '7'
        mov     [di], byte '8'
        mov     [di], byte '9'

        docarry 8
        docarry 7
        docarry 6
        docarry 5
        docarry 4
        docarry 3
        jmp     docarry2

digits:
        db      '0123456789'

This increases the speed by a factor of about 30 compared to the original code on my 8 MHz 80286 based machine and manages to increment the counter about 329000 times per second (about 3.04 µs per increment). It's going to be a bit hard to test on a modern system, but I'll try to find a solution.

fuz
  • A LUT for `digits` might be good on 286, but it's definitely worse for base 10 on a Skylake. For testing on a modern system, I was thinking of running it in 32-bit mode with `movnti` to simulate writes to WC video RAM. That might allow write-combining so digits never show on screen, but with a video refresh rate of 60Hz you can't really tell the difference. – Peter Cordes Apr 27 '20 at 22:06
  • If you have a VM, that can let 16-bit code execute natively, but the stores to video RAM will be to a virtualized video card. So that probably doesn't help. – Peter Cordes Apr 27 '20 at 22:11
  • @PeterCordes The LUT is used to avoid trashing the flags. An addition plus extra comparison could possibly be used on modern targets, but I suppose the limiting factor is the time it takes to write to video memory. Since that write goes over the PCIe bus, it's going to be serialised anyway, so a little extra latency should not make a difference. I wonder if it would help to combine pairs or quartets of writes to reduce the number of bus transactions though. – fuz Apr 27 '20 at 23:41
  • Also, none of the display writes dependent on `xlat` are on the critical path, so it shouldn't really make a difference in the overall latency anyway. – fuz Apr 27 '20 at 23:43
  • Could maybe use LEA if you used BX, but ok. In [a comment on the question](https://stackoverflow.com/questions/61460126/optimizing-an-incrementing-ascii-decimal-counter-in-video-ram-on-7th-gen-intel-c/61463306#comment108738607_61460126) I posted a Linux port of the OP's code that updates a buffer in BSS using `movnti` to simulate VRAM. It never writes a full line so it doesn't flush the WC buffer, and runs at ~2.6 IPC on Skylake. (Or if I use `stosw` instead of movnti, we get self-modifying code pipeline nukes. But the movnti was to different memory...) – Peter Cordes Apr 27 '20 at 23:52
3

If a counter ticks in the forest, does anyone see it?

our requirements state that every single change of a number has to be visible on screen

Your screen's refresh rate is probably 60Hz, maybe as high as 144Hz. Changing video RAM any faster than that will leave some counts unread by the hardware scan out loop over the framebuffer1, never sent to a physical screen, and never turning into a pattern of photons of visible light that a high-speed camera could record.

Footnote 1: Or the virtual equivalent if VGA text mode is emulated somehow on top of hardware that only knows how to draw pixels. Asked Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)? as a followup.

If we don't accept this limit of 1 increment per 16.66.. ms (60 Hz), we need to decide what we're willing to bottleneck on vs. what we can sidestep.

Certainly we need to do the actual work of having the ASCII digits calculated, not just incrementing a binary counter and formatting it into a string occasionally in a timer or vertical blanking interrupt (once per screen refresh). That wouldn't satisfy the spirit of the assignment.

Or what if we calculate the ASCII digits purely in registers and only do mov stores in a timer or vblank interrupt? That would sample the fast-incrementing counter asynchronously from its increments so you'd visually see all the low digits changing. (Which is a pretty clear minimum requirement).

Omitting stores from the actual loop still doesn't feel like it hits the spirit of the assignment. I think our loop should, if run on its own with no fancy hardware setup, truly get every count all the way to video RAM. That seems uncontroversial. That's what the original code does.

The CPU can be configured to do write-combining with MTRRs. Some desktops had a BIOS option to set the AGP GART as UC (UnCacheable) vs. WC (calling it "USWC = Uncacheable Speculative Write Combining"). This BIOS-tuning article has a section on it. It seems modern firmware leaves VGA memory UC, letting OSes / graphics drivers set up MTRRs / PAT.

Unfortunately, making VGA memory WC works too well and the stores never make it out of the CPU core's write-combining buffer. (An LFB since this is an Intel CPU.) We can flush manually after every store with a memory barrier like mfence, or clflushopt with the address of the cache line. But then we're back where we started because on the OP's Kaby Lake iGPU / firmware, it seems that flushing a WC store costs about the same as just doing a UC store.

Of course we only have to flush when the whole counter is in sync, after updating all the digits if a carry rippled far. If we were storing each digit separately, this could speed us up by 11.111% vs. UC memory (the higher digits only add 1/10 + 1/100 + ... ≈ 1/9 extra stores on top of the one store per count). Or if we were doing dword stores of 2 digits at once, by 1.0101%, because we only need an extra store every 100 counts, not every 10.

I think we can capture the spirit of the assignment while still letting the hardware optimize our stores by using a WC framebuffer and flushing in a timer or vblank interrupt.

This means we're incrementing a counter very fast (nearly 1 count per core clock cycle with a careful implementation). And we sample that counter by merely using a memory barrier or serializing instruction in an interrupt handler that runs right before the video hardware starts a new pass at the top left of the screen, scanning out a new frame. In fact iret is serializing so merely returning from an empty interrupt handler will do the job. Holding down a key on the keyboard may even make the counter updates visible on screen (where they weren't otherwise) if you used the MTRR to make video RAM WC but didn't program a timer or vertical-blanking interrupt to fire periodically.
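
As a rough real-mode sketch of that idea (untested; it assumes the BIOS's default PIT setup is still active so IRQ0 fires about 18.2 times per second, and hooks the user tick vector INT 1Ch so the BIOS INT 08h handler still sends the EOI):

    xor  ax, ax
    mov  ds, ax                 ; the IVT lives at segment 0
    cli
    mov  word [1Ch*4], tick     ; hook INT 1Ch: offset,
    mov  [1Ch*4+2], cs          ; then segment
    sti                         ; leave interrupts enabled while counting

tick:
    iret                        ; the iret itself is serializing, flushing pending WC stores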

Using clflush or mfence from an outer level of the loop wouldn't work well; that would be synchronous with the increments and would thus leave the low digits always zero. It would make the fact that we only sometimes flush explicit in the loop, instead of leaving the flushing as something that happens because of interrupts that are part of normal system operation. (Or at least they would be if this bootloader wasn't literally the only thing running. e.g. if run under DOS you'd have a timer interrupt every few ms.)


If we insist on flushing to video RAM every count (either by leaving it UC, or manually with WC + explicit flushes in the loop), the only optimization that would matter is reducing the number of stores to video RAM. i.e. by not updating digits that aren't changing. The original code stores every digit every time, so fixing that should give very close to a 10x speedup.

Even just storing to uncacheable DRAM or making a PCIe transaction is much slower than anything you could optimize inside the loop, even a self-modifying-code machine clear. And if storing to a VGA text framebuffer triggers a System Management Mode Interrupt (SMI) to emulate text mode by updating a real pixel framebuffer, the cost of a store to the framebuffer is astronomical compared to anything else you could do in the loop. This might well be how firmware for our Skylake / Kaby Lake integrated GPUs works: Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?

Allowing the hardware to do write-combining on our stores to VRAM is thus essential to making this optimization problem interesting beyond that one algorithmic tweak.

To do this, program the MTRR for the VGA framebuffer. https://wiki.osdev.org/MTRR documents the actual MSRs you can use with the wrmsr instruction. I think each MSR has a bit-field of 8 regions. The one you want is IA32_MTRR_FIX16K_A0000, in MSR[259] - 8 regions of 16 KB each (128 KB total) which include the linear address block B8000 that holds VGA text-mode memory. Figure 11-8 in Intel's SDM vol 3 documents the layout.
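
A minimal sketch of that, assuming the fixed-range MTRRs are already enabled via the FE and E bits in IA32_MTRR_DEF_TYPE (MSR 2FFh); Intel's full recommended sequence for changing memory types (disable caching, wbinvd, temporarily clear the E flag, etc.) is abbreviated to a single wbinvd here:

    mov  ecx, 0x259             ; IA32_MTRR_FIX16K_A0000
    rdmsr                       ; EDX:EAX = eight type bytes covering A0000-BFFFF
    and  edx, 0xFF00FFFF        ; byte 6 of the MSR = B8000-BBFFF (bits 48-55)
    or   edx, 0x00010000        ; memory type 01h = WC
    wrmsr
    wbinvd                      ; flush caches after changing a memory type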


Assuming WC video memory (or for updating WB cacheable memory)

There are lots of things to improve, but two critical things:

  • Micro-architectural: Self-modifying code pipeline nukes, aka machine clears, from count[] being in the same 64B cache line as your main loop (~50x performance with no other changes). Without changing this, it's hard to see any gains from any other micro-optimizations. (A minimal alignment sketch follows this list.)

  • Algorithmic: Don't blindly propagate carry all the way up through every digit every time: 90% of the increments don't carry at all, 99% carry only 1 place, etc. Nested loops to handle the low digits can run very efficiently, just incrementing their own digit counter and having the outer loop reset it to '0', no need to explicitly propagate those carries with adc. Keeping those ASCII digits in registers also avoids the need to load/store them to counts[], just pure stores to video RAM, like mov [di-4], eax.

    With very efficient inner loops for the low digits, performance of the upper 6 or 7 digits becomes nearly irrelevant. That part runs once per 10k or 1k increments so its cost is amortized. (~19x speedup for aggressively optimized inner loops vs. a micro-optimized version of your original loop that saves some uops and avoids some bottlenecks without changing the algorithm.)
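
The alignment fix for the first point is trivial in the 16-bit boot-sector version: pad so that counter starts in its own 64-byte line. 7C00h is 64-byte aligned, so aligning relative to the section start is enough (sketch):

align 64            ; keep counter out of the cache line that holds the loop code
counter:
    times 10 db '0'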

Other micro-optimizations of your original (after fixing the SMC machine clears) gave a factor of ~1.5x speedup: making the carry branch normally not-taken, saving some uops, avoiding some partial-register false dependencies from lodsb and writing 16-bit partial registers.

With the optimized 4 levels of inner loops I rewrote from scratch, my version is about 29x faster on Skylake / Kaby Lake than the no-SMC-stall version of the original, or ~1500x faster than the true original. There's certainly a middle ground where you do adc carry propagation but take an early out when CF==0; I didn't try to implement that.
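
An untested sketch of that middle ground (using inc / cmp against '9' rather than adc, but the same early-out idea, and keeping the original layout: counter+9 = ones digit, DI = its screen cell):

    mov  si, counter+9
    mov  di, pos
.carry:                         ; do {
    mov  al, [si]
    inc  al                     ;   apply the incoming carry
    cmp  al, '9'
    jbe  .done                  ;   digit didn't wrap: store it and stop
    mov  byte [si], '0'         ;   '9' -> '0', keep rippling
    mov  byte [es:di], '0'      ;   only the char byte; attribute is already set
    dec  si
    sub  di, 2
    jmp  .carry                 ; } while the carry keeps propagating
                                ; (termination after all 10 digits wrap is omitted)
.done:
    mov  [si], al
    mov  [es:di], al            ; higher digits that didn't change are never stored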

Tested in 32-bit mode, but the same code assembled for 16-bit mode should execute the same way, including the SMC stalls in your original. (Assuming WC stores don't trigger an SMI until flushed, and that the WC buffer keeps the stores local inside the core so ~1 store / clock is possible just like with WB memory.)

SKL and KBL are clock-for-clock identical in perf, same microarchitecture, so my test results should be reproducible for you. I did assemble your code in 16-bit mode to see alignment: it looks like your loop will have some bytes of count[] in the same 64-byte cache line as the end of the loop, hence a SMC pipeline nuke per iteration for most digits.


I adapted your original code so I could run the same loop in 32-bit mode under Linux, making it possible to use perf to profile with HW performance counters. The first step in optimizing anything is to get a baseline measurement. Since some of the micro-optimizations you mention matter for micro-architectural reasons, we want perf counters, not just total time. We can't easily get that in a bootloader on bare metal. Possibly in a guest VM, but then you'd be storing to a virtual VGA device, not real hardware, so it's probably not different from using normal or NT stores on normal WB memory in user-space under Linux.

perf stat -I1000 to show counters for amount of work done every second is a handy way to compare speed for tweaks that don't change the algorithm or number of branches. Look at counts for branches in 1 second to see relative speed of the loop, or divide that by cycles.

I used movnti to try to simulate a store to WC video RAM (uncacheable speculative Write-Combining, instead of normal WB = write-back cacheable). I think normal stores to WC memory regions behave like movnt stores. movnt stores that don't complete a cache line can keep updating the same write-combining LFB without actually flushing to memory. So it's similar to a normal store to WB memory that can hit in L1d cache.

SMI trapping of framebuffer stores (if done at all) is done by hardware outside the CPU core, probably the System Agent, so it doesn't fire until the core flushes. Or if no SMI trap, then probably it just goes to DRAM on our iGPU systems. Or over a PCIe bus to get to video RAM on a separate card.


Versions timed under GNU/Linux kernel 5.5.10 on i7-6700k on a somewhat idle system at ~4.2GHz

DRAM and cache are barely involved, and the system was idle enough that nothing was taking cycles on the other logical core of the physical core, so the code had a whole CPU to itself the whole time to spam stores into a write-combining buffer.

  • Original version, ported to run in 32-bit user-space: Godbolt - not fully timed, but perf stat -I1000 to print stats per second shows it's running about 52x slower than with align 64 before counter:. The pipeline nuke may include flushing WC buffers which would mean going to DRAM as well.
  • Original version, with SMC pipeline nukes avoided: ~85.7 seconds, ~358 billion core clock cycles for 10^10 counts. 2.66 IPC
  • Micro-optimized version of that: Godbolt - ~55.3 seconds, ~231 billion clock cycles for 10^10 counts. 4.56 IPC (but from more, simpler instructions, avoiding lodsb)
  • New inner loops, empty placeholder outer loop: Godbolt - ~2.93 seconds, ~12.25 billion core clock cycles. 2.73 IPC

The optimized version achieves close to 3 stores per 4 clocks. (Counting the low 2 digits from 00..99 takes 100 stores, the way it does it. I didn't time these final versions with clflushopt.)


If you fixed some of the stalls and stopped your loop with CF==0, this would result in a bottleneck on store/reload (store-forwarding) latency to the low element of the count array. You definitely want those in registers so they can be store-only, not load/adc/store.
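
To illustrate the difference (sketch only, not the tested code below):

    ; loop-carried chain through memory: load + adc + store-forwarding,
    ; roughly 6 cycles per iteration just for the ones digit
    mov  al, [counter+9]
    adc  al, 0
    mov  [counter+9], al

    ; vs. keeping the digit in a register: the loop-carried dependency is a
    ; 1-cycle add and the store is off the critical path (digit wrap handled elsewhere)
    add  ax, 1                  ; AL = ASCII digit, AH = attribute byte
    mov  [es:di], ax            ; store-only; nothing ever reloads it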

TODO: comment and talk about the micro-optimizations that I applied for that version:

My version with nested loops, testable on GNU/Linux

This is only the inner loop; the outer loop just repeats it 10^10 / 10k times with no actual outer-loop work. We only leave the inner 4 loops once per 10k increments so pretending that part takes zero time doesn't change the result particularly.

The same pattern of 2 nested levels of looping per register could be repeated more times, or just do a chain of adc like you were doing.

;; nasm -felf32 decimal-counter.asm
;; ld -N -melf_i386 -o decimal-counter decimal-counter.o
;; writeable text segment like a bootloader
;; runs in 32-bit mode with prefixes for 16-bit operand-size
;;
;; taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,resource_stalls.any:u,rs_events.empty_cycles:u,machine_clears.count:u -I1000 ./decimal-counter

%use smartalign
alignmode p6, 64

;org 7c00h

;pos equ vram + 2*(2*80-2)  ;address on screen
pos equ vram + 2*(2*80-4)  ;address on screen

    ; In GDB, use
    ; p ((char*)&vram) + 2*(2*80-4)-36

;init
;cli
;mov ax,3
;int 10h
;mov ax,0b800h
;mov es,ax
;jmp 0:start


 ; pick your poison, or let stores stay in the CPU, not reaching VRAM
%macro FLUSH 1
 ;  clflushopt %1           ; all the way to DRAM
 ;  mfence                  ; for mov to WB: just drain store buffer.  For WC or movnt, IDK how guaranteed it is to hit DRAM
;   lock xor byte [esp], 0   ; faster version of mfence (at least on Skylake)
%endmacro
;%define movnti mov         ; for experiments

global _start
align 512
_start:
;    push cs
;    pop ds
;    mov ebp, counter+9    ; save address in a register
;    mov edi,pos
    mov edi, pos - 10*4
    mov eax, '0_0_'
    mov ecx, 10
    rep stosw                   ; memset the digits in VRAM

    mov  ebp, 10000000000 / 10000     ; outer loop iterations
    mov edi, pos-4

;    mov ah, 4Eh         ; VGA attribute byte
;    mov eax, '____'

align 32
.outer:

    mov  edx, '0_0_'           ; thousands (low), hundreds (high) digits
.thousands:
 .hundreds:
    movnti  [edi-4], edx
    ; don't want to flush yet; only after low digits are updated
    add  edx, 1<<16

    mov  eax, '0_0_'            ; tens (low=AX), ones (high) digits
    .tens:
        .ones:                  ; do{
          movnti  [edi], eax         ; store low 2 digits
        FLUSH [edi]
          lea  ecx, [eax + (1<<16)]       ; off the critical path of the EAX dep chain
          movnti  [edi], ecx
        FLUSH [edi]
          add  eax, 2<<16               ; unroll by 2
          cmp  eax, '9_'<<16
          jle  .ones            ; }while(ones<='9')
                   ; mov byte [edi+2], '9'    ; peel the last 2 iterations?

        add  eax, ('1_0_') - ('0_0_' + (10<<16))     ; increment the more-significant digit (AL), resetting less-significant digit back to '0'
        cmp  al, '9'
        jle  .tens

    cmp  edx, '9_9_'
    jle  .hundreds

    add  edx, ('1_0_') - ('0_0_' + (10<<16))     ; increment the more-significant digit (DL), resetting less-significant digit back to '0'
    cmp  dl, '9'
    jle  .thousands

;; TODO: increment the high 6 digits, propagating carry.  Possibly clflushopt here only?
;    pause
    dec ebp
    jnz .outer
    ;    jmp $
    mov eax, 1
    int 0x80


;section .data   ; avoids machine clears
    ; in original 16-bit code: counter starts at 00000037 30<rept>, ends at 00000040 (inclusive), in same cache line as the loop
align 64
counter:
    times 10 db '0'
;section .text

    times 510-($-$$) db 0
    dw 0aa55h

section .bss
vram:   resw 80*25

I have tested that this works for the low digits, single-stepping it in GDB and using display ((char*)&vram) + 2*(2*80-4)-36 or something like that to show the contents of that part of BSS as a string every step.

Using dword stores means that when the ones place wraps, we don't need a separate store to update the tens place. It just has to update the low byte of the same register and let the first iteration of the inner loop do that store.

During rollover from 0099 to 0100, memory contents are temporarily 0199. But unless you use SSE to store 16 bytes at once, you can't really avoid one problem or the other. The other option would be to somehow arrange for 0000 before 0100, but that might waste a store to the tens/ones dword in the hundreds loop.
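
If you did want to go that route, a sketch (assuming SSE2, and a hypothetical 16-byte staging buffer digit_pairs holding the 8 low char+attribute cells, maintained with ordinary register stores):

    movdqu  xmm0, [digit_pairs]  ; 8 char+attribute pairs, built elsewhere
    movdqu  [edi-12], xmm0       ; one 16-byte store covers all 8 cells at once
                                 ; (movntdq would need a 16-byte-aligned destination)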

Peter Cordes
  • I tried taking the original code without any changes and putting `align 64` and `align 256` before the counter. In neither case did I get any significant improvement, definitely not a 50x increase in speed. Did I misunderstand something in your answer? – Dex Apr 28 '20 at 16:55
  • @Eldan: Maybe there's some reason why self-modifying-code stalls don't actually happen in the 16-bit bootloader version of this code? If so, then there's no slowdown from that to fix in the first place and we can "only" get the 29x speedup from the other change :P If writing to actual VGA memory is uncacheable (not WC), or otherwise *doesn't* get eaten up by write-combining buffers, that could possibly get those stores out of the pipeline before execution gets close (and be a huge bottleneck that hides most inefficiency other than writing digits in video memory that haven't changed.) – Peter Cordes Apr 28 '20 at 23:04
  • I tried using efficient inner loops (values in registers, reading/writing as DWORDs of two values and two pieces of color information) and, surprisingly, got no real speedup from them other than the already known x9.7 from reducing unnecessary loops. Other than that I tried switching to a BCD-based approach instead of cmp; that also made no difference in performance at all. I am starting to think that many of the optimization features in the CPU might only be enabled after leaving 16-bit real mode in the bootloader. Can you try to run this as a bootloader on real hw? – Dex Apr 28 '20 at 23:38
  • 1
    @Eldan: I'm sure that CPUs still operate the same way in 16-bit real mode, with superscalar out-of-order exec. **My guess is that your video RAM is mapped UC (uncacheable)** not WC, so the only significant improvement is fewer total stores to video RAM. That would perfectly explain your ~10x speedup. The number of instructions between stores to video RAM is nearly insignificant. I can simulate that with `mfence` or other memory barrier in 32-bit mode where it kills performance by a factor of ~130 (at 4.2GHz, less at lower clock speed where the CPU isn't going as many times faster than RAM) – Peter Cordes Apr 29 '20 at 00:27
  • @Eldan: And sorry no, I don't want to reboot my desktop ATM, and don't feel like digging up an old Core 2 laptop. I think UC video RAM is a sufficient explanation for the effect you're seeing so there's really no room to gain anything more. Maybe only that last `.3x` of the speedup by using MMX or SSE to do 8 or 16 byte stores, so we only need more than 1 store every 10k iterations instead of every 100 or every 10. – Peter Cordes Apr 29 '20 at 00:34
  • @Eldan: I just remembered that some of my older desktops have an option to make video memory WC (aka USWC). See updated answer. Also, your bootloader could program the MTRRs to make that region of physical address space WC instead of UC on your real hardware, giving that ~100x speedup if you avoid SMC stalls. – Peter Cordes Apr 29 '20 at 03:21
  • Thank you for the explanation. What kind of information should be written to MSR[259]? I wasn't able to find that anywhere. – Dex Apr 29 '20 at 15:29
  • @Eldan: I assume it's 8x 8-bit fields, but none of the "summary" documentation like wikipedia or osdev get into the low level details of the format of those fields. Obviously it's somewhere in Intel's SDM vol.1 or vol.3 manuals, but I didn't go looking. – Peter Cordes Apr 29 '20 at 15:31
  • I found the value of the MSR described in figure 11-8 in SDM vol 3. It seems that the VRAM did get set to WC, because now there is no display output unless i manually flush the cache with `wbinvd`. The issue is that the flush is insanely slow, as it flushes all the caches, not just VRAM. Is there a way to make the WC VRAM actualy write back more often? – Dex Apr 29 '20 at 16:33
  • 1
    @Eldan: Cool! And lol, yes `wbinvd` is insanely slow, flushing all caches from *all cores* even, so slow it requires kernel privileges to even execute in protected mode. I played with explicit flushes some on my desktop between `movnti` stores: `clflushopt [di]` flushes just that cache line. (And makes sure it makes it to real memory; it's usable for non-volatile DIMMs like Optane DC PM persistent memory (see [this answer for links](//superuser.com/questions/1389552/can-intel-optane-memory-compensate-for-less-ram/1391391#1391391)). `mfence` or a dummy `lock`ed operation are also memory barriers. – Peter Cordes Apr 29 '20 at 17:02
  • `clflushopt` did indeed improve the performance. Right now I'm flushing after every cycle through the counter and I don't see any performance gains at all. That could mean my VRAM was WC all along or that I'm doing something wrong again. – Dex Apr 29 '20 at 17:12
  • 1
    @Eldan: Updated my answer with a version of the code with a FLUSH macro that can use one of 3 instructions, or none to test the fast case. Also might be worth trying `mfence` or `lock xor byte [esp], 0` as memory barriers instead of clflushopt: with movnti stores to WB memory, `lock xor` is the fastest one by ~2x over clflushopt. I assume it makes it to VRAM. More likely your VRAM was originally UC, and explicit flushing with `clflushopt` on WC memory replicates the UC behaviour of waiting for data to get all the way to DRAM or device memory. – Peter Cordes Apr 29 '20 at 17:17
  • If you set MTRR and don't use any flushing it's blazing fast, though, like 3 seconds, where it wasn't before? That seems to prove it wasn't WC by default and that MTRR had an effect. If you hit keys on the keyboard, does the interrupt handler end up doing enough to get the WC buffer evicted to VRAM? Maybe hold down a key to see the screen update :P – Peter Cordes Apr 29 '20 at 17:19
  • 1
    What I meant is that after setting MTRR and flushing with `clflushopt` my performance is equal to what it was without doing any of this – Dex Apr 29 '20 at 17:22
  • 1
    @Eldan: Yes, I understood that, and that makes sense. Getting data all the way to video RAM is inherently high latency. (High bandwidth is possible in general but probably not to the same cache line. Here it seems `clflushopt` waits as long as UC memory would before the next store can even start.) `lock xor byte [esp],0` might be a faster way to flush WC buffers to memory if that or `mfence` can have more stores in flight even to the same line. Do you have a discrete GPU (CPU has to go over PCIe), or is your "VRAM" actually still just main memory (connected to CPU)? – Peter Cordes Apr 29 '20 at 17:28
  • I'm using the iGPU, so it should be main memory. Using `lock xor byte [esp],0` again lead to no noticeable improvement. – Dex Apr 29 '20 at 17:33
  • @Eldan: Too bad. :( On my desktop, I get about 125M instructions / second running with `lock xor`, vs. 89M IPS with `clflushopt [edi]`. Not a huge diff but easily measurable with perf counters when storing to WB DRAM with `movnti` stores, which I hoped would be similar to how iGPU is set up. I flush only after modifying the ones / tens digits, with the other digits already updated but I am flushing for every increment of the ones digit. (Could hoist it out to the tens loop to go 10x faster.) – Peter Cordes Apr 29 '20 at 17:39
  • Not flushing after every update of the ones digit would definitely greatly improve performance, but our requirements state that every single change of a number has to be visible on screen. Otherwise why even bother increasing by one if we can just increase by ten and get the needed improvement :) – Dex Apr 29 '20 at 17:44
  • @Eldan: I philosophised some about that in my answer; at a 60Hz or even 144Hz refresh rate, even storing to UC memory is vastly faster than can actually be *visible* in photons coming out of the screen. One interesting point is that flushing due to timer or keyboard interrupts could sample the counter and prove that the low digits really were incrementing and could display as non-zero. (Sampling the low digits of a fast counter is more or less random if the timing of the sample isn't locked to the counter steps.) I had hoped that there'd be some timer ints on by default. – Peter Cordes Apr 29 '20 at 17:52
  • 1
    @Eldan: Hmm, so maybe set up a timer or vblank interrupt to run `mfence` every 10 ms or something, letting the counter run free except for flushing close to vertical blank interval. Every screen refresh would show a new counter value, and the actual work of computing the ASCII digits for every number is done. Fulfills the requirement in spirit at least, IMO. – Peter Cordes Apr 29 '20 at 17:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212805/discussion-between-eldan-and-peter-cordes). – Dex Apr 29 '20 at 18:05
  • @Eldan: I asked [Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?](https://stackoverflow.com/q/61521819) as a followup about how hardware + firmware implement or emulate VGA text mode. It seems at least some systems do trap stores to the frame buffer with System Management Mode, making it vastly slower than just a store to DRAM. It would be funny if a retro 386 or 486 PC with hardware VGA was actually faster at this than your Kaby Lake :P – Peter Cordes Apr 30 '20 at 11:00
  • @Eldan: Updated with discussion of what it can / can't be justified as an optimization. I think normal stores in the loop are essential, but we can install a vblank or timer interrupt handler and set that up to get actual flushes. – Peter Cordes Apr 30 '20 at 16:25
  • Forgot to add: https://wiki.osdev.org/VGA_Hardware#Alphanumeric_Mode implies that a VGA text mode with the attribute plane *not* interleaved with the character plane might be possible. So we could do 4 bytes of characters in EAX. (My answer hit the 30k char limit so I can't fit that in until I trim some more fat...) – Peter Cordes Apr 30 '20 at 16:26
1

When you write to the frame buffer, it's best to think of it as sending a packet on a network. The "write packet" has a header containing an address, a size, the data (plus maybe checksum/parity). If you write one byte, the data part of the packet will be dwarfed by the size of the packet header, so most bandwidth will be wasted. To get efficient use of available bandwidth you want fewer larger writes. Write combining can help (combining multiple small writes into a single large write for you) but it should be treated as a potential minor improvement after you optimize the writes yourself, not an excuse to fail to optimize the writes.

Assuming "generic 32-bit 80x86 CPU" (e.g. 80486 with no SSE or AVX); your main goal should be to arrange the data as five 32-bit writes; where each 32-bit write contains two "char + attribute" pairs. In other words, the writes should look a bit like:

    mov di,pos
    mov [di],eax
    mov [di+4],ebx
    mov [di+8],ecx
    mov [di+12],edx
    mov [di+16],esi

Note: There is nothing wrong with using 32 bit instructions in real mode or in 16 bit code (as long as the CPU is 80386 or later).

However, it's a counter. That means that 99% of the time you'd only need to do one write (which would also make write combining 99% worthless). More specifically, you only need the second write if the lowest 2 digits roll over (from "99" to "00"), and you only need the third write if the lowest 4 digits roll over (from "9999" to "0000"), etc.

So.. let's initialize a counter:

    mov di,pos
    mov eax,0x4E304E30
    mov ebx,0x4E304E30
    mov ecx,0x4E304E30
    mov edx,0x4E304E30
    mov esi,0x4E304E30
    mov [di],esi
    mov [di+4],edx
    mov [di+8],ecx
    mov [di+12],ebx
    mov [di+16],eax

Then you want to increment it and update the screen:

.update:
    add eax,0x00010000
    cmp eax,0x4E390000
    ja .digit1rollover
    jmp .done1

.digit1rollover:
    add eax,0x00000001-0x000A0000
    cmp al,0x39
    ja .digit2rollover
    jmp .done1

.digit2rollover:
    mov eax,0x4E304E30
    add ebx,0x00010000
    cmp ebx,0x4E390000
    ja .digit3rollover
    jmp .done2

.digit3rollover:
    add ebx,0x00000001-0x000A0000
    cmp bl,0x39
    ja .digit4rollover
    jmp .done2

.digit4rollover:
    mov ebx,0x4E304E30
    add ecx,0x00010000
    cmp ecx,0x4E390000
    ja .digit5rollover
    jmp .done3

.digit5rollover:
    add ecx,0x00000001-0x000A0000
    cmp cl,0x39
    ja .digit6rollover
    jmp .done3

.digit6rollover:
    mov ecx,0x4E304E30
    add edx,0x00010000
    cmp edx,0x4E390000
    ja .digit7rollover
    jmp .done4

.digit7rollover:
    add edx,0x00000001-0x000A0000
    cmp dl,0x39
    ja .digit8rollover
    jmp .done4

.digit8rollover:
    mov edx,0x4E304E30
    add esi,0x00010000
    cmp esi,0x4E390000
    ja .digit9rollover
    jmp .done5

.digit9rollover:
    add esi,0x00000001-0x000A0000
    cmp si,0x4E39
    ja .digit10rollover
    jmp .done5

.digit10rollover:
    mov esi,0x4E304E30
;   jmp .done5

.done5:
    mov [di],esi
.done4:
    mov [di+4],edx
.done3:
    mov [di+8],ecx
.done2:
    mov [di+12],ebx
.done1:
    mov [di+16],eax

You also want a loop around this. Fortunately bp/ebp is still unused so that's no problem (just don't forget to set bp to something in the initialization):

.done:
    dec bp
    jne .update
Brendan
  • Remember the digits need to be in *printing* order, least significant at highest address `[di+16..19]`. Also affects the order within a dword; high half in the inner loop. The big code block near the end of my answer has a tested version of this that I've single-stepped with GDB to check that it goes `0_0_0_0` to `0_0_0_9` first, and so on (I used `_` instead of `0x4E` for easier readability). (And yes, I got it backwards the first try, too :P). Note that outer-loop updates don't need to store the inner counters; they can leave that for the next inner loop iteration. – Peter Cordes May 01 '20 at 00:58
  • Also, https://wiki.osdev.org/VGA_Hardware#Alphanumeric_Mode implies that a VGA text mode with the attribute plane not interleaved with the character plane might be possible. If so we could do 4 bytes of characters in EAX, not redundantly storing the attribute bytes. (My answer hit the 30k char limit so I didn't fit that idea in there yet.) – Peter Cordes May 01 '20 at 01:01
  • 2
    @PeterCordes: Argh - you're right (I got the order of characters wrong). For "de-interleaving planes", I wouldn't trust that "VGA compatible" is compatible enough - that same wiki page even documents differences in "chain 4 bit" handling between emulators, ATI and NVidia. – Brendan May 01 '20 at 01:17
  • You have some jcc-over-a-jmp inefficiencies. e.g. `ja .digit7rollover` / `jmp .done4` could simply be `jna .done4`. Also, I think you need `dec bp` / `jnz .update` to count to 10e10, but it's only a 16-bit counter (and even 32-bit wouldn't be enough). You only need to check that you might be done when the MSD rolls over; otherwise you know you're not and can stay in the inner loop. – Peter Cordes May 01 '20 at 02:44
  • (If you don't play tricks with WC + timer or vblank memory barrier, some of these inefficiencies don't matter, but I had fun optimizing the inner loop in my answer.) – Peter Cordes May 01 '20 at 02:48
  • I ended up doing something very similar to your answer and that did solve the problem and provide a large enough speedup. It is also probably the easiest way of doing this, not requiring any advanced optimisation techniques. I put my full solution in the OP. – Dex May 01 '20 at 10:46
1

Thanks to the feedback and discussion that took place here (especially thanks to Peter and his dedication), I was able to identify the main source of the slowdown: writing to VRAM, as that memory is uncacheable.

The only two meaningful optimizations are thus breaking out of the loop as soon as we lose the carry while adding (so that we don't unnecessarily add zero to every single digit and spend time printing it to screen) and combining as many WORD-sized writes as possible into DWORD-sized ones. These two changes combined were able to push me across the 10x speedup mark.

My solution (speedup x10.3):

org 7c00h
bits 16             ;assemble as 16-bit code; 32-bit instructions get operand-size prefixes
pos equ 2*(2*80-2)  ;address on screen

;init textmode and vram, fix CS
cli
mov ax, 3
int 10h
mov ax, 0B800h
mov es, ax
jmp 0:start

start:
    ;fix segments and stack
    mov bp, 7C00h
    xor ax, ax
    mov ds, ax
    mov ss, ax
    mov sp, bp

    ;print initial zeroes
    std
    mov ax, (4Eh << 8) + '0'
    mov cx, 10
    mov di, pos
    sub di, 2
    rep stosw

    ;set color into upper byte of DX
    mov dh, 4Eh

counter_loop:
    cmp cx, 5           ;check whether we are incrementing the first two digits
    je two_digit_loop   ;if so, assume values are set correctly

    ;reset values back to start
    mov bx, counter     ;set counter pointer to first two digits
    mov ax, [bx]        ;load first two digits
    mov di, pos         ;set destination index to the position of the rightmost digit on the screen
    mov cx, 5           ;set number of digit pairs to 5

two_digit_loop:
    ;increment and adjust
    inc ax
    aaa
    jc carry

    ;no carry, update digits and return
    mov dl, al
    or dl, 30h              ;digit to ascii
    mov [es:di - 2], dx     ;write character to screen
    mov [bx], al            ;save value to memory
    jmp counter_loop

carry:
    mov edx, 4E304E30h      ;load '00' in colour
    mov [bx], ax            ;save value to memory
    cmp ax, 0A00h           ;test second digit overflow
    jge continue

    ;no carry on second digit, write and return
    or dl, ah               ;digit to ASCII if not 0x0A
    mov [es:di - 4], edx    ;write both characters at once
    jmp counter_loop

continue:
    ;propagate carry to next digit pair
    mov [es:di - 4], edx    ;write zero as both characters (double-sized write)
    mov [bx + 1], ch        ;save zero as upper value to memory

    ;continue to next digit pair
    add bx, 2           ;move memory to next digit pair
    mov ax, [bx]        ;load next digit pair
    sub di, 4           ;move display pointer by two char+colour pairs
    dec cx              ;and decrement counter
    jne two_digit_loop

    ;we ran out of digits to increment, display arrow and halt
    mov ax, 4E18h
    stosw
    jmp $

;counter, positioned at least 64B away from the code to prevent nuking the instruction pipeline
align 128
counter:
    times 10 db 0

times 510 - ($-$$) db 0
dw 0aa55h
Dex
  • It would be more efficient to always `mov [bx], ax` and do a word store, instead of sometimes doing an extra store of the high byte. Storing a byte is no faster than storing a word into cache, and it saves the code-size of doing `mov [bx + 1], ah` later. It also avoids a store-forwarding stall when you reload `ax` after storing just 1 byte. You do that store *after* storing to UC VRAM; if you stored AL or AH first then the store buffer would have drained while doing that UC store. – Peter Cordes May 01 '20 at 11:12
  • Edited to reflect your suggestion. I have to note, however, that this has no significant impact on performance because of the VRAM slowness overshadowing any other improvements that are made. – Dex May 01 '20 at 11:15
  • Yeah, of course it's pretty trivial, that's why I mentioned code-size as the first advantage. :P In the question you were talking about micro-optimizations like avoiding `loop`, so it seemed weird to post an answer with lots of inefficiency like that and multiple taken branches in the inner loop. (even though it's mostly dwarfed by the VRAM bottleneck) – Peter Cordes May 01 '20 at 11:20
  • Frankly I was getting too tired of working on this almost nonstop for the past four days so I just went with whatever worked, overlooking all the minor imperfections. Especially since my measurement methods are flawed as there is no easy and reliable way to measure the runtime of this program (except maybe storing the tick count to memory before and after the run and subtracting them). I want to revisit this in a few days and try to squeeze the most out of it, but not today. – Dex May 01 '20 at 12:09
  • 1
    Ok sure, that's fair. And yeah, you can use `rdtsc` before / after to record a wall-clock time in "reference cycles". See [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) for RDTSC background. Also, you could test that VGA stores aren't incrementing `MSR_SMI_COUNT` (0x34) to confirm Brendan's answer on [Does modern PC video hardware support VGA text mode in HW?](https://stackoverflow.com/q/61521819). `rdmsr` is easy to use, very much like `rdtsc`: https://www.felixcloutier.com/x86/rdmsr. Printing the results before/after is more work. – Peter Cordes May 01 '20 at 12:24
  • @Eldan Are you sure that `ALIGN 128` does insert at least 64 bytes between the code and the *counter* variable? I count 128 or 125 code bytes (depending on those 3 `jmp` instructions being encoded with 3 bytes or 2 bytes. This makes for an alignment of just **0** or **3** bytes! Why don't you simply write: `times 500 - ($-$$) db 0` `counter: times 10 db 0` `dw 0aa55h`. – Sep Roland May 03 '20 at 19:13