NASM x86 Assembly Optimization [Linear congruential generator]

Question

I have tried writing my first assembly program (NASM x86 assembly). It is an linear congruential generator, or the plan was that is is one. Because of limitations of my assembly skills I had to change the starting parameters of the pseudo random number generator (prng):

name: before: after

seed:214567:21456

a:22695477:16807

c:1:0

m:2^32:2^24

%include "io.inc"
section .data
;cheating with the numbers by putting a 0. infront of them so they look like floats
msg db "0.%d",10,0
msgDone db "done",10,0
msgStart db "start",10,0
;see https://en.wikipedia.org/wiki/Linear_congruential_generator
seed dd 21456
a dd 16807
c dd 0
m dd 16777216
section .text
global CMAIN
extern printf
CMAIN:
    mov ebp, esp; for correct debugging
    ;create stack frame
    push ebp
    mov ebp, esp
    ; edi is used for the loop count
    mov edi, 0
    push msgStart
    call printf
    generate:
        ;seed = (seed*a+b)%m
        mov eax, [seed]
        mov ecx,[a]
        ;seed*a
        mul ecx
        ;seed*a+b
        add eax,[c]
        mov ecx, [m]
        ;(seed*a+b)%m
        div ecx
        mov eax,edx
        mov [seed], edx
        ;push to print them out
        push eax;disable for testing
        push msg;disable for testing
        call printf ;disable for testing
        add esp,8
        inc edi
        cmp edi, 12 ; set this to 1000000000
    jne generate
    ;destroy stack frame
    push msgDone
    call printf
    xor eax, eax
    mov esp, ebp
    pop ebp
    ret

I have written the same code in java and in NASM x86 assembly and expected an performance increase relative to java with NASM. But in the end it took 3 times longer than java did. Nasm ~ 10sec , Java ~ 3-4 sec. My test was letting each program generate 1.000.000.000 random numbers. Of course without the System.out.prints/ call printf because they needed a lot of time.

I knew from the beginning that my assembly code would be badly optimized, because I have never programmed in it before and now just wanted to try it, but that the performance would be that bad is something I never had excpected.

Can someone tell me why my programm needs so much time and how assembly programs can be optimized?

Thanks

---The code was written with SASM on Win10 64bit

"Win10 64bit" - Why aren't you writing 64bit code then? Having more registers (r8-r15) can make your life a *lot* easier. And "dec edi ; jnz generate" gets rid of `cmp`. — David Wohlferd, Sep 30 '17 at 21:45
You're using `div` to divide by a power of 2. This is about 30x slower than a shift on Intel Haswell / AMD Bulldozer. I this is pretty much a duplicate of another question about optimizing asm and using `div` instead of a shift. Good question BTW; knowing that your missed something when optimizing and asking specifically what is the right question. See http://agner.org/optimize/ and other links in the [x86 tag wiki](https://stackoverflow.com/tags/x86/info). — Peter Cordes, Sep 30 '17 at 23:37
Oh actually, I just looked at your numbers. modulo 2^32 happens literally for free; integer overflow wraps in x86, just like in a uint32_t in C. So you could run faster if you replaced your `div` with zero instructions instead of a shift! That might make this not *quite* a duplicate, let me know if you're unhappy with it being closed as a dup. — Peter Cordes, Sep 30 '17 at 23:43
As usual, to get pretty good fast code, ask a C compiler. gcc and clang both produce a loop with 4 instructions, with a critical path of just `imul` / `add` / `and`: https://godbolt.org/g/ndMzpo. So it will run at one iteration per 5 clocks on Intel Core2 or later, and AMD K8 or later (and probably earlier too). They use what @DavidWohlferd suggested, doing dec/jnz. (And for *this* case, there's no benefit to 64-bit. You only need a couple registers. gcc literally only uses 2, plus implicit RSP and RIP-relative addressing for the static LCG state.) — Peter Cordes, Sep 30 '17 at 23:57
Thank you for your help! After using 'dec edi ; jnz generate' instead of 'cmp', removing the 'div' completly and removing 'mov [seed], eax; mov eax,[seed]', the program now only needs ~1.58sec instead of former ~10sec. And with setting b to 0 and removing 'add eax,[b]' it only needs ~1.35sec thats more than 7x faster. The fact that you can trick around so much in Assembly astonishes me. The reason why I did not use 64bit was actually that I could not get printf to work and using PRINT_STRING for numbers was just pure horror. — The Kraken, Oct 01 '17 at 09:51
I would really like to upvote your comments, but it seems like my reputation is not high enough to do that:( — The Kraken, Oct 01 '17 at 10:01
A different function calling convention is used in 64-bit mode, with args in registers. See the [x86 tag wiki](https://stackoverflow.com/tags/x86/info) for the Windows and the non-Windows (System V) x86-64 ABI / calling convention docs to find out which registers, and the "shadow space" in the Windows convention. — Peter Cordes, Oct 01 '17 at 17:20
BTW, don't forget to include "@peter" to notify people of replies. I just happened to notice your reply while looking at this question again to see if anyone else commented. Anyway, glad you're having fun optimizing. :) And yeah, the store/reload of the seed inside the loop is about 3 to 5 cycles of latency, so getting rid of that is a major improvement. Keeping `m` in a register and using `imul eax, ecx` or something would work well, too. (Although with your loop bottlenecked on latency, `imul eax, [m]` would be just as fast.) That should be 1c lower latency than 32-bit `mul [m]`, BTW. — Peter Cordes, Oct 01 '17 at 17:25
For assembly code optimization I highly recommend keeping an eye out for register contents after each instruction in your debugger if numbers/flags do not reveal some shortcuts (write them after each instruction if it helps). If you ever write loop within loop in future, avoid spacings before instructions because you might miss optimizations that could not be represented in C/C++ without `goto` statement. — BalticMusicFan, Jan 21 '18 at 14:19

NASM x86 Assembly Optimization [Linear congruential generator]

0 Answers0