 mov eax,0
 mov ecx,100 

loop2_start: mov eax,ecx
             add eax,ecx 
             imul eax,eax  ;To act as n^2
             jle loop2_start

loop2_end: call print_int
           call print_nl

I'm trying to find the sum of all the squares from 1 to 100. When I print this, it gives me 40000 when the answer should be 338350. What should I change to make it work? Perhaps my jump instruction?

  • Adding some comments about what you think each instruction does would be very helpful to someone trying to answer this question. Or even just a high level idea of the approach. I expected an `inc` or `dec` somewhere but I don't see it so I have no idea about the algorithm you're using – lakshayg Oct 29 '20 at 01:20
  • Just a hint: 200 * 200 is 40,000. That's what your code does. – zx485 Oct 29 '20 at 01:28
  • How would I use the `cmp`? – user14358127 Oct 29 '20 at 01:35
  • Single-step this in a debugger. It just does `(100 + 100)^2` and falls through the JLE the first time, based on FLAGS set by `imul` (OF cleared, SF and ZF undefined (but apparently cleared for [`jle`](https://www.felixcloutier.com/x86/jcc) to be not-taken)). Obviously your loop condition isn't *supposed* to be based on squaring the total, and the mov / add / imul logic isn't right either. – Peter Cordes Oct 29 '20 at 01:40
  • Any advice on where to start fixing? – user14358127 Oct 29 '20 at 01:50
  • Write the logic in C, or pseudocode or whatever, including the loop logic, and translate that to asm. (With asm in mind, you can use a `do{}while()` loop structure). Your current code between `loop2_start` and `jle` is completely broken, throw it out and start over. – Peter Cordes Oct 29 '20 at 01:58
  • See [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) for standard asm loop structure – Peter Cordes Oct 29 '20 at 02:13

1 Answer


Let's start with some maths.

For positive values of n, if:

lastSquare = n*n

Then:

nextSquare = (n+1)*(n+1)
           = (n * n) + (n * 1) + (1 * n) + 1
           = (n * n) + 2 * n + 1
           = lastSquare + 2 * n + 1

In other words (using "C like" pseudo-code) you can do:

    square = 0;
    sum = 0;
    for(n = 1-1; n <= 100-1; n++) {
        square = square + 2 * n + 1;
        sum = sum + square;
    }

For 80x86 assembly, square = square + 2 * n + 1 can be done in a single fast instruction - e.g. lea ebx,[ebx+ecx*2+1].

This gives something like (NASM syntax, untested):

; Loop preparation

    xor ebx,ebx           ;ebx = square = 0
    xor eax,eax           ;eax = sum = 0
    xor ecx,ecx           ;ecx = n = 1-1

; Loop

.next:
    lea ebx,[ebx+ecx*2+1] ;ebx = square + 2 * n + 1
    add eax,ebx           ;eax = sum + square

; Loop end

    inc ecx
    cmp ecx,100-1
    jbe .next             ;loop while n <= 100-1, so the last square added is 100*100

However, the body of the loop is small compared to the loop's overhead, so it'd be fun to unroll the loop a little to reduce that overhead. 99 is odd, but it is a multiple of 3, so you can unroll by 3 for the first 99 squares and add the 100th square after the loop:

; Loop preparation

    xor ebx,ebx           ;ebx = square = 0
    xor eax,eax           ;eax = sum = 0
    xor ecx,ecx           ;ecx = n = 1-1

; Loop

.next:
    lea ebx,[ebx+ecx*2+1]   ;ebx = square + 2 * n + 1
    add eax,ebx             ;eax = sum + square
    lea ebx,[ebx+ecx*2+2+1] ;ebx = square + 2 * (n+1) + 1
    add eax,ebx             ;eax = sum + square
    lea ebx,[ebx+ecx*2+4+1] ;ebx = square + 2 * (n+2) + 1
    add eax,ebx             ;eax = sum + square

; Loop end

    add ecx,3
    cmp ecx,100-1
    jb .next

; Add the 100th square after the loop (ecx = 99 here)

    lea ebx,[ebx+ecx*2+1]   ;ebx = 99*99 + 2*99 + 1 = 100*100
    add eax,ebx             ;eax = sum + square

This allows you to use maths to cheat more! Specifically:

nextSquare = (n+1)*(n+1)
           = (n * n) + (n * 1) + (1 * n) + 1
           = (n * n) + 2 * n + 1
           = lastSquare + 2 * n + 1

nextNextSquare = (n+2)*(n+2)
               = (n * n) + (n * 2) + (2 * n) + 2*2
               = (n * n) + 4 * n + 4
               = lastSquare + 4 * n + 4

nextNextNextSquare = (n+3)*(n+3)
                   = (n * n) + (n * 3) + (3 * n) + 3*3
                   = (n * n) + 6 * n + 9
                   = lastSquare + 6 * n + 9

nextAddend = nextSquare  + nextNextSquare + nextNextNextSquare
           = lastSquare + 2 * n + 1 + lastSquare + 4 * n + 4 + lastSquare + 6 * n + 9
           = 3 * lastSquare + 12 * n + 14
           = 3 * (lastSquare + 4 * n) + 14

This leads to something like (NASM syntax, untested):

; Loop preparation

    xor ebx,ebx           ;ebx = square = 0
    xor eax,eax           ;eax = sum = 0
    xor ecx,ecx           ;ecx = n = 1-1

; Loop

.next:
    lea edx,[ebx+ecx*4]      ;edx = square + 4 * n
    lea edx,[edx*2+edx+14]   ;edx = 3 * (square + 4 * n) + 14
    add eax,edx              ;eax = sum + nextSquare + nextNextSquare + nextNextNextSquare
    lea edx,[ecx*2+ecx]      ;edx = 3 * n
    lea ebx,[ebx+2*edx+9]    ;ebx = lastSquare + 6 * n + 9 = nextNextNextSquare

; Loop end

    add ecx,3
    cmp ecx,100-1
    jb .next

; Add the 100th square after the loop (ecx = 99 here)

    lea ebx,[ebx+ecx*2+1]    ;ebx = 99*99 + 2*99 + 1 = 100*100
    add eax,ebx              ;eax = sum + square

Of course, by unrolling the loop more you get to cheat more; and if you keep doing that and optimizing, you end up at "fully unrolled" (no loop at all), where the code becomes a single "mov eax,precalculated_constant" instruction.

Brendan
  • `lea ebx,[ebx+ecx*2+1]` is efficient for total uop count, but it has 3-cycle *latency* (on Intel CPUs because it has 3 "components", 2 pluses) so in this loop it would actually be faster to mov/imul/add, with only a 1-cycle loop carried dependency chain. Or `lea edx, [NOSPLIT ecx*2+1]` / `add ebx, edx` would be a very good compromise. – Peter Cordes Oct 29 '20 at 05:46
  • Or we could do `lea ebx,[ebx+ecx*2]` inside the loop and `add ebx, 100` after the loop, but then we lose the running-total. When unrolling, you can roll the `+1 +3 +5` into one of the 3 LEAs to amortize that extra latency... (lea with `[ebx + ecx*2]` has extra latency on AMD because of the scaled index. And doing it this way, it *is* part of the loop-carried dependency chain. :/) – Peter Cordes Oct 29 '20 at 05:48
  • If you really want to cheat, https://byjus.com/maths/sum-of-squares/ shows that there's a closed form `Σn^2 = [n(n+1)(2n+1)]/6`. For most `n` where the final result doesn't overflow, the intermediate result (before dividing by 6) doesn't overflow either. – Peter Cordes Oct 29 '20 at 05:53
  • @PeterCordes: If you do a weighted sum for all 32-bit 80x86 CPUs (from 80386 until now, including AMD, VIA, Vortex, Cyrix, etc), 3-component LEA is clearly better than MUL alone (sans add) for any sane set of weights. If you ignore everything except recent desktop/server CPUs from Intel (excluding Atom, Knight's Landing, etc) the MUL is still never better (e.g. both 3 cycles latency for Coffee Lake) and you'd still prefer 3-component LEA (using "assumed nano-joule cost" as tie-breaker). For splitting (e.g. LEA then ADD) the set of weights matters a lot more (e.g. bad on Zen2). – Brendan Oct 29 '20 at 07:33
  • Weighting by current installed-base? I'm not so sure; there are a *lot* of Sandybridge-family CPUs, and a lot of the rest is Core2 / Nehalem / Zen: all of them have 1-cycle throughput imul. The key point is that the `imul` *latency* is off the critical path; it's independent work that "forks" off of a 1-cycle `inc` or `dec` chain, so throughput is what matters. You only get long loop-carried dependencies with this strength-reduction that calculates the next based on the previous, instead of just fully multiplying. `mov edx, ecx` / `imul edx,edx` / `add eax, edx` only the add is lat crit-path – Peter Cordes Oct 29 '20 at 07:43
  • @PeterCordes: I mostly don't bother doing this (too hard to estimate a set of weights that accurately reflects end users, so I go with "gut feeling based on a model of discrete electronics"). However; if you think about the life-cycle of most software it's not hard to conclude that the CPUs with the largest weights should be CPUs that don't exist yet (that most users will be using 5 years after you've released the software and moved to a different project). The `mul`/`imul` latency is "off the critical path briefly" until your EU/s are saturated (but only on CPUs where you'd use SIMD instead). – Brendan Oct 29 '20 at 07:58
  • When your EUs are saturated, it's `imul r,r` *throughput* that becomes part of the critical path, not its latency. Out-of-order exec hides the latency of the fully-pipelined multiplier producing results to feed an `add` dep chain. No currently-sold x86 CPUs do in-order execution, and all of them (including low-power ones) have good imul r32,r32 throughput. Even first-gen Silvermont has 1-cycle throughput for that. (And 3c latency). With wider OoO exec in future CPUs (Ice Lake), the "dumb" way with more ILP and lower loop-carried latency gains even more ground. – Peter Cordes Oct 29 '20 at 08:09
  • In practice (if I wasn't just using the closed-form, e.g. if I needed to store the series but didn't want to use SIMD), I'd probably go with an LEA-based version, especially one that keeps the loop-carried latency down to just an `add` by calculating the stuff based on ECX separately from the stuff based on the previous total. Or maybe a lagged version that goes 2 steps, so we can run 2 dep chains, if the math works out nicely for that. Agreed it feels wasteful to use a multiply when you don't have to. – Peter Cordes Oct 29 '20 at 08:13