 mov eax,0
 mov ecx,100 

loop2_start: mov eax,ecx
             add eax,ecx 
             imul eax,eax  ;To act as n^2
             jle loop2_start

loop2_end: call print_int
           call print_nl

I'm trying to find the sum of all the squares from 1 to 100. When I print this, it gives me 40000 when the answer should be 338350. What should I change to make it work? Perhaps my jump instruction?

  • Adding some comments about what you think each instruction does would be very helpful to someone trying to answer this question. Or even just a high level idea of the approach. I expected an `inc` or `dec` somewhere but I don't see it so I have no idea about the algorithm you're using – lakshayg Oct 29 '20 at 01:20
  • Just a hint: 200 * 200 is 40,000. That's what your code does. – zx485 Oct 29 '20 at 01:28
  • How would I use the `cmp`? – user14358127 Oct 29 '20 at 01:35
  • Single-step this in a debugger. It just does `(100 + 100)^2` and falls through the JLE the first time, based on FLAGS set by `imul` (OF cleared, SF and ZF undefined (but apparently cleared for [`jle`](https://www.felixcloutier.com/x86/jcc) to be not-taken)). Obviously your loop condition isn't *supposed* to be based on squaring the total, and the mov / add / imul logic isn't right either. – Peter Cordes Oct 29 '20 at 01:40
  • Any advice on where to start fixing? – user14358127 Oct 29 '20 at 01:50
  • Write the logic in C, or pseudocode or whatever, including the loop logic, and translate that to asm. (With asm in mind, you can use a `do{}while()` loop structure). Your current code between `loop2_start` and `jle` is completely broken, throw it out and start over. – Peter Cordes Oct 29 '20 at 01:58
  • See [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) for standard asm loop structure – Peter Cordes Oct 29 '20 at 02:13

1 Answer


Let's start with some maths.

For positive values of n, if:

lastSquare = n*n

Then:

nextSquare = (n+1)*(n+1)
           = (n * n) + (n * 1) + (1 * n) + 1
           = (n * n) + 2 * n + 1
           = lastSquare + 2 * n + 1

In other words (using "C like" pseudo-code) you can do:

    square = 0;
    sum = 0;
    for(n = 1-1; n <= 100-1; n++) {
        square = square + 2 * n + 1;
        sum = sum + square;
    }

For 80x86 assembly, square = square + 2 * n + 1 can be done in a single fast instruction - e.g. lea ebx,[ebx+ecx*2+1].

This gives something like (NASM syntax, untested):

; Loop preparation

    xor ebx,ebx           ;ebx = square = 0
    xor eax,eax           ;eax = sum = 0
    xor ecx,ecx           ;ecx = n = 1-1

; Loop

.next:
    lea ebx,[ebx+ecx*2+1] ;ebx = square + 2 * n + 1
    add eax,ebx           ;eax = sum + square

; Loop end

    inc ecx
    cmp ecx,100-1
    jbe .next             ;loop while n <= 100-1, so the last square added is 100*100

However, the body of the loop is small compared to the loop's overhead, so it'd be fun to unroll the loop a little to reduce that overhead. 99 is odd, but it is a multiple of 3, so you can unroll by 3 for the first 99 squares and add the 100th square after the loop:

; Loop preparation

    xor ebx,ebx           ;ebx = square = 0
    xor eax,eax           ;eax = sum = 0
    xor ecx,ecx           ;ecx = n = 1-1

; Loop

.next:
    lea ebx,[ebx+ecx*2+1]   ;ebx = square + 2 * n + 1
    add eax,ebx             ;eax = sum + square
    lea ebx,[ebx+ecx*2+2+1] ;ebx = square + 2 * (n+1) + 1
    add eax,ebx             ;eax = sum + square
    lea ebx,[ebx+ecx*2+4+1] ;ebx = square + 2 * (n+2) + 1
    add eax,ebx             ;eax = sum + square

; Loop end

    add ecx,3
    cmp ecx,100-1
    jb .next

; Add the 100th square after the loop (ecx = 99 here)

    lea ebx,[ebx+ecx*2+1]   ;ebx = 99*99 + 2*99 + 1 = 100*100
    add eax,ebx             ;eax = sum + square

This allows you to use maths to cheat more! Specifically:

nextSquare = (n+1)*(n+1)
           = (n * n) + (n * 1) + (1 * n) + 1
           = (n * n) + 2 * n + 1
           = lastSquare + 2 * n + 1

nextNextSquare = (n+2)*(n+2)
               = (n * n) + (n * 2) + (2 * n) + 2*2
               = (n * n) + 4 * n + 4
               = lastSquare + 4 * n + 4

nextNextNextSquare = (n+3)*(n+3)
                   = (n * n) + (n * 3) + (3 * n) + 3*3
                   = (n * n) + 6 * n + 9
                   = lastSquare + 6 * n + 9

nextAddend = nextSquare  + nextNextSquare + nextNextNextSquare
           = lastSquare + 2 * n + 1 + lastSquare + 4 * n + 4 + lastSquare + 6 * n + 9
           = 3 * lastSquare + 12 * n + 14
           = 3 * (lastSquare + 4 * n) + 14

This leads to something like (NASM syntax, untested):

; Loop preparation

    xor ebx,ebx           ;ebx = square = 0
    xor eax,eax           ;eax = sum = 0
    xor ecx,ecx           ;ecx = n = 1-1

; Loop

.next:
    lea edx,[ebx+ecx*4]      ;edx = square + 4 * n
    lea edx,[edx*2+edx+14]   ;edx = 3 * (square + 4 * n) + 14
    add eax,edx              ;eax = sum + nextSquare + nextNextSquare + nextNextNextSquare
    lea edx,[ecx*2+ecx]      ;edx = 3 * n
    lea ebx,[ebx+2*edx+9]    ;ebx = lastSquare + 6 * n + 9 = nextNextNextSquare

; Loop end

    add ecx,3
    cmp ecx,100-1
    jb .next

; Add the 100th square after the loop (ecx = 99 here)

    lea ebx,[ebx+ecx*2+1]    ;ebx = 99*99 + 2*99 + 1 = 100*100
    add eax,ebx              ;eax = sum + square

Of course, by unrolling the loop more you get to cheat more; and if you keep doing that and optimizing, you end up at "fully unrolled" (no loop at all), where the code becomes a single "mov eax,precalculated_constant" instruction.

Brendan
  • `lea ebx,[ebx+ecx*2+1]` is efficient for total uop count, but it has 3-cycle *latency* (on Intel CPUs because it has 3 "components", 2 pluses) so in this loop it would actually be faster to mov/imul/add, with only a 1-cycle loop carried dependency chain. Or `lea edx, [NOSPLIT ecx*2+1]` / `add ebx, edx` would be a very good compromise. – Peter Cordes Oct 29 '20 at 05:46
  • Or we could do `lea ebx,[ebx+ecx*2]` inside the loop and `add ebx, 100` after the loop, but then we lose the running-total. When unrolling, you can roll the `+1 +3 +5` into one of the 3 LEAs to amortize that extra latency... (lea with `[ebx + ecx*2]` has extra latency on AMD because of the scaled index. And doing it this way, it *is* part of the loop-carried dependency chain. :/) – Peter Cordes Oct 29 '20 at 05:48
  • If you really want to cheat, https://byjus.com/maths/sum-of-squares/ shows that there's a closed form `Σn^2 = [n(n+1)(2n+1)]/6`. For most `n` where the final result doesn't overflow, the intermediate result (before dividing by 6) doesn't overflow either. – Peter Cordes Oct 29 '20 at 05:53
  • @PeterCordes: If you do a weighted sum for all 32-bit 80x86 CPUs (from 80386 until now, including AMD, VIA, Vortex, Cyrix, etc), 3-component LEA is clearly better than MUL alone (sans add) for any sane set of weights. If you ignore everything except recent desktop/server CPUs from Intel (excluding Atom, Knight's Landing, etc) the MUL is still never better (e.g. both 3 cycles latency for Coffee Lake) and you'd still prefer 3-component LEA (using "assumed nano-joule cost" as tie-breaker). For splitting (e.g. LEA then ADD) the set of weights matters a lot more (e.g. bad on Zen2). – Brendan Oct 29 '20 at 07:33
  • Weighting by current installed-base? I'm not so sure; there are a *lot* of Sandybridge-family CPUs, and a lot of the rest is Core2 / Nehalem / Zen: all of them have 1-cycle throughput imul. The key point is that the `imul` *latency* is off the critical path; it's independent work that "forks" off of a 1-cycle `inc` or `dec` chain, so throughput is what matters. You only get long loop-carried dependencies with this strength-reduction that calculates the next based on the previous, instead of just fully multiplying. `mov edx, ecx` / `imul edx,edx` / `add eax, edx` only the add is lat crit-path – Peter Cordes Oct 29 '20 at 07:43
  • @PeterCordes: I mostly don't bother doing this (too hard to estimate a set of weights that accurately reflects end users, so I go with "gut feeling based on a model of discrete electronics"). However; if you think about the life-cycle of most software it's not hard to conclude that the CPUs with the largest weights should be CPUs that don't exist yet (that most users will be using 5 years after you've released the software and moved to a different project). The `mul`/`imul` latency is "off the critical path briefly" until your EU/s are saturated (but only on CPUs where you'd use SIMD instead). – Brendan Oct 29 '20 at 07:58
  • When your EUs are saturated, it's `imul r,r` *throughput* that becomes part of the critical path, not its latency. Out-of-order exec hides the latency of the fully-pipelined multiplier producing results to feed an `add` dep chain. No currently-sold x86 CPUs do in-order execution, and all of them (including low-power ones) have good imul r32,r32 throughput. Even first-gen Silvermont has 1-cycle throughput for that. (And 3c latency). With wider OoO exec in future CPUs (Ice Lake), the "dumb" way with more ILP and lower loop-carried latency gains even more ground. – Peter Cordes Oct 29 '20 at 08:09
  • In practice (if I wasn't just using the closed-form, e.g. if I needed to store the series but didn't want to use SIMD), I'd probably go with an LEA-based version, especially one that keeps the loop-carried latency down to just an `add` by calculating the stuff based on ECX separately from the stuff based on the previous total. Or maybe a lagged version that goes 2 steps, so we can run 2 dep chains, if the math works out nicely for that. Agreed it feels wasteful to use a multiply when you don't have to. – Peter Cordes Oct 29 '20 at 08:13