Compiler optimization for sum of squared numbers

Question

Here is something that I find interesting:

pub fn sum_of_squares(n: i32) -> i32 {
    let mut sum = 0;
    for i in 1..n+1 {
        sum += i*i;
    }
    sum
}

This is the naive implementation of the sum of squared numbers in Rust. This is the assembly code with rustc 1.65.0 with -O3

        lea     ecx, [rdi + 1]
        xor     eax, eax
        cmp     ecx, 2
        jl      .LBB0_2
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        shr     rcx
        lea     ecx, [rcx + 4*rcx]
        add     ecx, eax
        lea     eax, [rcx + 4*rdi]
        add     eax, -3
.LBB0_2:
        ret

I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766 which I don't understand at all.

Then I wanted to see what clang and gcc do in C++ for the same function

        test    edi, edi
        jle     .L8
        lea     eax, [rdi-1]
        cmp     eax, 17
        jbe     .L9
        mov     edx, edi
        movdqa  xmm3, XMMWORD PTR .LC0[rip]
        xor     eax, eax
        pxor    xmm1, xmm1
        movdqa  xmm4, XMMWORD PTR .LC1[rip]
        shr     edx, 2
.L4:
        movdqa  xmm0, xmm3
        add     eax, 1
        paddd   xmm3, xmm4
        movdqa  xmm2, xmm0
        pmuludq xmm2, xmm0
        psrlq   xmm0, 32
        pmuludq xmm0, xmm0
        pshufd  xmm2, xmm2, 8
        pshufd  xmm0, xmm0, 8
        punpckldq       xmm2, xmm0
        paddd   xmm1, xmm2
        cmp     eax, edx
        jne     .L4
        movdqa  xmm0, xmm1
        mov     eax, edi
        psrldq  xmm0, 8
        and     eax, -4
        paddd   xmm1, xmm0
        add     eax, 1
        movdqa  xmm0, xmm1
        psrldq  xmm0, 4
        paddd   xmm1, xmm0
        movd    edx, xmm1
        test    dil, 3
        je      .L1
.L7:
        mov     ecx, eax
        imul    ecx, eax
        add     eax, 1
        add     edx, ecx
        cmp     edi, eax
        jge     .L7
.L1:
        mov     eax, edx
        ret
.L8:
        xor     edx, edx
        mov     eax, edx
        ret
.L9:
        mov     eax, 1
        xor     edx, edx
        jmp     .L7
.LC0:
        .long   1
        .long   2
        .long   3
        .long   4
.LC1:
        .long   4
        .long   4
        .long   4
        .long   4

This is gcc 12.2 with -O3. GCC also does not use the sum of squared formula. I also don't know why it checks if the number is greater than 17? But for some reason, gcc does make a lot of operations compared to clang and rustc.

This is clang 15.0.0 with -O3

    test    edi, edi
    jle     .LBB0_1
    lea     eax, [rdi - 1]
    lea     ecx, [rdi - 2]
    imul    rcx, rax
    lea     eax, [rdi - 3]
    imul    rax, rcx
    shr     rax
    imul    eax, eax, 1431655766
    shr     rcx
    lea     ecx, [rcx + 4*rcx]
    add     ecx, eax
    lea     eax, [rcx + 4*rdi]
    add     eax, -3
    ret
.LBB0_1:
        xor     eax, eax
        ret

I don't really understand what kind of optimization clang is doing there. But rustc, clang, and gcc doesn't like n(n+1)(2n+1)/6

Then I timed their performance. Rust is doing significantly better than gcc and clang. These are the average results of 100 executions. Using 11th gen intel core i7-11800h @ 2.30 GHz

Rust: 0.2 microseconds
Clang: 3 microseconds
gcc: 5 microseconds

Can someone explain the performance difference?

Edit C++:

int sum_of_squares(int n){
    int sum = 0;
    for(int i = 1; i <= n; i++){
        sum += i*i;
    }
    return sum;
}

EDIT 2 For everyone wondering here is my benchmark code:

use std::time::Instant;
pub fn sum_of_squares(n: i32) -> i32 {
    let mut sum = 0;
    for i in 1..n+1 {
        sum += i*i;
    }
    sum
}

fn main() {
    let start = Instant::now();
    let result = sum_of_squares(1000);
    let elapsed = start.elapsed();

    println!("Result: {}", result);
    println!("Elapsed time: {:?}", elapsed);
}

And in C++:

#include <chrono>
#include <iostream>

int sum_of_squares(int n){
    int sum = 0;
    for(int i = 1; i <= n; i++){
        sum += i*i;
    }
    return sum;
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    int result = sum_of_squares(1000);
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << "Result: " << result << std::endl;
    std::cout << "Elapsed time: "
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " microseconds" << std::endl;

    return 0;
}

@merovingian it not compiles though https://godbolt.org/z/bjdx6n6x6 — apple apple, Dec 27 '22 at 21:05
[Same code](https://godbolt.org/z/9sEqqb17e)? You posted code that isn't C++. — PaulMcKenzie, Dec 27 '22 at 21:05
I don't think rust assembly is correct for all `n`. It looks like it was optimized for one particular value. If you make C++ function `constexpr` it can be optimezed out completly. — sklott, Dec 27 '22 at 21:08
I got the same result for different and large N's in rust and C++. I don't think it's wrong @sklott — merovingian, Dec 27 '22 at 21:12
I understand but isn't the main issue assembly code? I didn't wanna post every single detail @PaulMcKenzie — merovingian, Dec 27 '22 at 21:13
@merovingian benchmark is hard, it's not unlikely you can do it wrong. so include the benchmark code is necessary. — apple apple, Dec 27 '22 at 21:17
Closely related: https://stackoverflow.com/questions/74417624/how-does-clang-generate-non-looping-code-for-sum-of-squares/74419103#74419103 — Nate Eldredge, Dec 27 '22 at 21:17
@PaulMcKenzie there is no need for a `main` function, I get similar code for just a library function see [compiler explorer](https://rust.godbolt.org/z/6nWocbzdr) — cafce25, Dec 27 '22 at 21:18
I might be wrong in benchmarking by some margin. But I am not wrong about the difference between assembly codes. — merovingian, Dec 27 '22 at 21:18
@merovingian so you're telling you can see the performance from assembly directly? Then add those to the question and ask what you're in doubt. — apple apple, Dec 27 '22 at 21:19
The performance is just one observation, not what the question is about if I read it correctly. — cafce25, Dec 27 '22 at 21:20
It's obvious from the assembly that the C++ versions have loops and the rust version doesn't! — TrentP, Dec 27 '22 at 21:20
You've got `i64` in the benchmark version of the Rust function. I'm not sure what Rust's rules on integer overflow are, but it matters for C++, where unsigned overflow is precisely defined. So the compiler must generate the "correct" value for unsigned overflows, even if the code isn't meant to work correctly when that happens. — TrentP, Dec 27 '22 at 21:26
So it seems to me like this really is an exact duplicate of https://stackoverflow.com/questions/74417624/how-does-clang-generate-non-looping-code-for-sum-of-squares. Do you want this question closed? Or are there other specific things you want to know? — Nate Eldredge, Dec 27 '22 at 21:26
Because I was changing the value of n all the time. I should probably change int to long long int — merovingian, Dec 27 '22 at 21:28
It actually doesn't answer my question. It only explains why clang does what it does @NateEldredge — merovingian, Dec 27 '22 at 21:31
So the only question stated in your post is "Can someone explain the performance difference?" Is that what you are asking? First of all, were you also using the unoptimized code for clang? I bet if you rerun the test after fixing that, you will find very similar results for rust and clang. (But 100 reps is way too few for meaningful timing, make it 10 million or something. And make sure the return value is used in some essential way or the computation may get optimized out completely.) — Nate Eldredge, Dec 27 '22 at 21:35
So once you are seeing similar results between rust and clang, the remaining question is why gcc is slower, and the answer is in front of you: it doesn't do the algebra optimization, and actually executes a loop to compute and sum all the squares. — Nate Eldredge, Dec 27 '22 at 21:36

Iwa · Answer 1 · 2022-12-27T21:48:15.080

8

I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766 which I don't understand at all.

LLVM does transform that loop into a formula but its different from naive square sum formula.

This article explains formula and generated code better than I could.

edited Dec 27 '22 at 21:48

answered Dec 27 '22 at 21:42

Iwa

512
1
5
18

Jérôme Richard · Answer 2 · 2022-12-27T22:19:36.877

Clang does the same optimization using -O3 in C++ but not GCC yet. See on GodBolt. AFAIK, the default Rust compiler use LLVM internally like Clang. This is why they produce a similar code. GCC use a naive loop vectorized using SIMD instructions while Clang uses a formula like the one you gave in the question.

The optimized assembly code from the C++ code is the following:

sum_of_squares(int):                    # @sum_of_squares(int)
        test    edi, edi
        jle     .LBB0_1
        lea     eax, [rdi - 1]
        lea     ecx, [rdi - 2]
        imul    rcx, rax
        lea     eax, [rdi - 3]
        imul    rax, rcx
        shr     rax
        imul    eax, eax, 1431655766
        shr     rcx
        lea     ecx, [rcx + 4*rcx]
        add     ecx, eax
        lea     eax, [rcx + 4*rdi]
        add     eax, -3
        ret
.LBB0_1:
        xor     eax, eax
        ret

This optimization mainly comes from the IndVarSimplify optimization pass. On can see that some variables are encoded on 32 bits while some others are encoded on 33 bits (requiring a 64 bit register on mainstream platforms). The code basically does:

if(edi == 0)
    return 0;
eax = rdi - 1;
ecx = rdi - 2;
rcx *= rax;
eax = rdi - 3;
rax *= rcx;
rax >>= 1;
eax *= 1431655766;
rcx >>= 1;
ecx = rcx + 4*rcx;
ecx += eax;
eax = rcx + 4*rdi;
eax -= 3;
return eax;

This can be further simplified to the following equivalent C++ code:

if(n == 0)
    return 0;
int64_t m = n;
int64_t tmp = ((m - 3) * (m - 1) * (m - 2)) / 2;
tmp = int32_t(int32_t(tmp) * int32_t(1431655766));
return 5 * ((m - 1) * (m - 2) / 2) + tmp + (4*m - 3);

Note that some casts and overflows are ignored for sake of clarity.

The magical number 1431655766 comes from a kind of correction from an overflow related to a division by 3. Indeed, 1431655766 / 2**32 ~= 0.33333333348855376. Clang plays with the 32-bit overflows so to generate a fast implementation of the formula n(n+1)(2n+1)/6.

Very good explanation thanks a lot. One more thing, do you know why GCC checks if the number is greater than 17? — merovingian, Dec 28 '22 at 10:00

score 4 · Answer 3 · answered Dec 27 '22 at 21:21

4

Division by a constant c on a machine with 128 bit product is often implemented by multiplying by 2^64 / c. That’s where your strange constant comes from.

Now the formula n(n+1)(2n+1) / 6 will overflow for large n, while the sum won’t, so this formula can only be used very, very carefully.

answered Dec 27 '22 at 21:21

gnasher729

51,477
5
75
98

3

There is no 128-bit product (nor 64-bit) in the code though, it's only a 32-bit product (like multiplying two `uint32_t`s by each other), the "upper half" of the product is not produced. So it shouldn't be a multiplication by a fixed-point reciprocal. But 1431655766 also has the property that 1431655766 * 3 = 2 (mod 2^32) – harold Dec 27 '22 at 21:35

Compiler optimization for sum of squared numbers

3 Answers3