Here is something that I find interesting:
pub fn sum_of_squares(n: i32) -> i32 {
let mut sum = 0;
for i in 1..n+1 {
sum += i*i;
}
sum
}
This is the naive implementation of the sum of squared numbers in Rust. This is the assembly code with rustc 1.65.0
with -O3
lea ecx, [rdi + 1]
xor eax, eax
cmp ecx, 2
jl .LBB0_2
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
.LBB0_2:
ret
I was expecting it to use the formula for the sum of squared numbers, but it does not. It uses a magical number 1431655766
which I don't understand at all.
Then I wanted to see what clang and gcc do in C++ for the same function
test edi, edi
jle .L8
lea eax, [rdi-1]
cmp eax, 17
jbe .L9
mov edx, edi
movdqa xmm3, XMMWORD PTR .LC0[rip]
xor eax, eax
pxor xmm1, xmm1
movdqa xmm4, XMMWORD PTR .LC1[rip]
shr edx, 2
.L4:
movdqa xmm0, xmm3
add eax, 1
paddd xmm3, xmm4
movdqa xmm2, xmm0
pmuludq xmm2, xmm0
psrlq xmm0, 32
pmuludq xmm0, xmm0
pshufd xmm2, xmm2, 8
pshufd xmm0, xmm0, 8
punpckldq xmm2, xmm0
paddd xmm1, xmm2
cmp eax, edx
jne .L4
movdqa xmm0, xmm1
mov eax, edi
psrldq xmm0, 8
and eax, -4
paddd xmm1, xmm0
add eax, 1
movdqa xmm0, xmm1
psrldq xmm0, 4
paddd xmm1, xmm0
movd edx, xmm1
test dil, 3
je .L1
.L7:
mov ecx, eax
imul ecx, eax
add eax, 1
add edx, ecx
cmp edi, eax
jge .L7
.L1:
mov eax, edx
ret
.L8:
xor edx, edx
mov eax, edx
ret
.L9:
mov eax, 1
xor edx, edx
jmp .L7
.LC0:
.long 1
.long 2
.long 3
.long 4
.LC1:
.long 4
.long 4
.long 4
.long 4
This is gcc 12.2
with -O3
. GCC also does not use the sum of squared formula. I also don't know why it checks if the number is greater than 17? But for some reason, gcc does make a lot of operations compared to clang and rustc.
This is clang 15.0.0
with -O3
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
lea eax, [rdi - 3]
imul rax, rcx
shr rax
imul eax, eax, 1431655766
shr rcx
lea ecx, [rcx + 4*rcx]
add ecx, eax
lea eax, [rcx + 4*rdi]
add eax, -3
ret
.LBB0_1:
xor eax, eax
ret
I don't really understand what kind of optimization clang is doing there. But rustc, clang, and gcc doesn't like n(n+1)(2n+1)/6
Then I timed their performance. Rust is doing significantly better than gcc and clang. These are the average results of 100 executions. Using 11th gen intel core i7-11800h @ 2.30 GHz
Rust: 0.2 microseconds
Clang: 3 microseconds
gcc: 5 microseconds
Can someone explain the performance difference?
Edit C++:
int sum_of_squares(int n){
int sum = 0;
for(int i = 1; i <= n; i++){
sum += i*i;
}
return sum;
}
EDIT 2 For everyone wondering here is my benchmark code:
use std::time::Instant;
pub fn sum_of_squares(n: i32) -> i32 {
let mut sum = 0;
for i in 1..n+1 {
sum += i*i;
}
sum
}
fn main() {
let start = Instant::now();
let result = sum_of_squares(1000);
let elapsed = start.elapsed();
println!("Result: {}", result);
println!("Elapsed time: {:?}", elapsed);
}
And in C++:
#include <chrono>
#include <iostream>
int sum_of_squares(int n){
int sum = 0;
for(int i = 1; i <= n; i++){
sum += i*i;
}
return sum;
}
int main() {
auto start = std::chrono::high_resolution_clock::now();
int result = sum_of_squares(1000);
auto end = std::chrono::high_resolution_clock::now();
std::cout << "Result: " << result << std::endl;
std::cout << "Elapsed time: "
<< std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
<< " microseconds" << std::endl;
return 0;
}