Rust code is extremely slow. Tips on speeding up

Question

I am facing some challenges with optimizing my Rust code for a standard inertia weight global best particle swarm algorithm, specifically for the well-known sphere benchmark function. As a newcomer to Rust, I understand that my code may not be the most efficient. However, I have hit a wall with speeding it up without resorting to parallelism. It's worth noting that I cannot assume the dimension and swarm size at compile time, hence my use of vectors. I am aware that my code utilizes clone in a hot loop, but I am at a loss for alternatives. Could someone kindly offer pointers on how to increase the speed of my code? I would greatly appreciate any tips or advice. It is currently running 3x slower than an equivalent program I wrote in Python.

Thank you!

Here is the code:

use std::fmt::Display;

use rand::Rng;

struct ObjectiveFunctionStruct {
    name: String,
    function: fn(&Vec<f64>) -> f64,
    lower_bound: Vec<f64>,
    upper_bound: Vec<f64>,
}

struct Particle {
    position: Vec<f64>,
    velocity: Vec<f64>,
    personal_best_position: Vec<f64>,
    personal_best_fitness: f64,
}

impl Display for Particle {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "(Position: {:?}, Velocity: {:?}, Personal Best Position: {:?}, Personal Best Fitness: {})", self.position, self.velocity, self.personal_best_position, self.personal_best_fitness)
    }
}

struct Swarm {
    w: f64,
    c1: f64,
    c2: f64,
    particles: Vec<Particle>,
    global_best_position: Vec<f64>,
    global_best_fitness: f64,
    objective_function_struct: ObjectiveFunctionStruct,
    rng_thread: rand::rngs::ThreadRng,
}

impl Swarm {
    fn new(w: f64, c1: f64, c2: f64, swarm_size: usize, objective_function_struct: ObjectiveFunctionStruct) -> Swarm {
        let dimension: usize = objective_function_struct.lower_bound.len();
        let mut particles: Vec<Particle> = Vec::new();
        let mut rng_thread: rand::rngs::ThreadRng = rand::thread_rng();
        let mut global_best_position: Vec<f64> = vec![0.0; dimension];
        let mut global_best_fitness: f64 = std::f64::MAX;
        for _ in 0..swarm_size {
            let mut particle: Particle = Particle {
                position: (0..dimension).map(|i| rng_thread.gen_range(objective_function_struct.lower_bound[i]..objective_function_struct.upper_bound[i])).collect(),
                velocity: vec![0.0; dimension],
                personal_best_position: vec![0.0; dimension],
                personal_best_fitness: std::f64::MAX,
            };
            particle.personal_best_position = particle.position.clone();
            particle.personal_best_fitness = (objective_function_struct.function)(&particle.position);
            if particle.personal_best_fitness < global_best_fitness {
                global_best_fitness = particle.personal_best_fitness;
                global_best_position = particle.personal_best_position.clone();
            }
            particles.push(particle);
        }
        Swarm {
            w,
            c1,
            c2,
            particles,
            global_best_position,
            global_best_fitness,
            objective_function_struct,
            rng_thread,
        }
    }

    fn update_particles(&mut self) {
        for particle in &mut self.particles {
            let dimension: usize = particle.position.len();
            for i in 0..dimension {
                particle.velocity[i] = self.w * particle.velocity[i] + self.c1 * self.rng_thread.gen_range(0.0..1.0) * (particle.personal_best_position[i] - particle.position[i]) + self.c2 * self.rng_thread.gen_range(0.0..1.0) * (self.global_best_position[i] - particle.position[i]);
                particle.position[i] += particle.velocity[i];
            }
            let fitness: f64 = (self.objective_function_struct.function)(&particle.position);
            if fitness < self.global_best_fitness {
                particle.personal_best_fitness = fitness;
                particle.personal_best_position = particle.position.clone();
                self.global_best_fitness = fitness;
                self.global_best_position = particle.position.clone();
            } else if fitness < particle.personal_best_fitness {
                particle.personal_best_fitness = fitness;
                particle.personal_best_position = particle.position.clone();
            }
        }
    }


    fn run(&mut self, iterations: usize) {
        for _ in 0..iterations {
            self.update_particles();
        }
    }

    fn print(&self) {
        println!("Global Best Position: {:?}", self.global_best_position);
        println!("Global Best Fitness: {}", self.global_best_fitness);
    }
}

fn sphere_function(x: &Vec<f64>) -> f64 {
    x.iter().map(|a: &f64| a.powi(2)).sum()
}

fn main() {
    use std::time::Instant;
    let now = Instant::now();
    let dim = 100;
    let objective_function_struct: ObjectiveFunctionStruct = ObjectiveFunctionStruct {
        name: "Sphere Function".to_string(),
        function: sphere_function,
        lower_bound: vec![-5.12; dim],
        upper_bound: vec![5.12; dim],
    };
    let mut swarm: Swarm = Swarm::new(0.729, 1.49445, 1.49445, 1000, objective_function_struct);
    swarm.run(10000);
    swarm.print();
    let elapsed = now.elapsed();
    println!("Elapsed: {} ms", elapsed.as_millis());
}

That’s a lot of code in a single block. Could you try to break it apart and explain where the actual working code is? — Fogmeister, Apr 30 '23 at 06:13
The code all works. It does what it is supposed to do, but it is slow. — , Apr 30 '23 at 06:14
I get that. But it’s just a lot of code to read through to try and find the bit that might be slow. It would be good to perhaps pull the “update” function out into a separate code block as that appears to possibly be the bit that is the most computationally intense. — Fogmeister, Apr 30 '23 at 06:16
Did you by chance compile in debug (the default) rather than release mode (`cargo build --release`)? — cafce25, Apr 30 '23 at 06:18
If you remove the call to `update_particles` does it run at a fast speed? Is the `update_particles` function the bit that is causing the issue? — Fogmeister, Apr 30 '23 at 06:19
Also see [Why is it discouraged to accept a reference &String, &Vec, or &Box as a function argument?](https://stackoverflow.com/questions/40006219/why-is-it-discouraged-to-accept-a-reference-string-vec-or-box-as-a-function) — cafce25, Apr 30 '23 at 06:20
@Fogmeister I am certain it will run significanty faster, because the `update_particles` function is the only function that is called in the `run` loop. `update_particles` contains all of the necessary logic for the optimization problem. However, I am unsure whether `update_particles` is necessarily the problem, or rather datatypes or other code that I am using which is causing the code to slow down. — , Apr 30 '23 at 06:21
`cargo flamegraph` shows that more than half of the time is spent in `gen_range` so you're likely testing the speed of random number generators, the default in `rand` isn't particularly optimized for speed but for high quality random numbers instead. But you did not include the python version so all of this is just speculation. — cafce25, Apr 30 '23 at 06:38
Swapping in SmallRng does result in a ~5x performance improvement. The undisclosed Python version likely uses the default rng, which is a mersenne twister, a non-CS PRNG. — Masklinn, Apr 30 '23 at 08:10
@Masklinn Yes they do. A PSO is a type of metaheuristic. Please read up about them before questioning the validity of the algorithm. Here is a link to a textbook on them: https://www.amazon.co.uk/Fundamentals-Computational-Intelligence-Andries-Engelbrecht/dp/0470091916 — , Apr 30 '23 at 09:15
You seem to be rather confused, this is not the "answers" section it's the "comments" section where it's possible to ask for precision or wonder about possible issues, as you did not provide the reference python code it was not out of the question that there would be transcription mistakes and the conversion is flawed, as the terms are used in a somewhat unusual manner and 2D/3D simulation questions are extremely common here, much more so than n-dimensional stuff. — Masklinn, Apr 30 '23 at 10:56

score 2 · Accepted Answer · answered Apr 30 '23 at 07:16

While optimizing global fitness, you clone vectors on each iteration. I would take a reference, then clone it once after the loop.
You may need random number generator tuning, as the default implementation may be secure but slow.
Data locality is significant. Lines of Vec<f64> are sparse in memory, so the CPU invalidates its L1/L2/L3 caches too often. You may allocate a single contiguous array for all particles. Also, switching from f64 to f32 serves the same purpose.
Parallelize computations either with SIMD or rayon.
Follow guidelines from the Rust Performance Book.

Rust code is extremely slow. Tips on speeding up

1 Answers1