Conway's game of life becomes slower after using multi-threads

Question

I am a complete beginner in Rust and I am currently writing this parallel Conway's game of life. The code itself works fine but the problem is that when using multiple threads the program becomes slower (I measure the speed of the program by counting the time the glider moves from the top-left corner to the bottom-right corner). I did some experiments and it became slower and slower as the number of threads increases. I also have a Java version using almost the same algorithm; it works just fine. All I expect is that the Rust version can become at least slightly faster with threads more than one. Can anyone please point out where I did wrong? I am sorry if the code seems unreasonable, as I said I am a complete beginner :-).

main.rs reads the command line arguments and does the board update.

extern crate clap;
extern crate termion;
extern crate chrono;

use std::thread;
use std::sync::Arc;
use chrono::prelude::*;


mod board;
mod config;

use board::Board;
use config::Config;

fn main() {
    let dt1 = Local::now();

    let matches = clap::App::new("conway")
        .arg(clap::Arg::with_name("length")
            .long("length")
            .value_name("LENGTH")
            .help("Set length of the board")
            .takes_value(true))
        .arg(clap::Arg::with_name("threads")
            .long("threads")
            .value_name("THREADS")
            .help("How many threads to update the board")
            .takes_value(true))
        .arg(clap::Arg::with_name("display")
            .long("display")
            .value_name("DISPLAY")
            .help("Display the board or not")
            .takes_value(true))
        .arg(clap::Arg::with_name("delay")
            .long("delay")
            .value_name("MILLISECONDS")
            .help("Delay between the frames in milliseconds")
            .takes_value(true))
        .get_matches();

    let config = Config::from_matches(matches);
    let mut board = Board::new(config.length);
    let mut start: bool = false;
    let mut end: bool = false;
    let mut start_time: DateTime<Local> = Local::now();
    let mut end_time: DateTime<Local>;

    board.initialize_glider();
    loop {
        if config.display == 1 {
            print!("{}{}", termion::clear::All, termion::cursor::Goto(3, 3));
            board_render(&board);
        }

        if board.board[0][1] == 1 && !start {
            start_time = Local::now();
            start = true;
        }
        if board.board[config.length - 1][config.length - 1] == 1 && !end {
            end_time = Local::now();
            println!("{}", end_time - start_time);
            end = true;
        }

        board = board::Board::update(Arc::new(board), config.threads);
        thread::sleep(config.delay);
    }
}

fn board_render(board: &Board) {
    let mut output = String::with_capacity(board.n * (board.n + 1));
    for i in 0..board.n {
        for j in 0..board.n {
            let ch;
            if board.board[i][j] == 0 {
                ch = '░';
            } else {
                ch = '█';
            }
            output.push(ch);
        }
        output.push_str("\n  ");
    }
    print!("{}", output);
}

board.rs is where the algorithm for updating the board with multiple threads exists

use std::sync::{Mutex, Arc};
use std::thread;

pub struct Board {
    pub n: usize,
    pub board: Vec<Vec<i32>>,
}

impl Board {
    pub fn new(n: usize) -> Board {
        let board = vec![vec![0; n]; n];
        Board {
            n,
            board,
        }
    }

    pub fn update(Board: Arc<Self>, t_num: usize) -> Board {
        let new_board = Arc::new(Mutex::new(Board::new(Board.n)));
        let mut workers = Vec::with_capacity(t_num);

        let block_size = Board.n / t_num;
        let mut start = 0;

        for t in 0..t_num {
            let old_board = Board.clone();

            let new_board = Arc::clone(&new_board);
            let mut end = start + block_size;

            if t == t_num - 1 { end = old_board.n; }

            let worker = thread::spawn(move || {
                let mut board = new_board.lock().unwrap();
                for i in start..end {
                    for j in 0..old_board.n {
                        let im = (i + old_board.n - 1) % old_board.n;
                        let ip = (i + 1) % old_board.n;
                        let jm = (j + old_board.n - 1) % old_board.n;
                        let jp = (j + 1) % old_board.n;
                        let sum = old_board.board[im][jm] + old_board.board[im][j]
                            + old_board.board[im][jp] + old_board.board[i][jm] + old_board.board[i][jp]
                            + old_board.board[ip][jm] + old_board.board[ip][j] + old_board.board[ip][jp];

                        if sum == 2 {
                            board.board[i][j] = old_board.board[i][j];
                        } else if sum == 3 {
                            board.board[i][j] = 1;
                        } else {
                            board.board[i][j] = 0;
                        }
                    }
                }
            });

            workers.push(worker);
            start = start + block_size;
        }

        for worker in workers {
            worker.join().unwrap();
        }


        let result = new_board.lock().unwrap();
        let mut board = Board::new(Board.n);
        board.board = result.board.to_vec();

        board
    }

    pub fn initialize_glider(&mut self) -> &mut Board {
        self.board[0][1] = 1;
        self.board[1][2] = 1;
        self.board[2][0] = 1;
        self.board[2][1] = 1;
        self.board[2][2] = 1;

        self
    }
}

You are holding a lock for the whole `loop` so only one thread can perform work at a time in your `update()` method, thus your multi-threading is effectively single threaded. Yet you still pay the cost for the creation/destruction/synchronization of other threads — Svetlin Zarev, Dec 10 '19 at 04:59
It's best to try to create a [mre]. Your example code is not *reproducible* because it lacks the `config.rs` file, and it's not *minimal* because it contains a lot of extraneous code (argument parsing, rendering, etc.) that is not necessary to observe the bad behavior. You could replace the `loop` in `main` with just `for _ in 0..100 { }` and put nothing in it but the `update` call, for example, and define `const`ants for board size and number of threads, instead of parsing them from the command line. — trent, Dec 10 '19 at 12:03
This question was [cross-posted to users.rust-lang.org](https://users.rust-lang.org/t/why-my-conways-game-of-life-becomes-slower-after-using-multi-threads/35545). — trent, Dec 10 '19 at 20:11
Problem solved, I ended up using scoped thread with rayon. Thank you for your help. — Vic2333, Dec 11 '19 at 20:17

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

Each worker thread tries to lock the mutex immediately upon starting, and never releases the lock until it's done. Since only one thread can hold the mutex at a time, only one thread can do work at a time.

Here are two ways you might solve this problem:

Don't lock the mutex until you really, really need to. Create a scratch area inside the worker thread that represents the block you are updating. Fill the scratch area first. Then lock the mutex, copy the contents of the scratch area into the new_board, and return.

Using this method, most of the work can be done concurrently, but if all your workers finish at roughly the same time they will still have to take turns putting it all in new_board.
Don't use a lock at all: change the type of self.board to Vec<Vec<AtomicI32>> (std::sync::atomic::AtomicI32) and atomically update the board without having to acquire a lock.

This method may or may not slow down the process of updating, possibly depending on what memory orderings you use¹, but it eliminates contention for the lock.

Free-range advice

Don't call a variable Board. Convention, which the compiler alerts you of, is to give variables snake case names, but beyond that it is confusing because you also have a type named Board. I suggest actually just calling it self, which also lets you call update with method syntax.
Don't put the whole board in an Arc so you can pass it to update, and then make a new board which has to be put in a new Arc the next iteration. Either make update return an Arc itself, or have it take self and do all the Arc-wrangling inside it.
Better still, don't use Arc at all. Use a crate that provides scoped threads to pass your data to the worker threads by reference.
Allocator performance will generally be better with a few large allocations than with many small ones. Change the type of Board.board to Vec<i32> and use arithmetic to calculate the indexes (for instance, point i, j is at index j*n + i).
It's also better not to create and throw away allocations if you don't need to. Typical advice for cellular automata is to create two buffers that contain board states: the current state and the next state. When you're done creating the next state, just swap the buffers so the current state becomes the next state and vice versa.
i32 wastes space; you could use i8 or an enum, or possibly bool.

¹ I would suggest SeqCst unless you really know what you're doing. I suspect Relaxed is probably sufficient, but I don't really know what I'm doing.

I ended up using scoped threads with rayon, it works now. Thank you for your help! — Vic2333, Dec 11 '19 at 20:17

Conway's game of life becomes slower after using multi-threads

1 Answers1

Free-range advice