0

I want to count the frequency of words within a large string.

The simple single threaded solution looks like this

use hashbrown::HashMap;

fn main() {
    let buffer = String::from("Hello World Hello Rust");
    let mut frequency: HashMap<&str, u32> = HashMap::new();

    for word in buffer.split_whitespace() {
        *frequency.entry(word).or_insert(0) += 1;
    }
}

Then I tried to add some multi threading capabilities and ended up with the code below:

extern crate crossbeam;

use hashbrown::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let buffer = Arc::new(String::from("Hello World Hello Rust"));
    let frequency: Arc<Mutex<HashMap<&str, u32>>> = Arc::new(Mutex::new(HashMap::new()));

    crossbeam::scope(|scope| {
        for _ in 0..1 {
            let buffer = buffer.clone();
            let frequency = frequency.clone();

            scope.spawn(move |_| {
                for word in buffer.split_whitespace() {
                    let mut frequency = frequency.lock().unwrap();
                    *frequency.entry(word).or_insert(0) += 1;
                }
            });
        }
    });
}

The compiler fails with the following message:

error[E0597]: `buffer` does not live long enough
  --> src/main.rs:16:29
   |
13 |             let frequency = frequency.clone();
   |                 --------- lifetime `'1` appears in the type of `frequency`
...
16 |                 for word in buffer.split_whitespace() {
   |                             ^^^^^^ borrowed value does not live long enough
17 |                     let mut frequency = frequency.lock().unwrap();
18 |                     *frequency.entry(word).or_insert(0) += 1;
   |                      --------------------- argument requires that `buffer` is borrowed for `'1`
19 |                 }
20 |             });
   |             - `buffer` dropped here while still borrowed

To simplify the code I removed string chunking and only spawn 1 thread within the for loop.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Oliver
  • 978
  • 6
  • 17
  • 1
    Why is your input string in an `Arc` if you are using crossbeam? Why are you going out of your way to use hashbrown? – Shepmaster Mar 28 '19 at 11:55
  • The problem here is that parallelization doesn't help you at all, because a thread needs to own exclusivly the string. You can't operate with multiple threads on the same string at the same time. You have to split it beforehand. Your examples isn't very well thought. – hellow Mar 28 '19 at 11:56

1 Answers1

2

Scoped threads allow you to borrow something outside the thread and use it inside the thread. They cannot allow you to do the opposite (borrow something inside the thread and let it escape).

buffer.split_whitespace() borrows buffer, which has been moved into the inner closure and is therefore owned by the current thread. Each word is a reference with a lifetime dependent on buffer, which will go out of scope when the thread exits. (The underlying String isn't destroyed, but word can only borrow from the Arc, which has a shorter lifetime. The same thing would be true if you just cloned a String).

Arc and scoped threads are somewhat at odds. Arc is used when you're sharing a thing between threads and you want the thing to be destroyed whenever the last thread exits. You don't generally know or care which thread is the one to destroy it, only that it gets destroyed. On the other hand, scoped threads are used when you do know where a thing should be destroyed, and all the threads that want access to it must definitely exit before that. Because the lifetimes are statically verified with scoped threads, you can use ordinary & references instead of Arc. This goes both for the String and the Mutex.

So let's apply this:

let buffer = String::from("Hello World Hello Rust");
let frequency: Mutex<HashMap<&str, u32>> = Mutex::new(HashMap::new());

crossbeam::scope(|scope| {
    for _ in 0..1 {
        scope.spawn(|_| {
            for word in buffer.split_whitespace() {
                let mut frequency = frequency.lock().unwrap();
                *frequency.entry(word).or_insert(0) += 1;
            }
        });
    }
});

Oh, that was easy. Note there are no moves, no Arcs and no clone()s, and frequency will contain references to buffer, which is presumably what you wanted. For this to work, whatever string chunking approach you use must also borrow from the original str; you can't have a separate String for each thread.

Caveat

I'm unsure exactly how similar your example is meant to be to the original code. The above solution fixes the compile problem, but as Shepmaster points out:

I'd add that the original algorithm is not very efficient as the amount of contention for the HashMap is going to be extreme. It would likely be much more efficient for each thread to have its own HashMap and merge them at the end. [...] Something like this

See also

trent
  • 25,033
  • 7
  • 51
  • 90
  • 1
    I'd add that the original algorithm is not very efficient as the amount of contention for the `HashMap` is going to be extreme. It would likely be much more efficient for each thread to have its own `HashMap` and merge them at the end. See also [Is it possible to share a HashMap between threads without locking the entire HashMap?](https://stackoverflow.com/q/50282619/155423) – Shepmaster Mar 28 '19 at 14:18
  • 3
    [Something like this](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=68e7f0214ca9bcfb90ba4f291e9f1f9b) – Shepmaster Mar 28 '19 at 14:30