3

I've got a piece of code which is supposed to check if two sentences are "too similar", as defined by a heuristic made clearest by the code.

fn too_similar(thing1: &String, thing2: &String) -> bool {
    let split1 = thing1.split_whitespace();
    let split2 = thing2.split_whitespace();

    let mut matches = 0;
    for s1 in split1 {
        for s2 in split2 {
            if s1.eq(s2) {
                matches = matches + 1;
                break;
            }
        }
    }

    let longer_length =
        if thing1.len() > thing2.len() {
            thing1.len()
        } else {
            thing2.len()
        };

    matches > longer_length / 2
}

However, I'm getting the following compilation error:

error[E0382]: use of moved value: `split2`
 --> src/main.rs:7:19
  |
7 |         for s2 in split2 {
  |                   ^^^^^^ value moved here in previous iteration of loop
  |
  = note: move occurs because `split2` has type `std::str::SplitWhitespace<'_>`, which does not implement the `Copy` trait

I'm not sure why split2 is getting moved in the first place, but what's the Rust way of writing this function?

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
River Tam
  • 3,096
  • 4
  • 31
  • 51

2 Answers2

5

split2 is getting moved because iterating with for consumes the iterator and since the type does not implement Copy, Rust isn't copying it implicitly.

You can fix this by creating a new iterator inside the first for:

let split1 = thing1.split_whitespace();

let mut matches = 0;
for s1 in split1 {
    for s2 in thing2.split_whitespace() {
        if s1.eq(s2) {
            matches = matches + 1;
            break;
        }
    }
}
...

You can also rewrite the matches counting loop using some higher order functions available in the Iterator trait:

let matches = thing1.split_whitespace()
    .flat_map(|c1| thing2.split_whitespace().filter(move |&c2| c1 == c2))
    .count();

longer_length can also be written as:

let longer_length = std::cmp::max(thing1.len(), thing2.len());
Dogbert
  • 212,659
  • 41
  • 396
  • 397
  • Thank you for the response and a way to get around this. Shouldn't `SplitWhitespace` be copyable, then? Seems inefficient to have to split on each word. – River Tam Nov 06 '16 at 18:06
  • 2
    https://github.com/rust-lang/rust/issues/18045 explains why `SplitWhitespace` doesn't implement `Copy`. Also, this won't be any slower than if `SplitWhitespace` implemented `Copy` since iterators are lazy and creating a `SplitWhitespace` struct has negligible overhead compared to the time taken to iterate through the splits. – Dogbert Nov 06 '16 at 18:11
  • 1
    It is surprising it doesn't implement `Clone` though. – Shepmaster Nov 06 '16 at 18:45
2

There are possibly some better ways to do the word comparison.

If the phrases are long, then iterating over thing2's words for every word in thing1 is not very efficient. If you don't have to worry about words which appear more than once, then HashSet may help, and boils the iteration down to something like:

let words1: HashSet<&str> = thing1.split_whitespace().collect();
let words2: HashSet<&str> = thing2.split_whitespace().collect();
let matches = words1.intersection(&words2).count();

If you do care about repeated words you probably need a HashMap, and something like:

let mut words_hash1: HashMap<&str, usize> = HashMap::new();
for word in thing1.split_whitespace() {
    *words_hash1.entry(word).or_insert(0) += 1;
}
let matches2: usize = thing2.split_whitespace()
                     .map(|s| words_hash1.get(s).cloned().unwrap_or(0))
                     .sum();
Chris Emerson
  • 13,041
  • 3
  • 44
  • 66