6

I am learning strings in Rust. I want to implement function to calculate the longes common prefix among list of strings.

My code

impl Solution {
    pub fn get_common_prefix(s1: &String, s2: &String) -> String {
        let mut idx: usize = 0;
        if s1.len() > s2.len() {
            std::mem::swap(&mut s1, &mut s2);
        }
        
        while idx < s1.len() && s1.chars().nth(idx) == s2.chars().nth(idx) {
            idx += 1;    
        }
        return s1[0..idx].to_string();
        
        
    }
    pub fn longest_common_prefix(mut strs: Vec<String>) -> String {
        strs.sort();
        let mut longest_pref = strs[0];
        for i in 0..strs.len() {
            longest_pref = Self::get_common_prefix(&longest_pref, &strs[i]);
        }
        return longest_pref;
    }
}

I have an error. Could you please help to fix it?

Line 5, Char 37: lifetime mismatch (solution.rs)
  |
2 |     pub fn get_common_prefix(s1: &String, s2: &String) -> String {
  |                                  -------      ------- these two types are declared with different lifetimes...
...
5 |             std::mem::swap(&mut s1, &mut s2);
  |                                     ^^^^^^^ ...but data from `s2` flows into `s1` here

I was reading about lifetimes here https://doc.rust-lang.org/rust-by-example/scope/lifetime.html but didn't succeeded

mascai
  • 1,373
  • 1
  • 9
  • 30
  • 1
    Not an answer, but you should take `&str` arguments instead of `&String`. See: https://stackoverflow.com/q/40006219/5397009 – Jmb May 15 '23 at 06:35
  • 1
    To solve your problem, simply tell the compiler that both variables have the same lifetime: `fn get_common_prefix<'a> (s1: &'a String, s2: 'a String) -> String` – Jmb May 15 '23 at 06:36

2 Answers2

12

If you tried to explicit the implicit lifetimes, you would get the following signature for get_common_prefix:

fn get_common_prefix<'a, 'b>(s1: &'a String, s2: &'b String) -> String { 
    ...
}

In particular, you can't swap these two values, because both borrows don't last as long as each other. Instead, you could do

fn get_common_prefix<'a>(s1: &'a String, s2: &'a String) -> String { 
    ...
}

Doing so will solve the issue, but will also throw a lot of other errors, because there are other issues with your code. Let's go through them one by one.

First of all, it will now complain that std::mem::swap(&mut s1, &mut s2); is illegal because

error[E0596]: cannot borrow `s1` as mutable, as it is not declared as mutable
 --> src/main.rs:4:24
  |
4 |         std::mem::swap(&mut s1, &mut s2);
  |                        ^^^^^^^ cannot borrow as mutable
  |
help: consider changing this to be mutable
  |
1 | pub fn get_common_prefix<'a>(mut s1: &'a String, s2: &'a String) -> String {
  |                              +++

error[E0596]: cannot borrow `s2` as mutable, as it is not declared as mutable
 --> src/main.rs:4:33
  |
4 |         std::mem::swap(&mut s1, &mut s2);
  |                                 ^^^^^^^ cannot borrow as mutable
  |
help: consider changing this to be mutable
  |
1 | pub fn get_common_prefix<'a>(s1: &'a String, mut s2: &'a String) -> String {
  |                                              +++

However, Rust is quite nice and tells you what to do in this case, declare both s1 and s2 as mutable:

fn get_common_prefix<'a>(mut s1: &'a String, mut s2: &'a String) -> String { 
    ...
}

The function get_common_prefix is now correct, but there is still an error in longest_common_prefix:

error[E0507]: cannot move out of index of `Vec<String>`
  --> src/main.rs:16:28
   |
16 |     let mut longest_pref = strs[0];
   |                            ^^^^^^^ move occurs because value has type `String`, which does not implement the `Copy` trait
   |
help: consider borrowing here
   |
16 |     let mut longest_pref = &strs[0];
   |                            +

The problem is that you are taking strs[0] from strs, but not removing it, which is illegal, because the first string would now be owned twice (once by longest_pref, and once by strs).

One solution is to actually take strs[0], with swap_remove (which is very efficient at removing elements from a vector, assuming we don't care about the order of the elements):

pub fn longest_common_prefix(mut strs: Vec<String>) -> String {
    strs.sort();
    let mut longest_pref = strs.swap_remove(0);
    for i in 1..strs.len() {
        longest_pref = get_common_prefix(&longest_pref, &strs[i]);
    }
    return longest_pref;
}

This works, but... it's quite inefficient, for several reasons. First of all, even though it's not a terrible performance hit, it's almost always wrong to have a &String in a function signature, because all you can do with it (actually, Rust will do it for you so maybe you don't realize it) is to Deref it to a &str, which is basically the same with one less indirection (because you can see String as a pointer to str). So we should directly write that instead

fn get_common_prefix<'a>(s1: &'a str, s2: &'a str) -> String { 
    ...
}

Furthermore, there is no point in returning a String, which requires an allocation, when we can just return a slice over that string (because we are taking substrings):

pub fn get_common_prefix<'a>(mut s1: &'a str, mut s2: &'a str) -> &'a str {
    let mut idx: usize = 0;
    if s1.len() > s2.len() {
        std::mem::swap(&mut s1, &mut s2);
    }
    
    while idx < s1.len() && s1.chars().nth(idx) == s2.chars().nth(idx) {
        idx += 1;    
    }
    return &s1[0..idx]
}

Notice that I changed both the return type and the last line.

Now, for this to work, we also need to adapt longest_common_prefix to work with string slices:

pub fn longest_common_prefix(mut strs: Vec<String>) -> String {
    strs.sort();
    let mut longest_pref: &str = &strs[0];
    for i in 1..strs.len() {
        longest_pref = get_common_prefix(&longest_pref, &strs[i]);
    }
    return longest_pref.to_string();
}

We went back to just referencing the first element of strs, not taking it. Also, we perform the allocation into a String just once, when we actually have to return a String.

There are still some other optimizations to be done. First, sorting strs is useless, it won't change the result of the longest_common_prefix, so we can remove that

pub fn longest_common_prefix(strs: Vec<String>) -> String {
    let mut longest_pref: &str = &strs[0];
    for i in 1..strs.len() {
        longest_pref = get_common_prefix(&longest_pref, &strs[i]);
    }
    return longest_pref.to_string();
}

Next, s1.chars().nth(i) is very slow (Θ(i)). A more efficient way to do this is to reuse the same iterator (s1.chars()), and to advance it at each step, as so

for (c1, c2) in s1.chars().zip(s2.chars()) {
    if c1 != c2 {
        break;
    }
    idx += 1;    
}

.zip() will not take any leftover characters from the longest string, so we can actually remove the swap at all, getting

pub fn get_common_prefix<'a>(s1: &'a str, s2: &'a str) -> &'a str {
    let mut idx: usize = 0;
    for (c1, c2) in s1.chars().zip(s2.chars()) {
        if c1 != c2 {
            break;
        }
        idx += c1.len_utf8();    
    }
    return &s1[0..idx]
}

Note: as indicated by @SebastianRedi, idx shouldn't be increased by 1 but by c1.len_utf8(), because when indexing a string, the indexes are expressed in bytes, not in characters, and some characters are more than a single byte long.

jthulhu
  • 7,223
  • 2
  • 16
  • 33
  • 3
    Very nice comprehensive answer. There's still a bug left for non-ASCII strings, though: `chars()` iterates Unicode codepoints, but slicing counts bytes. Getting the common prefix of "ölig" and "ölen" will return "ö" ("öl" is 2 chars, but the first 2 bytes is just "ö"), of "ölig" and "öd" will panic ("ö" is 1 char, but slicing at 1 byte will slice into the UTF-8 sequence of "ö", which panics). The loop needs to do `idx += c1.len_utf8()` instead of `idx += 1`. And then you *still* don't have it correct for grapheme clusters. – Sebastian Redl May 15 '23 at 07:16
  • @SebastianRedl great catch! I've gotten it wrong so many times and yet I still do that mistake. However, I don't know why it would be incorrect for grapheme clusters, do you know any resource about it? – jthulhu May 15 '23 at 07:28
  • 1
    About the grapheme clusters, imagine you have in `s1` an `a` followed by `U+0301` (COMBINING ACUTE ACCENT), that would be rendered as `á`, and in `s2` an `a` plus `U+0302` that would be `â`. This code would declare a common prefix of `a`, that is probably incorrect. More complex grapheme clusters would give even more incorrect prefixes. – rodrigo May 15 '23 at 08:10
  • 2
    Getting this right for Unicode strings would also require some Unicode normalization, since the same character can be represented in multiple ways. – Sven Marnach May 15 '23 at 09:45
3

In response to the discussion in the comments on the other answer, Here's an implementation of longest_common_prefix() that correctly uses extended grapheme clusters and works for arbitrary iterators of string slices:

use unicode_segmentation::UnicodeSegmentation;

pub fn longest_common_prefix<'a, I>(strings: I) -> &'a str
where
    I: IntoIterator<Item = &'a str>,
{
    let mut iters: Vec<_> = strings.into_iter().map(|s| s.graphemes(true)).collect();
    match iters.len() {
        0 => return "",
        1 => return iters[0].as_str(),
        _ => {}
    };
    let first = iters[0].as_str();
    let mut length = 0;
    loop {
        let Some(g) = iters[0].next() else { break; };
        if iters[1..].iter_mut().any(|iter| iter.next() != Some(g)) {
            break;
        }
        length += g.len();
    }
    &first[..length]
}

Unicode normalization still needs to be performed outside of this function, e.g. like this:

use unicode_normalization::UnicodeNormalization;

let strings = ...; // the input strings
let normalized: Vec<String> = strings.iter().map(|s| s.nfc().collect()).collect();
let common_prefix = longest_common_prefix(normalized.iter().map(|s| s.as_str()));

The code above uses these dependencies:

unicode-segmentation = "1.10.1"
unicode-normalization = "0.1.22"
Sven Marnach
  • 574,206
  • 118
  • 941
  • 841