7

Suppose I'm trying to do a fancy zero-copy parser in Rust using &str, but sometimes I need to modify the text (e.g. to implement variable substitution). I really want to do something like this:

fn main() {
    let mut v: Vec<&str> = "Hello there $world!".split_whitespace().collect();

    for t in v.iter_mut() {
        if (t.contains("$world")) {
            *t = &t.replace("$world", "Earth");
        }
    }

    println!("{:?}", &v);
}

But of course the String returned by t.replace() doesn't live long enough. Is there a nice way around this? Perhaps there is a type which means "ideally a &str but if necessary a String"? Or maybe there is a way to use lifetime annotations to tell the compiler that the returned String should be kept alive until the end of main() (or have the same lifetime as v)?

oli_obk
Timmmm

2 Answers

9

Rust has exactly what you want in the form of the Cow (clone-on-write) type.

use std::borrow::Cow;

fn main() {
    let mut v: Vec<_> = "Hello there $world!".split_whitespace()
                                             .map(|s| Cow::Borrowed(s))
                                             .collect();

    for t in v.iter_mut() {
        if t.contains("$world") {
            *t.to_mut() = t.replace("$world", "Earth");
        }
    }

    println!("{:?}", &v);
}

As @sellibitze correctly notes, to_mut() first creates a new String, which costs a heap allocation just to copy the borrowed value that is then immediately overwritten. If you are sure you only have borrowed strings, you can use

*t = Cow::Owned(t.replace("$world", "Earth"));
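
A minimal sketch of that variant as a complete program. The helper name `substitute_world` is hypothetical and only there to make the example testable. Tokens without the variable stay `Cow::Borrowed` and keep pointing into the input; only the rewritten token becomes `Cow::Owned`.

```rust
use std::borrow::Cow;

// Hypothetical helper: splits `input` on whitespace and substitutes "$world".
fn substitute_world(input: &str) -> Vec<Cow<'_, str>> {
    let mut v: Vec<Cow<'_, str>> = input.split_whitespace().map(Cow::Borrowed).collect();

    for t in v.iter_mut() {
        if t.contains("$world") {
            // Build the owned replacement directly, without first copying the token.
            *t = Cow::Owned(t.replace("$world", "Earth"));
        }
    }
    v
}

fn main() {
    let v = substitute_world("Hello there $world!");
    for t in &v {
        let kind = match t {
            Cow::Borrowed(_) => "borrowed",
            Cow::Owned(_) => "owned",
        };
        println!("{:?} is {}", t, kind);
    }
}
```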

If the Vec already contains Cow::Owned elements, assigning a fresh Cow::Owned still throws away their existing allocation. You can avoid that with the following very fragile and unsafe code inside your for loop. It manipulates the bytes of a UTF-8 string directly and relies on the fact that the replacement is exactly one byte shorter than the pattern: the $ sign is removed and the remaining five bytes are overwritten.

let mut last_pos = 0; // so we don't start at the beginning every time
while let Some(pos) = t[last_pos..].find("$world") {
    let p = pos + last_pos; // find reports offsets relative to last_pos
    last_pos = p + 5; // resume right after the 5 bytes of "Earth"
    unsafe {
        let s = t.to_mut().as_mut_vec(); // operating on Vec is easier
        s.remove(p); // remove $ sign
        // overwrite "world" with "Earth"; both are 5 ASCII bytes
        for (c, sc) in "Earth".bytes().zip(&mut s[p..]) {
            *sc = c;
        }
    }
}

Note that this is tailored exactly to the "$world" -> "Earth" mapping. Any other mappings require careful consideration inside the unsafe code.
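
For comparison, here is a safe sketch of the same idea using `String::replace_range`, which is a different technique from the byte-level code above. It still edits an owned string in place, so it should be able to reuse the existing buffer when the result fits in its capacity, and it does not depend on the lengths of the pattern and replacement. The function name `replace_in_place` is hypothetical.

```rust
use std::borrow::Cow;

// Hypothetical helper: replaces every `from` in `t` with `to`, editing in place.
fn replace_in_place(t: &mut Cow<'_, str>, from: &str, to: &str) {
    if from.is_empty() {
        return; // an empty pattern would match at every position
    }
    let mut start = 0; // resume searching after the previous replacement
    while let Some(pos) = t[start..].find(from) {
        let p = start + pos; // `find` reports offsets relative to `start`
        // `to_mut` copies a borrowed value once; an owned one is edited directly.
        t.to_mut().replace_range(p..p + from.len(), to);
        start = p + to.len();
    }
}

fn main() {
    let mut owned: Cow<str> = Cow::Owned(String::from("$world, meet $world!"));
    replace_in_place(&mut owned, "$world", "Earth");
    println!("{}", owned);

    let mut untouched: Cow<str> = Cow::Borrowed("Hello");
    replace_in_place(&mut untouched, "$world", "Earth");
    // No match, so `to_mut` was never called and the value is still borrowed.
    println!("{}", matches!(untouched, Cow::Borrowed(_)));
}
```

Because `to_mut` is only called after a match is found, tokens that do not contain the pattern are never copied.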

oli_obk
  • 2
    The `to_mut` here only creates an unnecessary `String` value (involves heap memory allocation) which is immediately overwritten (involves deallocation). I'd change the line to `*t = Cow::Owned(t.replace("$world", "Earth"));` to avoid this overhead. – sellibitze Jul 06 '15 at 13:20
  • 1
    Your last example probably should have more warnings beyond "careful consideration" placed around it. It does direct byte-based manipulation of UTF-8 strings and relies on the fact that the replacement happens to be exactly the same number of bytes. It's definitely an optimization, but not a universally applicable one. – Shepmaster Jul 06 '15 at 14:16
  • added more warnings and some bold text. I wonder if a PR adding a `replace(&mut self, needle, value)` function to the `String` struct would be accepted – oli_obk Jul 06 '15 at 14:37
8

Use std::borrow::Cow, specifically as Cow<'a, str>, where 'a is the lifetime of the string being parsed.

use std::borrow::Cow;

fn main() {
    let mut v: Vec<Cow<'static, str>> = vec![];
    v.push("oh hai".into());
    v.push(format!("there, {}.", "Mark").into());

    println!("{:?}", v);
}

Produces:

["oh hai", "there, Mark."]
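
To tie the lifetime to the parsed input rather than `'static`, a sketch along these lines should work. The function name `expand` and the `$name` variable are made up for illustration.

```rust
use std::borrow::Cow;

// Hypothetical tokenizer: tokens borrow from `input` unless they need rewriting.
fn expand<'a>(input: &'a str, name: &str) -> Vec<Cow<'a, str>> {
    input
        .split_whitespace()
        .map(|tok| -> Cow<'a, str> {
            if tok.contains("$name") {
                Cow::Owned(tok.replace("$name", name)) // allocates only here
            } else {
                Cow::Borrowed(tok) // still points into `input`
            }
        })
        .collect()
}

fn main() {
    let input = String::from("oh hai $name.");
    let v = expand(&input, "Mark");
    println!("{:?}", v);
}
```

The returned Vec cannot outlive `input`, which is what the `'a` in the signature expresses.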
DK.