11

I have a struct Foo:

struct Foo {
    v: String,
    // Other data not important for the question
}

I want to handle a data stream and save the result into Vec<Foo> and also create an index for this Vec<Foo> on the field Foo::v.

I want to use a HashMap<&str, usize> for the index, where the keys will be &Foo::v and the value is the position in the Vec<Foo>, but I'm open to other suggestions.

I want to do the data stream handling as fast as possible, which requires not doing obvious things twice.

For example, I want to:

  • allocate a String only once per one data stream reading
  • not search the index twice, once to check that the key does not exist, once for inserting new key.
  • not increase the run time by using Rc or RefCell.

The borrow checker does not allow this code:

let mut l = Vec::<Foo>::new();
{
    let mut hash = HashMap::<&str, usize>::new();
    //here is loop in real code, like: 
    //let mut s: String; 
    //while get_s(&mut s) {
    let s = "aaa".to_string();
    let idx: usize = match hash.entry(&s) { //a
        Occupied(ent) => {
            *ent.get()
        }
        Vacant(ent) => {
            l.push(Foo { v: s }); //b
            ent.insert(l.len() - 1);
            l.len() - 1
        }
    };
    // do something with idx
}

There are multiple problems:

  1. hash.entry borrows the key so s must have a "bigger" lifetime than hash
  2. I want to move s at line (b), while I have a read-only reference at line (a)

So how should I implement this simple algorithm without an extra call to String::clone or calling HashMap::get after calling HashMap::insert?

Peter Hall
  • 53,120
  • 14
  • 139
  • 204
user1244932
  • 7,352
  • 5
  • 46
  • 103
  • Here is a testable example: [playground](https://play.rust-lang.org/?gist=8a622fecc6af3ca89ec96aa4ad238afc&version=stable&backtrace=0) – user4815162342 Apr 18 '17 at 08:35
  • 1
    How do you handle collisions? Do you want all indices, only the first, only the last, or make it crash? – Matthieu M. Apr 27 '17 at 11:28

3 Answers3

10

In general, what you are trying to accomplish is unsafe and Rust is correctly preventing you from doing something you shouldn't. For a simple example why, consider a Vec<u8>. If the vector has one item and a capacity of one, adding another value to the vector will cause a re-allocation and copying of all the values in the vector, invalidating any references into the vector. This would cause all of your keys in your index to point to arbitrary memory addresses, thus leading to unsafe behavior. The compiler prevents that.

In this case, there's two extra pieces of information that the compiler is unaware of but the programmer isn't:

  1. There's an extra indirection — String is heap-allocated, so moving the pointer to that heap allocation isn't really a problem.
  2. The String will never be changed. If it were, then it might reallocate, invalidating the referred-to address. Using a Box<[str]> instead of a String would be a way to enforce this via the type system.

In cases like this, it is OK to use unsafe code, so long as you properly document why it's not unsafe.

use std::collections::HashMap;

#[derive(Debug)]
struct Player {
    name: String,
}

fn main() {
    let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"];

    let mut players = Vec::new();
    let mut index = HashMap::new();

    for &name in &names {
        let player = Player { name: name.into() };
        let idx = players.len();

        // I copied this code from Stack Overflow without reading the prose
        // that describes why this unsafe block is actually safe
        let stable_name: &str = unsafe { &*(player.name.as_str() as *const str) };

        players.push(player);
        index.insert(idx, stable_name);
    }

    for (k, v) in &index {
        println!("{:?} -> {:?}", k, v);
    }

    for v in &players {
        println!("{:?}", v);
    }
}

However, my guess is that you don't want this code in your main method but want to return it from some function. That will be a problem, as you will quickly run into Why can't I store a value and a reference to that value in the same struct?.


Honestly, there's styles of code that don't fit well within Rust's limitations. If you run into these, you could:

  • decide that Rust isn't a good fit for you or your problem.
  • use unsafe code, preferably thoroughly tested and only exposing a safe API.
  • investigate alternate representations.

For example, I'd probably rewrite the code to have the index be the primary owner of the key:

use std::collections::BTreeMap;

#[derive(Debug)]
struct Player<'a> {
    name: &'a str,
    data: &'a PlayerData,
}

#[derive(Debug)]
struct PlayerData {
    hit_points: u8,
}

#[derive(Debug)]
struct Players(BTreeMap<String, PlayerData>);

impl Players {
    fn new<I>(iter: I) -> Self
    where
        I: IntoIterator,
        I::Item: Into<String>,
    {
        let players = iter
            .into_iter()
            .map(|name| (name.into(), PlayerData { hit_points: 100 }))
            .collect();
        Players(players)
    }

    fn get<'a>(&'a self, name: &'a str) -> Option<Player<'a>> {
        self.0.get(name).map(|data| Player { name, data })
    }
}

fn main() {
    let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"];

    let players = Players::new(names.iter().copied());

    for (k, v) in &players.0 {
        println!("{:?} -> {:?}", k, v);
    }

    println!("{:?}", players.get("eustice"));
}

Alternatively, as shown in What's the idiomatic way to make a lookup table which uses field of the item as the key?, you could wrap your type and store it in a set container instead:

use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
struct Player {
    name: String,
    hit_points: u8,
}

#[derive(Debug, Eq)]
struct PlayerByName(Player);

impl PlayerByName {
    fn key(&self) -> &str {
        &self.0.name
    }
}

impl PartialOrd for PlayerByName {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for PlayerByName {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        self.key().cmp(&other.key())
    }
}

impl PartialEq for PlayerByName {
    fn eq(&self, other: &Self) -> bool {
        self.key() == other.key()
    }
}

impl std::borrow::Borrow<str> for PlayerByName {
    fn borrow(&self) -> &str {
        self.key()
    }
}

#[derive(Debug)]
struct Players(BTreeSet<PlayerByName>);

impl Players {
    fn new<I>(iter: I) -> Self
    where
        I: IntoIterator,
        I::Item: Into<String>,
    {
        let players = iter
            .into_iter()
            .map(|name| {
                PlayerByName(Player {
                    name: name.into(),
                    hit_points: 100,
                })
            })
            .collect();
        Players(players)
    }

    fn get(&self, name: &str) -> Option<&Player> {
        self.0.get(name).map(|pbn| &pbn.0)
    }
}

fn main() {
    let names = ["alice", "bob", "clarice", "danny", "eustice", "frank"];

    let players = Players::new(names.iter().copied());

    for player in &players.0 {
        println!("{:?}", player.0);
    }

    println!("{:?}", players.get("eustice"));
}

not increase the run time by using Rc or RefCell

Guessing about performance characteristics without performing profiling is never a good idea. I honestly don't believe that there'd be a noticeable performance loss from incrementing an integer when a value is cloned or dropped. If the problem required both an index and a vector, then I would reach for some kind of shared ownership.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • 1
    I wonder if one could use the `owning_ref` crate to obtain a pointer to a memory that is safe to move? I tried the obvious, [replacing `v: String` with `v: OwningRef`](https://pastebin.com/YZG0RUc9), but it still doesn't compile, with pretty much the same error. I wonder if I misunderstood `owning_ref`, or if this is the case of the borrow checker detecting #2, possibility of `String` change. – user4815162342 Apr 24 '17 at 07:00
  • 1
    @user4815162342 yeah, I feel like every time I want to use owning_ref, I don't *quite* understand it – Shepmaster Apr 24 '17 at 12:01
  • Probably a silly question, but if I want to have different indexes (by first name, last name, etc), wha whoulr be a solution ? Using ˋArc` to share ownership pf data between the different indexes or is there another elegant solution ? – yageek Jun 06 '20 at 13:28
  • @yageek `Rc` or `Arc` would be a memory-efficient technique. You could also use something like a string interner or an arena. – Shepmaster Jun 06 '20 at 14:33
  • What is a string interner or arena ? – yageek Jun 06 '20 at 18:52
  • 1
    @yageek [string interning](https://en.wikipedia.org/wiki/String_interning); [arena](https://en.wikipedia.org/wiki/Region-based_memory_management) – Shepmaster Jun 06 '20 at 19:29
6

not increase the run time by using Rc or RefCell.

@Shepmaster already demonstrated accomplishing this using unsafe, once you have I would encourage you to check how much Rc actually would cost you. Here is a full version with Rc:

use std::{
    collections::{hash_map::Entry, HashMap},
    rc::Rc,
};

#[derive(Debug)]
struct Foo {
    v: Rc<str>,
}

#[derive(Debug)]
struct Collection {
    vec: Vec<Foo>,
    index: HashMap<Rc<str>, usize>,
}

impl Foo {
    fn new(s: &str) -> Foo {
        Foo {
            v: s.into(),
        }
    }
}

impl Collection {
    fn new() -> Collection {
        Collection {
            vec: Vec::new(),
            index: HashMap::new(),
        }
    }

    fn insert(&mut self, foo: Foo) {
        match self.index.entry(foo.v.clone()) {
            Entry::Occupied(o) => panic!(
                "Duplicate entry for: {}, {:?} inserted before {:?}",
                foo.v,
                o.get(),
                foo
            ),
            Entry::Vacant(v) => v.insert(self.vec.len()),
        };
        self.vec.push(foo)
    }
}

fn main() {
    let mut collection = Collection::new();

    for foo in vec![Foo::new("Hello"), Foo::new("World"), Foo::new("Go!")] {
        collection.insert(foo)
    }

    println!("{:?}", collection);
}
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • @Shepmaster: Personally, I'd just like `Rc` to be possible. The only thing really missing is a `Rc::clone_from(&T) where T: ?Sized + Clone`, though how to guess the necessary target size in a generic manner seems... complicated :x The other way would be to be to manipulate `str` directly, like `[u8]`, I suppose, which is not too easy either. – Matthieu M. Apr 27 '17 at 12:36
  • @Shepmaster: Yes; that's what I was saying about the difficulty of "generalizing" `clone_from`. For non-`Sized` type you would need some kind of specific trait that gives you (1) the size of the memory buffer necessary and (2) a way to move/clone into that buffer. – Matthieu M. Apr 27 '17 at 12:45
1

The error is:

error: `s` does not live long enough
  --> <anon>:27:5
   |
16 |         let idx: usize = match hash.entry(&s) { //a
   |                                            - borrow occurs here
...
27 |     }
   |     ^ `s` dropped here while still borrowed
   |
   = note: values in a scope are dropped in the opposite order they are created

The note: at the end is where the answer is.

s must outlive hash because you are using &s as a key in the HashMap. This reference will become invalid when s is dropped. But, as the note says, hash will be dropped after s. A quick fix is to swap the order of their declarations:

let s = "aaa".to_string();
let mut hash = HashMap::<&str, usize>::new();

But now you have another problem:

error[E0505]: cannot move out of `s` because it is borrowed
  --> <anon>:22:33
   |
17 |         let idx: usize = match hash.entry(&s) { //a
   |                                            - borrow of `s` occurs here
...
22 |                 l.push(Foo { v: s }); //b
   |                                 ^ move out of `s` occurs here

This one is more obvious. s is borrowed by the Entry, which will live to the end of the block. Cloning s will fix that:

l.push(Foo { v: s.clone() }); //b

I only want to allocate s only once, not cloning it

But the type of Foo.v is String, so it will own its own copy of the str anyway. Just that type means you have to copy the s.

You can replace it with a &str instead which will allow it to stay as a reference into s:

struct Foo<'a> {
    v: &'a str,
}

pub fn main() {
    // s now lives longer than l
    let s = "aaa".to_string();
    let mut l = Vec::<Foo>::new();
    {
        let mut hash = HashMap::<&str, usize>::new();

        let idx: usize = match hash.entry(&s) {
            Occupied(ent) => {
                *ent.get()
            }
            Vacant(ent) => {
                l.push(Foo { v: &s });
                ent.insert(l.len() - 1);
                l.len() - 1
            }
        };
    }
}

Note that, previously I had to move the declaration of s to before hash, so that it would outlive it. But now, l holds a reference to s, so it has to be declared even earlier, so that it outlives l.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
Peter Hall
  • 53,120
  • 14
  • 139
  • 204
  • Actually, I do this in loop, and real code generate `String` on the fly, so you suggest create plus one container for `String`, but it is impossible, because of I can not extend container, and have reference to String at the same time. – user1244932 Apr 17 '17 at 23:00
  • My idea, that hash for `&str` depend on content of `&str`, not its address, so it would be nice reuse `entry` result. – user1244932 Apr 17 '17 at 23:18
  • I'm not fully understanding what you are asking. Could you add to your code example, so that it's clearer what you are trying to do? – Peter Hall Apr 18 '17 at 00:26
  • I don't understand why the borrow checker doesn't allow [the code](https://play.rust-lang.org/?gist=8a622fecc6af3ca89ec96aa4ad238afc&version=stable&backtrace=0) to compile. Even when a different string is pushed to the vector, the borrow checker [still complains](https://play.rust-lang.org/?gist=13714cbd78589da0a92d9cab868da609&version=stable&backtrace=0). – user4815162342 Apr 18 '17 at 08:37
  • It will compile if you flip the order of the `hash` and `s` definitions (and don't move the string): https://play.rust-lang.org/?gist=98fd4e8e6f04c0c5cfe581fac3019047&version=stable&backtrace=0. The hashmap lives just a little bit longer than the string (which is too long since the hashmap has a reference to the string). – Jacob Brown Apr 18 '17 at 16:26
  • 2
    @user1244932 if you are saying that this doesn't work for you because your real code uses a loop, then it would perhaps be useful to include the loops in your code sample. – Peter Hall Apr 20 '17 at 18:22
  • @PeterHall I add comment to code to show where loop, order of creation is important for me, I create `l` to hold generated `s`, reverse order is impossible, it is like creating chicken before egg. – user1244932 Apr 22 '17 at 17:08
  • @PeterHall And I understand why borrow checker complains and how to make it happy with extra heap allocation or extra CPU usage question is how to do things with zero cost. Like find place in hash without borrowing, and how have exactly one copy of string without things like `Rc` – user1244932 Apr 22 '17 at 17:12
  • @PeterHall Not the OP, but [here is a modified sample](https://play.rust-lang.org/?gist=69c6e1e3ddbab3192192ff37eeed6603&version=stable&backtrace=0) that uses a loop in the way I believe the OP needs. – user4815162342 Apr 22 '17 at 20:07