0

I need to build an index, where first I check if a string already exists in the store vector, or if not, add it to the store and get its position in it.

Once done, I will need to consume the store, and discard the index. I would like to avoid creating each string twice (once for the index and once for the store).

struct StringIndex {
    store: Vec<String>,
    index: HashMap<&'this str, usize>,
}

impl StringIndex {
  pub fn get_or_insert(&mut self, key: &str) -> usize {
    match self.index.entry(key) {
      Occupied(v) => v.key(),
      Vacant(v) => {
        let idx = self.store.len();
        self.store.push(key.to_string());  // Create a new clone
        v.insert(idx);
        idx
      }
    }
  }
}

I looked at self_cell and ouroboros, but neither seem to be targetting this use case... or not?

Yuri Astrakhan
  • 8,808
  • 6
  • 63
  • 97
  • You can probably do it with an `Rc` instead of a `String` and `&str`. – Jmb Apr 03 '23 at 08:07
  • 1
    Your use case seems very similar to this: https://matklad.github.io/2020/03/22/fast-simple-rust-interner.html – Gurwinder Singh Apr 03 '23 at 08:14
  • 1
    Relevant: https://stackoverflow.com/a/32300133 – E_net4 Apr 03 '23 at 09:15
  • Seems like the user who gave an answer got upset that their answer was mega-slow (150 times slower in my benchmarks), deleted their answer and -1ed the question... Or maybe its just a coincidence that the answer was deleted AND the question got -1 at the same time... – Yuri Astrakhan Apr 06 '23 at 16:58

1 Answers1

1

I ended up creating a dup-indexer crate / github / docs. Works with any values (both allocated and copy-able), and mega-performant:

use dup_indexer::DupIndexer;

fn main() {
    let mut di = DupIndexer::new();
    assert_eq!(0, di.insert("hello".to_string()));
    assert_eq!(1, di.insert("world".to_string()));
    assert_eq!(0, di.insert("hello".to_string()));
    assert_eq!(di.into_vec(), vec!["hello".to_string(), "world".to_string()]);
}

DupIndexer creates a shallow non-droppable copy of the value, and stores it in the hashmap, whereas the original value goes into the vector:

pub struct DupIndexer<T> {
    values: Vec<T>,
    lookup: HashMap<ManuallyDrop<T>, usize>,
}

// impl
pub fn insert(&mut self, value: T) -> usize {
    let dup_value = ManuallyDrop::new(unsafe { ptr::read(&value) });
    match self.lookup.entry(dup_value) {
        Occupied(entry) => *entry.get(),
        Vacant(entry) => {
            let index = self.values.len();
            entry.insert(index);
            self.values.push(value);
            index
        }
    }
}

I believe the above code is safe because the hashmap only keeps the ptr:read-created copy of the original value while we own it, and the value is never modified.

Note that Miri does complain about this code, but only in one case: if I use Box<T> as the value. I do not know if this is a valid concern or not. See the safety section with the error output

Yuri Astrakhan
  • 8,808
  • 6
  • 63
  • 97