1

So I've read Why can't I store a value and a reference to that value in the same struct? and I understand why my naive approach to this was not working, but I'm still very unclear how to better handle my situation.

I have a program I wanted to structure like follows (details omitted because I can't make this compile anyway):

use std::sync::Mutex;

struct Species{
    index : usize,
    population : Mutex<usize>
}
struct Simulation<'a>{
    species : Vec<Species>,
    grid : Vec<&'a Species>
}
impl<'a> Simulation<'a>{
    pub fn new() -> Self {...} //I don't think there's any way to implement this
    pub fn run(&self) {...}
}

The idea is that I create a vector of Species (which won't change for the lifetime of Simulation, except in specific mutex-guarded fields) and then a grid representing which species live where, which will change freely. This implementation won't work, at least not any way I've been able to figure out. As I understand it, the issue is that pretty much however I make my new method, the moment it returns, all of the references in grid would becomine invalid as Simulation and therefor Simulation.species are moved to another location in the stack. Even if I could prove to the compiler that species and its contents would continue to exist, they actually won't be in the same place. Right?

I've looked into various ways around this, such as making species as an Arc on the heap or using usizes instead of references and implementing my own lookup function into the species vector, but these seem slower, messier or worse. What I'm starting to think is that I need to really re-structure my code to look something like this (details filled in with placeholders because now it actually runs):

use std::sync::Mutex;

struct Species{
    index : usize,
    population : Mutex<usize>
}
struct Simulation<'a>{
    species : &'a Vec<Species>, //Now just holds a reference rather than data
    grid : Vec<&'a Species>
}
impl<'a> Simulation<'a>{
    pub fn new(species : &'a Vec <Species>) -> Self { //has to be given pre-created species
        let grid = vec!(species.first().unwrap(); 10);
        Self{species, grid}
    }
    pub fn run(&self) {
        let mut population = self.grid[0].population.lock().unwrap();
        println!("Population: {}", population);
        *population += 1;
    }
}

pub fn top_level(){
    let species = vec![Species{index: 0, population : Mutex::new(0_)}];
    let simulation = Simulation::new(&species);
    simulation.run();
}

As far as I can tell this runs fine, and ticks off all the ideal boxes:

  • grid uses simple references with minimal boilerplate for me
  • these references are checked at compile time with minimal overhead for the system
  • Safety is guaranteed by the compiler (unlike a custom map based approach)

But, this feels very weird to me: the two-step initialization process of creating owned memory and then references can't be abstracted any way that I can see, which feels like I'm exposing an implementation detail to the calling function. top_level has to also be responsible for establishing any other functions or (scoped) threads to run the simulation, call draw/gui functions, etc. If I need multiple levels of references, I believe I will need to add additional initialization steps to that level.

So, my question is just "Am I doing this right?". While I can't exactly prove this is wrong, I feel like I'm losing a lot of near-universal abstraction of the call structure. Is there really no way to return species and simulation as a pair at the end (with some one-off update to make all references point to the "forever home" of the data).

Phrasing my problem a second way: I do not like that I cannot have a function with a signature of ()-> Simulation, when I can can have a pair of function calls that have that same effect. I want to be able to encapsulate the creation of this simulation. I feel like the fact that this approach cannot do so indicates I'm doing something wrong, and that there may be a more idiomatic approach I'm missing.

Edward Peters
  • 3,623
  • 2
  • 16
  • 39
  • `Vec` is heap allocated. Only a few bytes that store its size, capacity and pointer to the heap allocated storage are stored on the stack. So all the `Species` kept int the `species` vector stay where they are after `new()`, and no references are invalidated – Alexey S. Larionov Apr 23 '22 at 15:26
  • Besides, the code you've attached compiles just fine. What is the actual problem? – Alexey S. Larionov Apr 23 '22 at 15:26
  • @AlexeyLarionov Didn't realize vector was heap allocated, but the first version won't compile anyway - it still complains about moves for any version I tried to implement, for basically the same reason described in the self-referential struct thing. In any event, the issue should apply for any stack-allocated thing. As for the actual problem, as I said, the version that compiles requires exposing implementation details to the calling function. – Edward Peters Apr 23 '22 at 16:01
  • Separating the immutable data from the immutable seems like _good_ design to me. This way you have the option to reuse your species data for multiple concurrently-running simulations. – cdhowie Apr 23 '22 at 20:43
  • "from the _mutable_", of course – cdhowie Apr 23 '22 at 20:52
  • @cdhowie I think it's an issue that I can't encapsulate the instantiation. It's also not exactly a matter of mutability and immutability - the species hold mutexes, so they aren't "re used" either way. – Edward Peters Apr 23 '22 at 20:53
  • @EdwardPeters "Load species data" is a separate operation from "start simulation with this species data," so I'm not sure what you mean by "encapsulate the instantiation." You can always have a helper function to do both in one call, if you feel that's necessary. – cdhowie Apr 23 '22 at 21:00
  • @cdhowie I actually don't think I can do both in one call, at least not any way that I've found. Creating both the species vector and the grid and returning both together fails due to move/borrow reasons. – Edward Peters Apr 23 '22 at 21:09
  • @EdwardPeters It's possible with pinning and `unsafe` but is overkill here anyway. Loading the data and starting a simulation seem like two operations to me, I'm not sure why you'd want them to be a single function anyway. – cdhowie Apr 23 '22 at 21:12
  • @cdhowie so it's not really "starting the simulation" but "building the usable state". 'run' is what actually starts the simulation. In the fuller version there are three methods that I want exposed - 'new', 'run' and 'draw'. Both creating the species list and creating the grid are part of 'new', but have to be done in the same call. – Edward Peters Apr 23 '22 at 21:17
  • You might find it interesting to try to use an [ECS](https://en.wikipedia.org/wiki/Entity_component_system) here, like what the [Bevy engine](https://bevyengine.org/learn/book/getting-started/ecs/) provides. – SirDarius Apr 23 '22 at 23:15

2 Answers2

5

I've looked into various ways around this, such as making species as an Arc on the heap or using usizes instead of references and implementing my own lookup function into the species vector, but these seem slower, messier or worse.

Don't assume that, test it. I once had a self-referential (using ouroboros) structure much like yours, with a vector of things and a vector of references to them. I tried rewriting it to use indices instead of references, and it was faster.

Rc/Arc is also an option worth trying out — note that there is only an extra cost to the reference counting when an Arc is cloned or dropped. Arc<Species> doesn't cost any more to dereference than &Species, and you can always get an &Species from an Arc<Species>. So the reference counting only matters if and when you're changing which Species is in an element of Grid.

Kevin Reid
  • 37,492
  • 13
  • 80
  • 108
  • That's interesting. What about stack vs. heap speed differences? As I understand it the 'Vector' implementation above is heap anyway, but I could use a stack structure instead of I'm not using an Arc, right? – Edward Peters Apr 23 '22 at 18:37
  • Actually, the *real* performance improvement is very likely due to the fact that when you store your data in an array with indices, instead of anywhere in the memory with pointers, all your data is stored next to each other, so it's a lot better from a CPU cache perspective. – jthulhu Apr 23 '22 at 20:27
  • @EdwardPeters If you switched to using **arrays**, so that the size is fixed at compile time, you could store data without indirection, and therefore possibly on the stack (if it is small enough to fit). But don't assume that stack is faster; stack isn't a fundamentally different kind of memory, it's an _allocation method._ – Kevin Reid Apr 23 '22 at 22:11
  • Thinking about it "just use Arcs" sounds like the real answer, at least for my use case. I'm still a little uneasy that my implementation should have *slightly* lower overhead but seems to defy encapsulation, but it seems it'd take something weird to make that work. – Edward Peters Apr 23 '22 at 23:16
2

If you're owning a Vec of objects, then want to also keep track of references to particular objects in that Vec, a usize index is almost always the simplest design. It might feel like extra boilerplate to you now, but it's a hell of a lot better than properly dealing with keeping pointers in check in a self-referential struct (as somebody who's made this mistake in C++ more than I should have, trust me). Rust's rules are saving you from some real headaches, just not ones that are obvious to you necessarily.

If you want to get fancy and feel like a raw usize is too arbitrary, then I recommend you look at slotmap. For a simple SlotMap, internally it's not much more than an array of values, iteration is fast and storage is efficient. But it gives you generational indices (slotmap calls these "keys") to the values: each value is embellished with a "generation" and each index also internally keeps hold of a its generation, therefore you can safely remove and replace items in the Vec without your references suddenly pointing at a different object, it's really cool.

JMAA
  • 1,730
  • 13
  • 24