
I'm interested in finding or implementing a Rust data structure that provides a zero-cost way to memoize a single computation with an arbitrary output type T. Specifically I would like a generic type Cache<T> whose internal data takes up no more space than an Option<T>, with the following basic API:

impl<T> Cache<T> {
    /// Return a new Cache with no value stored in it yet.
    pub fn new() -> Self {
        // ...
    }

    /// If the cache has a value stored in it, return a reference to the 
    /// stored value. Otherwise, compute `f()`, store its output
    /// in the cache, and then return a reference to the stored value.
    pub fn call<F: FnOnce() -> T>(&self, f: F) -> &T {
        // ...
    }
}

The goal is to be able to share multiple immutable references to a Cache within a single thread, with any holder of such a reference able to access the value (triggering the computation on first access). Since the Cache only needs to be shared within a single thread, it does not need to be Sync.
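
To make the intended usage concrete, here is a minimal sketch against the API declared above; expensive_computation and use_cache are hypothetical stand-ins:

fn expensive_computation() -> String {
    // Hypothetical stand-in for some costly computation.
    "expensive result".to_string()
}

fn use_cache(cache: &Cache<String>) -> usize {
    // Any holder of a shared reference may trigger or reuse the computation.
    cache.call(expensive_computation).len()
}

fn main() {
    let cache = Cache::new();
    let first = use_cache(&cache);  // computes and stores the value
    let second = use_cache(&cache); // reuses the stored value
    assert_eq!(first, second);
}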

Here is a way to implement the API that I believe is sound, using unsafe under the hood:

use std::cell::UnsafeCell;

pub struct Cache<T> {
    value: UnsafeCell<Option<T>>
}

impl<T> Cache<T> {
    pub fn new() -> Self {
        Cache { value: UnsafeCell::new(None) }
    }

    pub fn call<F: FnOnce() -> T>(&self, f: F) -> &T {
        let ptr = self.value.get();
        unsafe {
            if (*ptr).is_none() {
                let t = f();
                // Since `f` potentially could have invoked `call` on this
                // same cache, to be safe we must check again that *ptr
                // is still None, before setting the value.
                if (*ptr).is_none() {
                    *ptr = Some(t);
                }
            }
            (*ptr).as_ref().unwrap()
        }
    }
}

Is it possible to implement such a type in safe Rust (i.e., not writing our own unsafe code, only indirectly relying on unsafe code in the standard library)?

Clearly, the call method requires mutating self, which means that Cache must use some form of interior mutability. However, this does not seem possible with a Cell, because Cell provides no way to retrieve a reference to the enclosed value, as the desired API of call requires. And for good reason: it would be unsound for Cell to hand out such a reference, since it could not ensure that the referenced value is not mutated during the lifetime of the reference. For the Cache type, on the other hand, once call has stored a value, the API above provides no way to mutate it again, so it is safe to hand out a reference whose lifetime is as long as that of the Cache itself.
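
To make the limitation concrete, here is a minimal sketch of the closest safe Cell-based approach I can see (the CellCache name is just illustrative): because Cell never hands out a reference into its interior, call has to return the value by value, which forces a Clone bound and is not the desired API.

use std::cell::Cell;

pub struct CellCache<T> {
    value: Cell<Option<T>>,
}

impl<T: Clone> CellCache<T> {
    pub fn new() -> Self {
        CellCache { value: Cell::new(None) }
    }

    /// Returns a clone of the value rather than a `&T`; `Cell` offers no
    /// safe way to borrow its contents.
    pub fn call<F: FnOnce() -> T>(&self, f: F) -> T {
        // Move the value out temporarily (`Option<T>` implements `Default`).
        let v = match self.value.take() {
            Some(v) => v,
            None => f(),
        };
        // Put a clone back so later calls can reuse it.
        self.value.set(Some(v.clone()));
        v
    }
}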

If Cell can't work, I'm curious if the Rust standard library might provide some other safe building blocks for interior mutability which could be used to implement this Cache.

Neither RefCell nor Mutex fulfills the goals here:

  1. They are not zero-cost: they store more data than an Option<T> alone and add unnecessary runtime checks.
  2. They do not seem to provide any way to return a real reference with the lifetime we want -- instead we could only return a Ref or MutexGuard, which is not the same thing (a RefCell-based sketch illustrating this follows the list).
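
Here is a rough sketch of what a RefCell-based version might look like, using the Ref::map workaround (the kind of approach linked in the comments below). It works, but it returns a Ref<'_, T> guard rather than a plain &T, and it pays for the extra borrow flag and its runtime checks:

use std::cell::{Ref, RefCell};

pub struct RefCellCache<T> {
    value: RefCell<Option<T>>,
}

impl<T> RefCellCache<T> {
    pub fn new() -> Self {
        RefCellCache { value: RefCell::new(None) }
    }

    /// Returns a `Ref` guard, not a plain `&T`, and incurs runtime
    /// borrow checks on every call.
    pub fn call<F: FnOnce() -> T>(&self, f: F) -> Ref<'_, T> {
        if self.value.borrow().is_none() {
            // Compute before borrowing mutably, so a reentrant `call`
            // made by `f` does not panic on an overlapping borrow.
            let t = f();
            let mut slot = self.value.borrow_mut();
            if slot.is_none() {
                *slot = Some(t);
            }
        }
        Ref::map(self.value.borrow(), |opt| opt.as_ref().unwrap())
    }
}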

Using only an Option would not provide the same functionality: if we share immutable references to the Cache, any holder of such a reference can invoke call and obtain the desired value (mutating the Cache in the process so that future calls retrieve the same value); whereas, if we shared immutable references to a plain Option, no holder could mutate it, so it could not work.
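
For contrast, a sketch of the Option-only variant (the MutCache name is just illustrative): it provides the same caching behavior, but only through &mut self, so it cannot be used through the shared references described above.

pub struct MutCache<T> {
    value: Option<T>,
}

impl<T> MutCache<T> {
    pub fn new() -> Self {
        MutCache { value: None }
    }

    /// Requires exclusive access, so it cannot be called through a
    /// shared `&MutCache<T>`.
    pub fn call<F: FnOnce() -> T>(&mut self, f: F) -> &T {
        // `Option::get_or_insert_with` inserts the computed value if the
        // Option is `None` and returns a reference to the contents.
        self.value.get_or_insert_with(f)
    }
}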

  • What about https://doc.rust-lang.org/std/cell/struct.RefCell.html ? – Svetlin Zarev May 20 '19 at 05:44
  • Why the `UnsafeCell`? It is in its name... **unsafe**, don't use it! You don't know what you're doing ;) Btw: Use `RefCell` or a `Mutex`. – hellow May 20 '19 at 05:56
  • On a second thought you cannot return `&T`, with `RefCell` because the `Ref` cannot outlive the function call. – Svetlin Zarev May 20 '19 at 06:04
  • Right, neither `RefCell` nor `Mutex` fulfill the goals here, 1) because they are not zero-cost (they involve storing more data than that of an `Option` and also add unnecessary runtime checks), and 2) they do not seem to provide any way to return a real reference with the lifetime that we want -- instead we could only return a `Ref` or `MutexGuard` which is not the same thing. – Brent Kerby May 20 '19 at 06:17
  • @BrentKerby then you can't achieve what you're looking for. Your code will fail in multi-threaded code and your unsafe will probably hide it. – hellow May 20 '19 at 06:19
  • I don't understand the purpose, why not just use only an option. Your question don't make sense. also your code is broken in case of multi threading. – Stargateur May 20 '19 at 06:24
  • Because of the contained `UnsafeCell`, `Cache` does not implement `Sync`, so Rust will not allow it to be shared across threads, but that is fine. In other words, sharing across threads is not part of the goal here. – Brent Kerby May 20 '19 at 06:24
  • Using only an `Option` would not provide the same functionality: if we share immutable references to the `Cache`, any holder of such a reference can invoke `call` and obtain the desired value (and mutating the `Cache` in the process so that future calls will retrieve the same value); whereas, sharing immutable references to an `Option`, it would not be possible to mutate the `Option`, so it could not work. – Brent Kerby May 20 '19 at 06:29
  • Could you please edit in the question that you are not looking for a solution which is `Sync`? This was my first question too, and may be the only thing making a solution to your task possible. – Matthieu M. May 20 '19 at 06:44
  • @MatthieuM, done! Yeah, the multithreaded case is interesting too but would inevitably require some kind of synchronization, so I doubt it could be done with only an `Option` amount of storage. – Brent Kerby May 20 '19 at 06:59
  • @BrentKerby: Oh, I am sure it could; but you would need some kind of "sync Option". Actually, this sparks another question: does it matter if `f` is called multiple times? Notably, if the call to `f()` somehow calls `Cache::call` which triggers another execution of `f()`? – Matthieu M. May 20 '19 at 07:16
  • Yeah, I guess you're right that in theory it should be possible to support multithreaded use still with only `Option` amount of storage, at least if we were okay with `f` possibly being called multiple times (e.g., if a second thread invokes `call` while the first is still computing); in that case, it seems like we would just need to be able to access the discriminant of the Option atomically. But since Rust's layout of the Option is unspecified and may depend on various optimizations, it seems that this would probably require some kind of special compiler support to implement properly. – Brent Kerby May 20 '19 at 07:30
  • I'll have to think more about that though, because it seems like even in the single-threaded case, the ability of `f` to call the same `Cache` again could make it possible for the value to be written more than once, potentially compromising safety. – Brent Kerby May 20 '19 at 07:35
  • Yeah, I'm convinced now my original implementation was unsound in that it could result in undefined behavior, from mutation still being possible on a value referenced by an immutable reference returned by `call`. I've modified it now to guard against the possibility that `f()` itself may invoke `call` on the same `Cache`. This stuff sure is tricky to get right! – Brent Kerby May 20 '19 at 07:46
  • I'd challenge the purpose of your question. If you are caching something, that's because either its "big" or "expensive" or both. Within that view, adding the extra few bytes for a `RefCell` is a pittance. See [How do I return a reference to something inside a RefCell without breaking encapsulation?](https://stackoverflow.com/q/29401626/155423) for how to avoid the API issues. – Shepmaster May 20 '19 at 14:19
  • Here's one use case: a reactive system with a lot of small functional components arranged in a directed acyclic graph. The cost of a recomputing a component would not have to be very great in order to justify caching it, given that it's possible to cache it so cheaply. The added cost of a `RefCell` could become significant in such a scenario, which might force us to artificially group some of the small components together, adding complexity and maybe hurting performance. Part of what makes Rust awesome is that we can create zero-cost abstractions so we don't have to make compromises like that. – Brent Kerby May 20 '19 at 14:38
  • The cost of a `RefCell` is a pointer-sized integer and a check of that integer. I doubt that either of those are going to show up in any performance trace unless the value being cached is smaller than or equal to a pointer and the computation of the value is less expensive than an integer comparison. – Shepmaster May 20 '19 at 16:14
  • For what it's worth: [once_cell](https://crates.io/crates/once_cell) – Shepmaster May 20 '19 at 16:18
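
For reference, the once_cell crate pointed to in the last comment provides essentially this abstraction out of the box: its single-threaded once_cell::unsync::OnceCell<T> (internally little more than an UnsafeCell<Option<T>>, if I recall its implementation correctly) exposes get_or_init, which matches the desired call. A minimal sketch of the Cache API built on top of it, assuming the crate is added as a dependency:

use once_cell::unsync::OnceCell;

pub struct Cache<T> {
    value: OnceCell<T>,
}

impl<T> Cache<T> {
    pub fn new() -> Self {
        Cache { value: OnceCell::new() }
    }

    /// `get_or_init` runs `f` on the first call and returns a `&T`
    /// borrowed from the cell on every call.
    pub fn call<F: FnOnce() -> T>(&self, f: F) -> &T {
        self.value.get_or_init(f)
    }
}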

0 Answers