
tl;dr What is the best "Rust way" to create some byte storage, in this case a Vec<u8>, store that Vec<u8> in a struct field where it can be looked up by key (like a BTreeMap<usize, &Vec<u8>>), and later read those Vec<u8> from other structs?
      Can this be extrapolated to a generally good Rust design for similar structs that act as storage and cache for blobs of bytes (Vec<u8>, [u8; 16384], etc.) accessible with a key (a usize offset, a u32 index, a String file path, etc.)?

Goal

I'm trying to create a byte storage struct and impl functions that:

  1. stores 16384 bytes read from disk on demand into "blocks" of Vec<u8> of capacity 16384
  2. other structs will analyze the various Vec<u8> and may need to store their own references to those "blocks"
  3. be efficient: have only one copy of a "block" in memory, avoid unnecessary copying, clones, etc.

Unfortunately, with each implementation attempt I run into difficult problems of borrowing, lifetime elision, mutability, copying, or something else.

Reduced Code example

I created a struct BlockReader that

  1. creates a Vec<u8> (Vec<u8>::with_capacity(16384)) typed as Block
  2. reads from a file (using File::seek and File::take::read_to_end) and stores 16384 of u8 into a Vec<u8>
  3. stores a reference to the Vec<u8> within a BTreeMap typed as Blocks

(playground code)

use std::io::Seek;
use std::io::SeekFrom;
use std::io::Read;
use std::fs::File;
use std::collections::BTreeMap;

type Block = Vec<u8>;
type Blocks<'a> = BTreeMap<usize, &'a Block>;

pub struct BlockReader<'a> {
    blocks: Blocks<'a>,
    file: File,
}

impl<'a> BlockReader<'a> {
    /// read a "block" of 16384 `u8` at file offset 
    /// `offset` which is multiple of 16384
    /// if the "block" at the `offset` is cached in
    /// `self.blocks` then return a reference to that
    /// XXX: assume `self.file` is already `open`ed file
    ///      handle
    fn readblock(& mut self, offset: usize) -> Result<&Block, std::io::Error> {
        // the data at this offset is the "cache"
        // return reference to that
        if self.blocks.contains_key(&offset) {
            return Ok(&self.blocks[&offset]);
        }
        // have not read data at this offset so read
        // the "block" of data from the file, store it,
        // return a reference
        let mut buffer = Block::with_capacity(16384);
        self.file.seek(SeekFrom::Start(offset as u64))?;
        self.file.read_to_end(&mut buffer);
        self.blocks.insert(offset, & buffer);
        Ok(&self.blocks[&offset])
    }
}

Example use-case problem

There have been many problems with each implementation. For example, two calls to BlockReader.readblock by a struct BlockAnalyzer1 have caused endless difficulties:

pub struct BlockAnalyzer1<'b> {
   pub blockreader: BlockReader<'b>,
}

impl<'b> BlockAnalyzer1<'b> {
    /// contrived example function
    pub fn doStuff(&mut self) -> Result<bool, std::io::Error> {
        let mut b: &Block;
        match self.blockreader.readblock(3 * 16384) {
            Ok(val) => {
                b = val;
            },
            Err(err) => {
                return Err(err);
            }
        }
        match self.blockreader.readblock(5 * 16384) {
            Ok(val) => {
                b = val;
            },
            Err(err) => {
                return Err(err);
            }
        }
        Ok(true)
    }
}

results in

error[E0597]: `buffer` does not live long enough
  --> src/lib.rs:34:36
   |
15 | impl<'a> BlockReader<'a> {
   |      -- lifetime `'a` defined here
...
34 |         self.blocks.insert(offset, & buffer);
   |         ---------------------------^^^^^^^^-
   |         |                          |
   |         |                          borrowed value does not live long enough
   |         argument requires that `buffer` is borrowed for `'a`
35 |         Ok(&self.blocks[&offset])
36 |     }
   |     - `buffer` dropped here while still borrowed

However, I ran into many other errors with different permutations of this design. For example:

error[E0499]: cannot borrow `self.blockreader` as mutable more than once at a time
   --> src/main.rs:543:23
    |
463 | impl<'a> BlockUser1<'a> {
    |      ----------- lifetime `'a` defined here
...
505 |             match self.blockreader.readblock(3 * 16384) {
    |                   ---------------------------------------
    |                   |
    |                   first mutable borrow occurs here
    |                   argument requires that `self.blockreader` is borrowed for `'a`
...
543 |                 match self.blockreader.readblock(5 * 16384) {
    |                       ^^^^^^^^^^^^^^^^ second mutable borrow occurs here

In BlockReader, I've tried permutations of "Block" storage using Vec<u8>, &Vec<u8>, Box<Vec<u8>>, Box<&Vec<u8>>, &Box<&Vec<u8>>, &Pin<&Box<&Vec<u8>>>, etc. However, each permutation runs into various confounding problems with borrowing, lifetimes, and mutability.

Again, I'm not looking for the specific fix. I'm looking for a generally good Rust-oriented design approach to this general problem: store a blob of bytes managed by some struct, have other structs get references (or pointers, etc.) to a blob of bytes, and read that blob of bytes in loops (while possibly storing new blobs of bytes).

The Question For Rust Experts

How would a Rust expert approach this problem?
How should I store the Vec<u8> (Block) in BlockReader.blocks, and also allow other structs to store their own references (or pointers, or references to pointers, or pinned Box pointers, etc.) to a Block?
Should the other structs copy or clone a Box<Block> or a Pin<Box<Block>> or something else?
Would a different storage type, such as a fixed-size array (type Block = [u8; 16384];), be easier to pass references to?
Should other structs like BlockUser1 be given &Block, or Box<Block>, or &Pin<&Box<&Block>>, or something else?

Again, each Vec<u8> (Block) is written once (during BlockReader.readblock) and may be read many times by other Structs by calling BlockReader.readblock and later by saving their own reference/pointer/etc. to that Block (ideally, maybe that's not ideal?).

JamesThomasMoon
  • Implementing a cache is tricky. See: [How does interior mutability work for caching behavior?](https://stackoverflow.com/questions/32062285/how-does-interior-mutability-work-for-caching-behavior) – John Kugelman Aug 31 '21 at 00:25
  • It sounds like you might be re-implementing your OS's page cache at the application level. Have you benchmarked simple file reads and determined that a cache would improve performance? You might be fine just doing regular, unsophisticated, redundant reads. And if not, consider [memory mapping](https://docs.rs/crate/memmap/0.7.0), which will let you access files via memory accesses without needing explicit reads. – John Kugelman Aug 31 '21 at 00:30
  • "_It sounds like you might be re-implementing your OS's page cache_" Thanks @JohnKugelman . Indeed, it looks that way. However, I have further plans that are not just redoing OS-provided capability. But for sake of simplicity, I removed much of that code. I'll look into _memory mapping_ but for now, I'm curious about the larger problem of how to approach any program like this in a rust-friendly manner. I'd guess there are many similar patterns of storing bytes once in specialized storage `struct` and then later reading those bytes with different `struct`s. – JamesThomasMoon Aug 31 '21 at 00:49
  • You are trying to store a reference to a local variable, which is an error, because the local variable `buffer` will be destroyed at the end of the method. You should store an owned value instead. – Svetlin Zarev Aug 31 '21 at 04:19
  • Before trying to *satisfy* the compiler, you should decide, at a coarse-grained level, the *ownership* model of your data: who **owns** what, who **knows** what? After that, you can deal with language-related problems (e.g. interior mutability to ease the intended borrows). Here, your `BlockReader` does not own the blocks it knows; then who owns them? – prog-fh Aug 31 '21 at 06:19
  • Side note: `let mut buffer = Vec::with_capacity (16384); file.read_to_end (&mut buffer)?;` [will read until the end of file, even if that is more than 16kB.](https://stackoverflow.com/questions/68979882/readread-exact-does-not-fill-buffer) – Jmb Aug 31 '21 at 06:37
  • Hi @Jmb in my rudimentary testing the `read_to_end` worked for me. I think because I limited the capacity of the vector. – JamesThomasMoon Aug 31 '21 at 07:09
  • Thanks @prog-fh . "_Here, your BlockReader does not own the blocks it knows; then who owns them?_". That was the point of the question. I'm new to rust and having trouble determining that. I have tried implementations that seemed sensible, yet they all run into catch-22 dead-ends of compiler errors. – JamesThomasMoon Aug 31 '21 at 07:12
  • Thanks @SvetlinZarev. "_You should store an owned value instead_". I tried that, to the best of my knowledge, using things like `Box<&Block>`, `Pin::Box<&Block>`, `Box`, etc. I ended up in catch-22 dead-ends of compiler errors. – JamesThomasMoon Aug 31 '21 at 07:15
  • @JamesThomasMoon anything that has `&` is not an owned value. Storing just `Block` or `Rc` or `Arc` should be fine though – Svetlin Zarev Aug 31 '21 at 07:44
  • You're not limiting the capacity of the vector. You're just pre-allocating a given size, but `read_to_end` is free to increase that size if there is more data to read. Your code will only work if you read the last block of data in a file. – Jmb Aug 31 '21 at 08:07

1 Answer


You can put the Vec<u8> behind an Rc<RefCell<...>>, or simply an Rc<...> if the blocks are never mutated after they are created.
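
Since the question says each block is written once and only read afterwards, the plain Rc<...> form may be enough. The following is a minimal sketch of that idea, not a definitive implementation. The names RcBlockReader, source, and BLOCK_SIZE are hypothetical, the reader is made generic over Read + Seek (an assumption, so it can be exercised with an in-memory Cursor as well as a File), and offsets are u64 rather than usize.

```rust
use std::collections::BTreeMap;
use std::io::{self, Read, Seek, SeekFrom};
use std::rc::Rc;

/// Block size taken from the question (assumed fixed).
const BLOCK_SIZE: u64 = 16384;

type Block = Vec<u8>;

/// Hypothetical variant of `BlockReader`: the map owns each block through
/// an `Rc`, and callers receive cheap `Rc` clones instead of borrows.
pub struct RcBlockReader<R> {
    blocks: BTreeMap<u64, Rc<Block>>,
    source: R,
}

impl<R: Read + Seek> RcBlockReader<R> {
    pub fn new(source: R) -> Self {
        Self { blocks: BTreeMap::new(), source }
    }

    /// Return the cached block at `offset`, reading it first if needed.
    pub fn readblock(&mut self, offset: u64) -> io::Result<Rc<Block>> {
        // Cache hit: clone the pointer, not the bytes.
        if let Some(block) = self.blocks.get(&offset) {
            return Ok(Rc::clone(block));
        }
        // Cache miss: read at most one block from the source.
        let mut buffer = Block::with_capacity(BLOCK_SIZE as usize);
        self.source.seek(SeekFrom::Start(offset))?;
        self.source.by_ref().take(BLOCK_SIZE).read_to_end(&mut buffer)?;
        let block = Rc::new(buffer);
        self.blocks.insert(offset, Rc::clone(&block));
        Ok(block)
    }
}
```

Because the returned Rc<Block> does not borrow from the reader, a caller can hold the result of one readblock call while making the next one, which is the pattern that triggered E0499 in the question.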

If you need thread-safe access you'll need to use an Arc<Mutex<...>> or Arc<RwLock<...>> instead.
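
For the thread-safe case, one possible arrangement (a sketch under assumptions, not the only option) is to lock only the map and keep each block as an immutable Arc<Block>, so readers need no lock once they hold a block. SharedBlocks, get_or_load, and sum_on_threads are hypothetical names, and the load closure stands in for the real file read.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::thread;

type Block = Vec<u8>;

/// Hypothetical thread-safe cache: the map is guarded by a `Mutex`, while
/// each block is an immutable `Arc<Block>` that can be read without locking.
#[derive(Default)]
pub struct SharedBlocks {
    blocks: Mutex<BTreeMap<u64, Arc<Block>>>,
}

impl SharedBlocks {
    /// Return the cached block for `offset`, or create it with `load`.
    /// `load` is a stand-in for reading the block from disk.
    pub fn get_or_load<F>(&self, offset: u64, load: F) -> Arc<Block>
    where
        F: FnOnce() -> Block,
    {
        let mut blocks = self.blocks.lock().unwrap();
        // `entry` avoids the separate `contains_key` + index lookups.
        Arc::clone(blocks.entry(offset).or_insert_with(|| Arc::new(load())))
    }
}

/// Sum the bytes of one block on several threads that share the cache.
pub fn sum_on_threads(cache: Arc<SharedBlocks>, offset: u64, workers: usize) -> Vec<u64> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                // Dummy block contents; only the first caller's `load` runs.
                let block = cache.get_or_load(offset, || vec![1u8; 16384]);
                block.iter().map(|&b| u64::from(b)).sum::<u64>()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Note that this sketch holds the lock while load runs, so a slow read blocks other threads; whether that matters depends on the workload.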

Here's a converted version of your code. (There were a few typos and bits that needed changing to get it to compile - you should really fix those in your example, and give us something that nearly compiles...) You can also see this in the playground

use std::io::Seek;
use std::io::SeekFrom;
use std::io::Read;
use std::fs::File;
use std::cell::RefCell;
use std::rc::Rc;
use std::collections::BTreeMap;

type Block = Vec<u8>;
type Blocks = BTreeMap<usize, Rc<RefCell<Block>>>;

pub struct BlockReader {
    blocks: Blocks,
    file: File,
}

impl BlockReader {
    /// read a "block" of 16384 `u8` at file offset 
    /// `offset` which is multiple of 16384
    /// if the "block" at the `offset` is cached in
    /// `self.blocks` then return a reference to that
    /// XXX: assume `self.file` is already `open`ed file
    ///      handle
    fn readblock(& mut self, offset: usize) -> Result<Rc<RefCell<Block>>,std::io::Error> {
        // the data at this offset is the "cache"
        // return reference to that
        if self.blocks.contains_key(&offset) {
            return Ok(self.blocks[&offset].clone());
        }
        // have not read data at this offset so read
        // the "block" of data from the file, store it,
        // return a reference
        let mut buffer = Block::with_capacity(16384);
        self.file.seek(SeekFrom::Start(offset as u64))?;
        // `take` caps the read at one block; a bare `read_to_end` reads to EOF
        self.file.by_ref().take(16384).read_to_end(&mut buffer)?;
        self.blocks.insert(offset, Rc::new(RefCell::new(buffer)));
        Ok(self.blocks[&offset].clone())
    }
}

pub struct BlockAnalyzer1 {
   pub blockreader: BlockReader,
}

impl BlockAnalyzer1 {
    /// contrived example function
    pub fn doStuff(&mut self) -> Result<bool,std::io::Error> {
        let mut b: Rc<RefCell<Block>>;
        match self.blockreader.readblock(3 * 16384) {
            Ok(val) => {
                b = val;
            },
            Err(err) => {
                return Err(err);
            }
        }
        match self.blockreader.readblock(5 * 16384) {
            Ok(val) => {
                b = val;
            },
            Err(err) => {
                return Err(err);
            }
        }
        Ok(true)
    }
}
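
One caveat raised in the comments applies to any version of readblock: Vec::with_capacity only pre-allocates, and read_to_end on a bare reader keeps going until end of file. A minimal sketch of bounding the read with Read::take follows; read_block_at and BLOCK_SIZE are hypothetical names, and the function is generic over Read + Seek (an assumption) so it works with a File or an in-memory Cursor.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Block size taken from the question (assumed fixed).
const BLOCK_SIZE: u64 = 16384;

/// Read at most `BLOCK_SIZE` bytes starting at `offset`.
/// The result is shorter than `BLOCK_SIZE` only near the end of the source.
pub fn read_block_at<R: Read + Seek>(source: &mut R, offset: u64) -> io::Result<Vec<u8>> {
    let mut buffer = Vec::with_capacity(BLOCK_SIZE as usize);
    source.seek(SeekFrom::Start(offset))?;
    // `take` wraps the reader so that `read_to_end` stops after one block.
    source.by_ref().take(BLOCK_SIZE).read_to_end(&mut buffer)?;
    Ok(buffer)
}
```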
Svetlin Zarev
Michael Anderson