I'm new to Rust and I'm working on a project where I need to scan files within a large number of folders and save the filtered data to a JSON file. I'm currently using Rayon to run a parallel for_each over a 'Vec' containing the folders. Within the loop, I read a file, filter the useful information, and save it to a file.

This is the final working version. However, I suspect it is not the best solution.

use rayon::prelude::*;
use std::fs::OpenOptions;
use std::io::Write;

fn main() {
    // ...

    // Imagine this is full of data
    let mut folder_nas: Vec<FolderNAS> = Vec::new();

    // Open an out.json file in append mode to store the results
    let mut file = OpenOptions::new()
        .write(true)
        .append(true)
        .open(FILENAME)
        .unwrap();
    // write_all() takes a byte slice, hence the b"..." literals
    file.write_all(b"Some data").unwrap();

    folder_nas.par_iter().for_each(|x| {
        // Each iteration opens its own handle to the same file
        let mut file_iterator = OpenOptions::new()
            .write(true)
            .append(true)
            .open(FILENAME)
            .unwrap();
        file_iterator
            .write_all(b"Some filtered data")
            .unwrap();
    });
    file.write_all(b"Some data").unwrap();
}

At first, coming from other languages, I tried this:

fn main() {
    // ...

    // Imagine this is full of data
    let mut folder_nas: Vec<FolderNAS> = Vec::new();

    // Open an out.json file in append mode to store the results
    let mut file = OpenOptions::new()
        .write(true)
        .append(true)
        .open(FILENAME)
        .unwrap();
    file.write_all(b"Some data").unwrap();

    folder_nas.par_iter().for_each(|x| {
        // Notice the name difference: this reuses the outer `file`
        file.write_all(b"Some filtered data")
            .unwrap();
    });
    file.write_all(b"Some data").unwrap();
}

This approach gave me a compile error, because the file variable is used both inside the for_each closure and after it. My workaround was to open a new OpenOptions writer inside the for_each. My question is: how can I use the existing file variable instead of creating a new writer?

aitorru
  • Maybe useful https://stackoverflow.com/questions/67230394/can-i-capture-some-things-by-reference-and-others-by-value-in-a-closure – Peterrabbit Apr 23 '23 at 18:27
  • [Well it works for me](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=68211eff0199b9b05e2aef9464613a27). – Chayim Friedman Apr 23 '23 at 18:30
  • Important Note: Your "working version" is undefined behavior on an operating system level. Try running [this playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=29036726675cdeb5af1649759b1e8158) a couple of times to see what I mean. – cafce25 Apr 23 '23 at 18:54
  • @cafce25 I don't think it's UB, just that it may overwrite data. And with `append` it works. – Chayim Friedman Apr 23 '23 at 19:03
  • @ChayimFriedman even with `append` the output is [arbitrarily interleaved](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9ed779e1c30d28ecbc3ea2bb23d5eecd), that's a race condition and thus UB. – cafce25 Apr 23 '23 at 19:25
  • @cafce25 That's a race condition but it's not UB, it is just that the data may interleave. – Chayim Friedman Apr 23 '23 at 19:25
  • @cafce25 Forgive my ignorance, but would the race condition problem disappear if I used the solution from the first comment? – aitorru Apr 23 '23 at 20:44
  • @cafce25 The term "undefined behavior" has a very specific technical meaning in languages like C, C++, and Rust. A behavior where something unpredictable may happen (like interleaved output) doesn't automatically imply UB. Data races are UB in Rust, but those are a subset of race conditions, and cannot happen in safe Rust, so this is not an example of one. – user4815162342 Apr 23 '23 at 22:08

1 Answer

As demonstrated by Chayim Friedman in the comments, you don't really need to mutate a File in order to write to it. This is because &File implements Write, reflecting the fact that it's perfectly fine to write to an OS-level file handle from multiple threads. However, there are two issues with that approach:

  • there is no guarantee that <&File>::write_all() will be able to write everything in one go. If the underlying File::write() indicates that only a portion of the data has been written, it will issue a fresh write() to write out the rest. This write() might come after another thread has issued its own write()s, leading to interleaved (corrupted) data in the file.
  • the trick of writing to &file won't work if you decide to wrap your File in a BufWriter, which is reasonable if you're writing out a non-trivial amount of data. Likewise, it won't work for arbitrary IO sinks, which do require &mut access to be written to.
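
For illustration, here is a minimal sketch of the &File approach described above. It uses the standard library's scoped threads instead of Rayon so that it has no external dependencies (an assumption made for self-containment). The function name `write_through_shared_ref` and the file name `shared_ref_demo.txt` are hypothetical.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::Write;
use std::path::Path;
use std::thread;

/// Appends one short line per worker to `path` through a shared `&File`
/// and returns the resulting file contents.
/// (Hypothetical helper, not part of the original question or answer.)
fn write_through_shared_ref(path: &Path, workers: usize) -> std::io::Result<String> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;

    thread::scope(|s| {
        for i in 0..workers {
            let file = &file;
            s.spawn(move || {
                // `Write` is implemented for `&File`, so a shared reference
                // is enough; only the local binding itself has to be `mut`.
                let mut handle: &File = file;
                // If the OS accepted only part of the buffer, write_all()
                // would issue further write() calls, and another thread's
                // data could land in between.
                handle
                    .write_all(format!("worker {i}\n").as_bytes())
                    .unwrap();
            });
        }
    });

    fs::read_to_string(path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("shared_ref_demo.txt");
    let _ = fs::remove_file(&path); // start from a clean file
    let contents = write_through_shared_ref(&path, 4)?;
    println!("wrote {} bytes", contents.len());
    Ok(())
}
```

With records this small, a single write() will normally accept the whole buffer, so the interleaving risk mostly shows up with larger records or sinks that accept partial writes.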

Therefore I would recommend just using a Mutex, which fixes both issues (and which is what languages like C and C++ simply do under the hood):

use rayon::prelude::*;
use std::fs::OpenOptions;
use std::io::Write;
use std::sync::Mutex;

let mut file = OpenOptions::new()
    .write(true)
    .append(true)
    .open("out.json")
    .unwrap();
file.write_all(b"Some data").unwrap();

// wrap file in a Mutex to use it from multiple threads
let file = Mutex::new(file);
folder_nas.par_iter().for_each(|_x| {
    // lock the mutex to write to the file
    file.lock().unwrap().write_all(b"Some filtered data").unwrap();
});
// extract the original file and use it as before
let mut file = file.into_inner().unwrap();

file.write_all(b"Some data").unwrap();
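
The same Mutex pattern also covers the BufWriter case, since the lock hands out the &mut access that a buffered writer needs. Below is a minimal sketch under stated assumptions: it uses std scoped threads in place of Rayon to stay dependency-free, and the function name `write_buffered`, the sample items, and the file name `mutex_bufwriter_demo.txt` are hypothetical.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{BufWriter, Write};
use std::path::Path;
use std::sync::Mutex;
use std::thread;

/// Writes each item as one line through a shared `Mutex<BufWriter<File>>`
/// and returns the resulting file contents.
/// (Hypothetical helper, not part of the original answer.)
fn write_buffered(path: &Path, items: &[&str]) -> std::io::Result<String> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    let writer: Mutex<BufWriter<File>> = Mutex::new(BufWriter::new(file));

    thread::scope(|s| {
        for item in items {
            let writer = &writer;
            s.spawn(move || {
                // Build the whole record first, then write it in one call
                // while holding the lock, so records cannot interleave.
                let record = format!("{item}\n");
                writer.lock().unwrap().write_all(record.as_bytes()).unwrap();
            });
        }
    });

    // Take the writer back out of the mutex and flush buffered data.
    let mut writer = writer.into_inner().unwrap();
    writer.flush()?;
    drop(writer);

    fs::read_to_string(path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("mutex_bufwriter_demo.txt");
    let _ = fs::remove_file(&path); // start from a clean file
    let contents = write_buffered(&path, &["alpha", "beta", "gamma"])?;
    // The order of the lines depends on thread scheduling.
    print!("{contents}");
    Ok(())
}
```

Each line arrives intact, but the order of lines is not deterministic. Keeping the expensive work (reading and filtering) outside the lock and holding it only for the write keeps contention low.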

user4815162342
  • Technically, you can wrap `&File` in `BufWriter`, but this will likely lead to the problem of interleaved data. – Chayim Friedman Apr 25 '23 at 18:03
  • @ChayimFriedman Wrapping the `&File` in `BufWriter` doesn't help because that runs into the problem the OP had while trying to write directly to `File`. I.e. to write to the resulting `BufWriter`, the closure passed to `for_each()` would need to borrow it mutably, which would make it an `FnMut`, and `for_each()` expects `Fn`. – user4815162342 Apr 25 '23 at 18:48
  • I meant sharing the underlying `File` but giving each thread its own `BufWriter`. – Chayim Friedman Apr 25 '23 at 18:50
  • @ChayimFriedman Oh, right. Agreed, that will compile, but will run into the interleaving problem. In addition to that, `BufWriter` will fail to provide a speedup if each individual iteration contains only a few writes (or just one, as shown in the question). – user4815162342 Apr 25 '23 at 19:00
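
For reference, the variant discussed in the last few comments (one shared File, with a separate `BufWriter<&File>` per thread) could look like the sketch below. It is shown only to illustrate that it compiles; as noted above, nothing prevents different threads' flushes from interleaving. It uses std scoped threads instead of Rayon, and the function name `write_with_per_thread_buffers` and the file name `per_thread_bufwriter_demo.txt` are hypothetical.

```rust
use std::fs::{self, OpenOptions};
use std::io::{BufWriter, Write};
use std::path::Path;
use std::thread;

/// Gives every worker its own `BufWriter` over a shared `&File`.
/// (Hypothetical helper; buffers are flushed independently, so output
/// from different workers is not guaranteed to stay contiguous.)
fn write_with_per_thread_buffers(path: &Path, workers: usize) -> std::io::Result<String> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;

    thread::scope(|s| {
        for i in 0..workers {
            let file = &file;
            s.spawn(move || {
                // `&File` implements `Write`, so it can back a BufWriter.
                let mut out = BufWriter::new(file);
                for j in 0..3 {
                    writeln!(out, "worker {i} line {j}").unwrap();
                }
                // Flush explicitly so that write errors are not silently
                // dropped when the BufWriter goes out of scope.
                out.flush().unwrap();
            });
        }
    });

    fs::read_to_string(path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("per_thread_bufwriter_demo.txt");
    let _ = fs::remove_file(&path); // start from a clean file
    let contents = write_with_per_thread_buffers(&path, 4)?;
    println!("wrote {} bytes", contents.len());
    Ok(())
}
```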