
I want to generate a large file of pseudo-random ASCII characters given the parameters: size per line and number of lines. I cannot figure out a way to do this without allocating new Strings for each line. This is what I have: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=42f5b803910e3a15ff20561117bf9176

use rand::{Rng, SeedableRng};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let mut data: Vec<u8> = Vec::new();
    write_random_lines(&mut data, 10, 10)?;
    println!("{}", std::str::from_utf8(&data)?);
    Ok(())
}

fn write_random_lines<W>(
    file: &mut W,
    line_size: usize,
    line_count: usize,
) -> Result<(), Box<dyn Error>>
where
    W: std::io::Write,
{
    for _ in 0..line_count {
        let mut s: String = rand::rngs::SmallRng::from_entropy()
            .sample_iter(rand::distributions::Alphanumeric)
            .take(line_size)
            .collect();
        s.push('\n');
        file.write_all(s.as_bytes())?; // write_all: a bare write may write only part of the buffer
    }
    Ok(())
}

I'm creating a new String for every line, so I believe this is not memory efficient. There is fn fill_bytes(&mut self, dest: &mut [u8]), but it produces arbitrary bytes rather than ASCII characters.

I would prefer not to create a new SmallRng for each line, but it is used inside a loop and SmallRng cannot be copied.

How can I generate a random file in a more memory and time efficient way?

Shepmaster
MakotoE

3 Answers


This modification of your code does not allocate any Strings and also does not construct a new SmallRng each time, but I have not benchmarked it:

fn write_random_lines<W>(
    file: &mut W,
    line_size: usize,
    line_count: usize,
) -> Result<(), Box<dyn Error>>
where
    W: std::io::Write,
{
    // One random data iterator.
    let mut rng_iter = rand::rngs::SmallRng::from_entropy()
        .sample_iter(rand::distributions::Alphanumeric);

    // Temporary storage for encoding of chars. If the characters used
    // are not all ASCII then its size should be increased to 4.
    let mut char_buffer = [0; 1];

    for _ in 0..line_count {
        for _ in 0..line_size {
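            // write_all keeps writing until the whole slice is written;
            // a bare write may stop after only part of it.
            file.write_all(
                rng_iter.next()
                    .unwrap()  // iterator is infinite so this never fails
                    .encode_utf8(&mut char_buffer)
                    .as_bytes())?;
        }
        file.write_all(b"\n")?;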
    }
    Ok(())
}
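
As a small aside on the buffer size: char::encode_utf8 panics if the buffer it is given is too small for the character, which is why the one-byte char_buffer is only safe while the output is pure ASCII. The sketch below is only an illustration; the helper name encoded_len is hypothetical and not part of the code above.

```rust
/// Hypothetical helper: returns how many bytes `c` occupies in UTF-8.
/// A 4-byte buffer is large enough for any `char`.
fn encoded_len(c: char) -> usize {
    let mut buf = [0u8; 4];
    // encode_utf8 writes into `buf` and returns the encoded sub-slice
    // as a &mut str, so its length is the number of bytes used.
    c.encode_utf8(&mut buf).len()
}
```

Alphanumeric characters always come out as one byte, so a one-byte buffer is enough for this question; any other character set would need the four-byte buffer.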

I am new to Rust, so I may be missing some ways to tidy this up. Also, note that this writes only one character at a time; if your W is more expensive per operation than an in-memory buffer, you probably want to wrap it in std::io::BufWriter, which will batch writes to the destination (using a buffer that needs to be allocated, but only once).
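
To make the batching concrete, here is a minimal sketch that uses only the standard library. CountingWriter and buffered_single_byte_writes are hypothetical names invented for this illustration; the counter stands in for a destination where each write call is expensive, such as a file.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical destination that counts how many `write` calls reach it.
struct CountingWriter {
    calls: usize,
    bytes: Vec<u8>,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Issues `total` one-byte writes through a `BufWriter` and returns
/// (write calls that reached the inner writer, bytes it received).
fn buffered_single_byte_writes(total: usize) -> io::Result<(usize, usize)> {
    let mut inner = CountingWriter { calls: 0, bytes: Vec::new() };
    {
        // The BufWriter allocates its buffer once, here.
        let mut out = BufWriter::new(&mut inner);
        for _ in 0..total {
            out.write_all(b"x")?;
        }
        // Flush explicitly so that any error is reported instead of
        // being silently ignored when `out` is dropped.
        out.flush()?;
    } // `out` is dropped here, which ends the borrow of `inner`.
    Ok((inner.calls, inner.bytes.len()))
}
```

With the default BufWriter capacity, far fewer calls reach the inner writer than the number of one-byte writes issued; the exact count depends on the buffer size.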

Kevin Reid
  • Thanks for the idea. I tested this and it seems it was not any better. Please see my edit for the result. – MakotoE Sep 02 '20 at 03:10
  • My bad, I made a dumb mistake. Yours is faster but with the same amount of allocations. It might be due to the fact that the iterator is reused in your method. – MakotoE Sep 02 '20 at 03:16
  • I wonder why memory allocation is the same for both though. I expected mine to be much worse. – MakotoE Sep 02 '20 at 03:22
  • @Makoto Your procedure almost certainly deallocates and reallocates the same small buffer over and over again, so while the allocator is churning a *lot*, it never needs more than a hundred bytes at once (not counting the 100MB buffer which is there in both cases). Not sure where the extra allocation is coming from in Kevin's case but I am always suspicious of measurement problems. – trent Sep 02 '20 at 06:00
  • @trentcl I was putting all the generated bytes into a `Vec` instead of writing to a file, so that answers it. – MakotoE Sep 02 '20 at 23:14

You can easily reuse a String in a loop by creating it outside the loop and clearing it after using the contents:

    // Use Kevin's suggestion not to make a new `SmallRng` each time:
    let mut rng_iter =
        rand::rngs::SmallRng::from_entropy().sample_iter(rand::distributions::Alphanumeric);
    let mut s = String::with_capacity(line_size + 1);  // allocate the buffer
    for _ in 0..line_count {
        s.extend(rng_iter.by_ref().take(line_size));   // fill the buffer
        s.push('\n');
        file.write_all(s.as_bytes())?;                 // use the contents
        s.clear();                                     // clear the buffer
    }

String::clear removes the contents of the String but does not free its backing buffer, so the buffer can be reused without reallocating.
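
One way to convince yourself of this is to check that the buffer's address and capacity stay the same across iterations. The following is a rough sketch using only the standard library; the function name buffer_is_reused is hypothetical, and a fixed character stands in for the random data.

```rust
/// Fills and clears a single String `rounds` times and reports whether
/// its backing buffer kept the same address and capacity throughout.
fn buffer_is_reused(line_size: usize, rounds: usize) -> bool {
    let mut s = String::with_capacity(line_size + 1); // allocate once
    let ptr = s.as_ptr();
    let cap = s.capacity();
    for _ in 0..rounds {
        // Stand-in for the random characters: one byte per char.
        s.extend(std::iter::repeat('a').take(line_size));
        s.push('\n');
        if s.as_ptr() != ptr || s.capacity() != cap {
            return false; // the String reallocated while being filled
        }
        s.clear(); // length becomes 0, capacity is kept
        if s.capacity() != cap {
            return false; // clear() released or changed the buffer
        }
    }
    true
}
```

Calling buffer_is_reused(100, 1000) should return true, since each line fits within the capacity reserved up front.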


trent
  • Thank you; I decided to use this with `BufWriter` since it's the simplest and most straightforward way. In terms of speed, this was 15% faster than Kevin's. `by_ref()` is going to be super handy in loops. – MakotoE Sep 02 '20 at 23:17

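I (MakotoE) benchmarked Kevin Reid's answer against my original code. Their method is roughly twice as fast, while memory usage is about the same. In the benchmarks, write_random_lines0 is my original function and write_random_lines1 is Kevin's.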

Benchmarking time-wise:

#[cfg(test)]
mod tests {
    extern crate test;
    use test::Bencher;
    use super::*;

    #[bench]
    fn bench_write_random_lines0(b: &mut Bencher) {
        let mut data: Vec<u8> = Vec::new();
        data.reserve(100 * 1000000);
        b.iter(|| {
            write_random_lines0(&mut data, 100, 1000000).unwrap();
            data.clear();
        });
    }

    #[bench]
    fn bench_write_random_lines1(b: &mut Bencher) {
        let mut data: Vec<u8> = Vec::new();
        data.reserve(100 * 1000000);
        b.iter(|| {
            // This is Kevin's implementation
            write_random_lines1(&mut data, 100, 1000000).unwrap();
            data.clear();
        });
    }
}

test tests::bench_write_random_lines0 ... bench: 764,953,658 ns/iter (+/- 7,597,989)
test tests::bench_write_random_lines1 ... bench: 360,662,595 ns/iter (+/- 886,456)

Benchmarking memory usage with Valgrind's Massif shows that both are about the same. Mine used 3.072 Gi total with a 101.0 MB peak; Kevin's used 4.166 Gi total with a 128.0 MB peak.

Shepmaster