
I am trying to perform a line-oriented operation on a gzip-compressed file that is bigger than available RAM, so reading it into a string first is not an option. How can I do this in Rust (short of gunzip -c file.gz | ./my-rust-program)?

My current solution is based on flate2 and a bunch of buffered readers:

use std::fs::File;
use std::io::prelude::*;
use std::io::BufReader;
use flate2::bufread::GzDecoder as BufGzDecoder;

fn main() {
    let fname = "path_to_a_big_file.gz";
    let f = File::open(fname).expect("Ooops.");
    let bf = BufReader::new(f); // Here's the first reader so I can plug data into BufGzDecoder.
    let br = BufGzDecoder::new(bf); // Yep, here. But, oops, BufGzDecoder has no lines method,
                                    // so try to stick it into a std BufReader.
    let bf2 = BufReader::new(br); // What!? This works!? Yes it does.
    // After a long time ...
    eprintln!("count: {}", bf2.lines().count());
    // ... the line count is here.
}

To put the above into words: I noticed that I cannot plug a file straight into flate2::bufread::GzDecoder, so I first created a std::io::BufReader instance, which is compatible with the decoder's constructor. But I did not see any useful iterator associated with flate2::bufread::GzDecoder, so I built another std::io::BufReader on top of it. Surprisingly, that worked: I got my Lines iterator and it read the whole file in just over a minute on my machine. Still, it feels overly verbose and inelegant, and possibly inefficient (this is the part I am most worried about).

E_net4
Mali Remorker

1 Answer


Each "buffer-inducing" step described in the question is necessary here.

  1. The gzip decoder implementation requires a buffered reader as part of the decoding process. This buffer holds compressed data, in which line boundaries cannot be located because of how gzip works.
  2. The second BufReader buffers the decompressed output, which is where the line separators can be found and complete lines of text returned.

However, there is a shortcut for the first step. The flate2 crate provides read::GzDecoder, which takes a regular reader and wraps it in a buffered reader internally.

use std::io::BufReader;
use flate2::read::GzDecoder;

// `file` is an already opened std::fs::File.
let reader = BufReader::new(GzDecoder::new(file));

With this done, the recommended ways to improve efficiency are to build the program with the right profile (release mode) and to reuse the same String value for every line by calling read_line instead of using the lines() iterator, which allocates a new String per line. Since read_line appends to the String it is given, the buffer has to be cleared before each call.
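
A minimal sketch of the read_line approach might look like the following. It is written against any BufRead so that it runs with the standard library alone; the helper name count_lines_reusing_buffer and the in-memory Cursor input are assumptions for illustration, and in the real program the reader would be BufReader::new(GzDecoder::new(file)).

```rust
use std::io::{self, BufRead, Cursor};

/// Counts lines while reusing a single `String` as the line buffer.
/// `count_lines_reusing_buffer` is a hypothetical helper name.
fn count_lines_reusing_buffer<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut line = String::new();
    let mut count = 0;
    loop {
        // read_line appends to the String, so empty it first.
        // clear() keeps the allocated capacity for the next line.
        line.clear();
        // read_line returns Ok(0) once the end of the input is reached.
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        // Unlike lines(), `line` still includes its trailing newline here, if any.
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // In-memory stand-in for BufReader::new(GzDecoder::new(file)).
    let input = Cursor::new("first\nsecond\nthird, no trailing newline");
    println!("count: {}", count_lines_reusing_buffer(input)?);
    Ok(())
}
```

Whether this is measurably faster than lines() depends on the data, so it is worth timing both variants in release mode on a representative file.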


E_net4

  • Thanks for the clear explanation. Btw, I did not notice any speedup from replacing the lines iterator construct with read_line based solution. I guess this only becomes apparent when the lines are long enough. I have millions of shortish lines. – Mali Remorker Jan 18 '21 at 18:54
  • Thanks for the answer. I just need to mention that "lines" method calls "read_line" function under the hood by implementing the "Iterator" trait: https://doc.rust-lang.org/stable/src/std/io/mod.rs.html#2808-2810 – Babak Karimi Bavandpour Dec 04 '22 at 15:09