I'm trying to read a gzip-compressed file line by line.

I used the method suggested in this post. It works fine for the first ~700 lines of the file, but then stops without an error, skipping the millions of lines that remain.

Here is a minimal working example (Rust 1.57.0):

use std::io::{prelude::*, BufReader};
use std::fs::File;
use flate2; // 1.0
use flate2::read::GzDecoder;

fn main() {
    let r1 = "/home/path/to/bigfile.gz";
    let file = File::open(r1).unwrap();
    let reader = BufReader::new(GzDecoder::new(file));
    let mut i = 0;
    for l in reader.lines() {
        println!("{}", i);
        i+=1;
    }
}

Since this code compiles and is able to read the start of the file, why does it stop at some point?

E_net4
Lurk
  • Cannot reproduce the issue with a text file containing two million lines and that exact code. What may help diagnose the problem is to listen to the compiler warning and handle the `Result` in `l`. – E_net4 Feb 23 '22 at 15:33
  • @E_net4thecurator Thanks for your response. The compiler warning refers to the unused variable `l`; it goes away with `println!("{} {}", i, l.unwrap());`, but the issue remains the same. I assume there is something wrong with my file, but I don't know how to test it. – Lurk Feb 23 '22 at 15:38
  • Two things come to mind: 1) give more details (rustc version, and the specified version of the `flate2` crate); 2) try a different file, and provide a way to produce one that reproduces the problem. – E_net4 Feb 23 '22 at 15:46
  • @E_net4thecurator For the versions, I have rustc 1.57.0 (f1edd0429 2021-11-29) and flate2 = "1.0". Regarding the file, I can't share it since it's medical data. To give more details: the files are fastq.gz files containing sequencing reads, and each one always fails at the same line. I have two types of files, one with indexes and the other with reads. The read files have longer lines and fail earlier than the index files. – Lurk Feb 23 '22 at 15:59
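
Following the suggestion in the comments to handle the `Result` yielded by `lines()`, here is a minimal sketch of a line counter that returns read errors instead of ignoring them. `count_lines` is a hypothetical helper name, and an in-memory `Cursor` stands in for the `GzDecoder`, on the assumption that any `Read` implementation can be wrapped the same way.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read};

/// Counts the lines in `source`, returning the first read error instead of ignoring it.
/// `count_lines` is a hypothetical helper, not part of std or flate2.
fn count_lines<R: Read>(source: R) -> io::Result<usize> {
    let reader = BufReader::new(source);
    let mut count = 0;
    for line in reader.lines() {
        // `?` propagates I/O errors and invalid UTF-8 instead of dropping them.
        line?;
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Stand-in for `GzDecoder::new(file)`; any `Read` works here.
    let data = Cursor::new(b"@read1\nACGT\n+\nFFFF\n".to_vec());
    println!("{} lines", count_lines(data)?);
    Ok(())
}
```

This only surfaces errors the reader actually reports. A decoder that signals a clean end of stream will still end the loop without any error, so a short count with no error points at the input format rather than at the loop.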

1 Answer

I found the issue: my files were not plain gzip but bgzip encoded, so the flate2 parser treated the end of the first bgzip block as the end of the file.

The solution is to use rust_htslib::bgzf::Reader like this:

use rust_htslib::bgzf::Reader;
let r1_reader = BufReader::new(Reader::from_path(r1).unwrap());
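
For anyone unsure which format a file is in, the two can be told apart from the first header. The sketch below assumes the BGZF layout described in the SAM specification: a gzip header with the FEXTRA flag set and an extra subfield tagged `BC` with a 2-byte payload. `is_bgzf` is a hypothetical helper name, and the function only inspects the first block header, so it is a quick check rather than a validator.

```rust
/// Returns true if `header` starts with a gzip member whose extra field
/// contains the BGZF "BC" subfield. `is_bgzf` is a hypothetical helper.
fn is_bgzf(header: &[u8]) -> bool {
    // Fixed gzip header: magic 0x1f 0x8b, then CM = 8 (deflate).
    if header.len() < 12 || header[0] != 0x1f || header[1] != 0x8b || header[2] != 8 {
        return false;
    }
    // FLG bit 2 (FEXTRA) must be set for an extra field to be present.
    if header[3] & 0x04 == 0 {
        return false;
    }
    // XLEN: little-endian length of the extra field that starts at byte 12.
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let extra = match header.get(12..12 + xlen) {
        Some(e) => e,
        None => return false, // not enough bytes supplied
    };
    // Walk the subfields: SI1, SI2, SLEN (u16 LE), then SLEN bytes of data.
    let mut pos = 0;
    while pos + 4 <= extra.len() {
        let slen = u16::from_le_bytes([extra[pos + 2], extra[pos + 3]]) as usize;
        if extra[pos] == b'B' && extra[pos + 1] == b'C' && slen == 2 {
            return true;
        }
        pos += 4 + slen;
    }
    false
}

fn main() {
    // The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
    let bgzf_eof: [u8; 28] = [
        0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x06, 0x00, 0x42, 0x43,
        0x02, 0x00, 0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    ];
    println!("BGZF header detected: {}", is_bgzf(&bgzf_eof));
}
```

To check a real file, read its first few dozen bytes into a buffer and pass the slice to `is_bgzf`.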
Lurk
  • I had the same issue and worked around it by converting the file from bgzip to gzip on the command line. Pulling in htslib seemed too heavy when it was not already a dependency. Is it a bug in flate2? – gbinux Sep 25 '22 at 15:52
  • @gbinux I don't think it's a bug in flate2, but simply a question of an unsupported file format. From what I understand, bgzip compresses a file in blocks instead of one single chunk, which is why my original piece of code worked only until the end of the first block. – Lurk Sep 26 '22 at 08:40
  • Thanks for this. Your solution also solved the same problem I was having where I didn't realize the file was compressed with bgzip instead of gzip. Looks like there is also a [bgzip](https://docs.rs/bgzip/0.2.1/bgzip/) crate that could work. – Jon Chung Nov 09 '22 at 15:20
  • There's a good chance that you need flate2's "MultiGzDecoder". Apparently, the gzip spec allows multiple gzip members to be written sequentially in a single stream. https://docs.rs/flate2/latest/flate2/bufread/struct.MultiGzDecoder.html – michael_j_ward May 30 '23 at 20:15