3

I've written a few scripts for processing FASTA/FASTQ files (e.g. fastx-length.pl), but would like to make them more generic and accept both compressed and uncompressed files as both command line parameters and as standard input (so that the scripts "just work" when you throw random files at them). It's quite common for me to be doing work on both uncompressed and compressed files (e.g. compressed read files, uncompressed assembled genomes), and slotting in things like <(zcat file.fastq.gz) gets quickly annoying.

Here's an example chunk from the fastx-length.pl script:

...
my @lengths = ();
my $inQual = 0; # false
my $seqID = "";
my $qualID = "";
my $seq = "";
my $qual = "";
while(<>){
  chomp; chomp; # double chomp for Windows CR/LF on Linux machines
  if(!$inQual){
    if(/^(>|@)((.+?)( .*?\s*)?)$/){
      my $newSeqID = $2;
      my $newShortID = $3;
      if($seqID){
        printf("%d %s\n", length($seq), $seqID);
        push(@lengths, length($seq));
      }
...

I can see that IO::Uncompress::Gunzip supports transparent uncompression via :

If this option is set and the input file/buffer is not compressed data, the module will allow reading of it anyway.

In addition, if the input file/buffer does contain compressed data and there is non-compressed data immediately following it, setting this option will make this module treat the whole file/buffer as a single data stream.

I would like to basically slot a transparent uncompress into the diamond operator, between loading each file and reading a line from the file inputs. Does anyone know how I can do this?

gringer
  • 410
  • 4
  • 13
  • 1
    Why are you calling chomp twice? Isn't the second chomp redundant? – winni2k Jun 11 '17 at 20:02
  • It's for removing CR/LF on windows-formatted files when they are used on a Linux system (which I do from time to time, because most of the people I collaborate with have Windows computers). See [here](http://www.perlmonks.org/?node_id=830687). – gringer Jun 11 '17 at 21:17
  • Lol. I should really get out more... – winni2k Jun 11 '17 at 22:55

3 Answers3

5

I often use:

die("Usage: prog.pl [file [...]]\n") if @ARGV == 0 && -t STDIN;
push(@ARGV, "-") unless @ARGV;
for my $fn (@ARGV) {
    open(FH, $fn =~ /\.gz$/? "gzip -dc $fn |" : $fn =~ /\.bz2$/? "bzip2 -dc $fn |" : $fn) || die;
    print while (<FH>);
    close(FH);
}

This strategy only works when you have gzip etc and name files with proper file extensions, but once you meet these requirements, it works with a variety of file types at the same time. As to -t STDIN, see explanation here.

user172818
  • 4,518
  • 1
  • 18
  • 20
2

This is something I have wanted to do for a long time as well. It's only recently that I learned how to do it robustly.

The approach does not require any file naming conventions. Instead it checks the gzip magic number, which is 0x1f8b. It requires reading the first two bytes of each file as a binary stream (using a really nifty function called unpack), and checking to see if the bytes match gzip's magic number. This seems to work for me:

$ echo "hi world" | gzip -c > hi_world.gz
$ echo "hi world" > hi_world.txt
$ echo "hi world" | gzip -c > not_a_gz_file
$ perl testgz.pl hi_world.gz hi_world.txt not_a_gz_file
hi_world.gz is gzipped!
hi_world.txt is not gzipped :(
not_a_gz_file is gzipped!

The contents of testgz.pl is below. Please excuse my perl. It's been a while...

# testgz.pl
my $GZIP_MAGIC_NUMBER = "1f8b";
my $GZIP_MAGIC_NUMBER_LENGTH = 2; # in bytes

for my $arg (@ARGV){
    if(is_gzipped($arg)){
        print "$arg is gzipped!\n";
    } else{
        print "$arg is not gzipped :(\n";
    }
}


sub is_gzipped{
    my $file_name = shift;
    open(my $fh, "<", $file_name)
      or die "Can't open < $file_name: $!";
    read($fh, $line, $GZIP_MAGIC_NUMBER_LENGTH);
    close($fh);
    return is_line_gzipped($line);
}

sub is_line_gzipped{
    my $line = shift;
    my $is_gzipped = 0;
    if (length($line) >= $GZIP_MAGIC_NUMBER_LENGTH){
        my $magic_number = unpack("H4", $line);
        $is_gzipped = 1 if($magic_number == $GZIP_MAGIC_NUMBER);
    }
    return $is_gzipped
}

In answer to the question, I would suggest checking a file you are about to open with the function is_gzipped, and then choosing an approach based on the result.

winni2k
  • 1,460
  • 16
  • 19
  • Unfortunately I can't assume that a file can be reopened. The "files" provided might be data from a stream, so any bytes read to detect magic numbers in a file will need to be stored in a sampling buffer. – gringer Jun 11 '17 at 21:24
  • Right. If you can get a hold of the initial buffer, then you should be able to peek at the first two bytes and feed them to `is_line_gzipped`. However, then you still have the problem of unzipping the stream with something like open2 if I am not mistaken? – winni2k Jun 11 '17 at 23:00
0

I think what I'm mostly struggling with is teasing apart the different bits of the diamond operator. I found some help in the Compress::Zlib documentation that seemed close to what I wanted to do, except it tries to uncompress everything (ending up with rubbish output for uncompressed files):

use strict ;
use warnings ;
use Compress::Zlib ;

# use stdin if no files supplied
@ARGV = '-' unless @ARGV ;

foreach my $file (@ARGV) {
    my $buffer ;

    my $gz = gzopen($file, "rb") 
         or die "Cannot open $file: $gzerrno\n" ;

    print $buffer while $gz->gzread($buffer) > 0 ;

    die "Error reading from $file: $gzerrno" . ($gzerrno+0) . "\n" 
        if $gzerrno != Z_STREAM_END ;

    $gz->gzclose() ;
}

Here's my modification to change to IO::Uncompress::Gunzip and get the transparent uncompression working:

#!/usr/bin/perl
use strict;
use warnings;

use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# use stdin if no files supplied
@ARGV = '-' unless @ARGV

foreach my $file (@ARGV) {
    my $z = new IO::Uncompress::Gunzip($file, "transparent", 1)
        or die "gunzip failed: $GunzipError\n";
    while(<$z>){
        print;
    }
    close($z);
}

This seems to work for just reading and writing files (i.e. like zcat), but I have yet to test it out in my scripts.

gringer
  • 410
  • 4
  • 13