I've written a few scripts for processing FASTA/FASTQ files (e.g. fastx-length.pl), but would like to make them more generic so that they accept both compressed and uncompressed files, both as command-line parameters and on standard input (so that the scripts "just work" when you throw random files at them). I commonly work with both uncompressed and compressed files (e.g. compressed read files, uncompressed assembled genomes), and slotting in things like <(zcat file.fastq.gz) quickly gets annoying.
Here's an example chunk from the fastx-length.pl script:
    ...
    my @lengths = ();
    my $inQual = 0; # false
    my $seqID = "";
    my $qualID = "";
    my $seq = "";
    my $qual = "";
    while(<>){
      chomp; chomp; # double chomp for Windows CR/LF on Linux machines
      if(!$inQual){
        if(/^(>|@)((.+?)( .*?\s*)?)$/){
          my $newSeqID = $2;
          my $newShortID = $3;
          if($seqID){
            printf("%d %s\n", length($seq), $seqID);
            push(@lengths, length($seq));
          }
    ...
I can see that IO::Uncompress::Gunzip supports transparent uncompression via the Transparent option:
If this option is set and the input file/buffer is not compressed data, the module will allow reading of it anyway.
In addition, if the input file/buffer does contain compressed data and there is non-compressed data immediately following it, setting this option will make this module treat the whole file/buffer as a single data stream.
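For a single, explicitly named input, I think that option would be used roughly as in the minimal sketch below. Note that open_maybe_gz is a hypothetical helper name (it is not part of fastx-length.pl), and MultiStream is an extra option I'm assuming is wanted so that concatenated gzip members are read to the end.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Hypothetical helper (not in fastx-length.pl): open one input that may or
# may not be gzip-compressed. Accepts a filename or an open filehandle
# such as \*STDIN.
sub open_maybe_gz {
    my ($input) = @_;
    my $z = IO::Uncompress::Gunzip->new(
        $input,
        Transparent => 1,    # pass non-gzip data through unchanged
        MultiStream => 1,    # assumption: keep reading concatenated gzip members
    ) or die "Cannot open input: $GunzipError\n";
    return $z;
}

# Read every file named on the command line, compressed or not
for my $file (@ARGV) {
    my $z = open_maybe_gz($file);
    while (defined(my $line = $z->getline())) {
        print $line;    # the per-line FASTA/FASTQ parsing would go here
    }
    $z->close();
}
```

That handles each named file explicitly, but it replaces the while(<>) loop instead of working through it, and standard input would need a separate open_maybe_gz(\*STDIN) call.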
I would like to basically slot a transparent uncompress into the diamond operator, between loading each file and reading a line from the file inputs. Does anyone know how I can do this?