
I am using this script to concatenate the reads from my samples. Each sub-directory contains several R1.fastq.gz and R2.fastq.gz files that I want to merge into a single R1.fastq.gz file and a single R2.fastq.gz file.

sourcedir=/sourcepath/
destdir=/destinationpath/

for f in $sourcedir/*
do
  fbase=$(basename "$f")
  echo "Inside $fbase"
  zcat $f/*R1*.fastq.gz | gzip >$destdir/"$fbase"_R1.fastq.gz 
  zcat $f/*R2*.fastq.gz | gzip >$destdir/"$fbase"_R2.fastq.gz

done

I want to validate that the R1 and R2 reads were each concatenated correctly by comparing the total number of lines in the individual fastq.gz files with the total number of lines in the merged file.

 wc -l *R1*.fastq.gz (Individual files)
 12832112 total

 wc -l Sample_51770BL1_R1.fastq.gz  (merged file)
 Total: 10397604

Shouldn't the numbers be equal in both cases, or is there another way to validate that the files were merged correctly?

Also, is there any way to speed up the process? I tried using & as described in How do I use parallel programming/multi threading in my bash script?, but it's not running at all.

zcat $f/*R1*.fastq.gz | gzip >$destdir/"$fbase"_R1.fastq.gz &
zcat $f/*R2*.fastq.gz | gzip >$destdir/"$fbase"_R2.fastq.gz &
Rgeek

1 Answer


You're running wc -l on the compressed .gz files, so you're counting lines in the compressed binary data, which is not what you want. To verify, decompress first and count the lines of the actual FASTQ data:

zcat *R1*.fastq.gz | wc -l
zcat Sample_51770BL1_R1.fastq.gz | wc -l
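(A FASTQ record is exactly four lines, so each of these counts should be a multiple of four, and the merged file's count should equal the sum of the individual files' counts.)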

Although, for that kind of verification, you might want to use a proper checksum algorithm instead, e.g. with the sha256sum tool.
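
A minimal sketch of the checksum idea (the sample paths are illustrative, following the directory layout from the question); if both commands print the same digest, the merged file decompresses to exactly the same bytes as the concatenation of the inputs:

# checksum of the decompressed inputs, in the same glob order used for merging
zcat /sourcepath/Sample_51770BL1/*R1*.fastq.gz | sha256sum
# checksum of the decompressed merged file
zcat /destinationpath/Sample_51770BL1_R1.fastq.gz | sha256sum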


As for parallelising: you can parallelise the decompression, but not the compression, since everything is written into one output stream (file) sequentially. For example like this:

sourcedir=/sourcepath/
destdir=/destinationpath/

for f in $sourcedir/*; do
        fbase=${f##*/}
        echo "Inside $fbase"
        for R in 1 2; do
                # decompress all input files for this R in parallel
                for xf in "$f"/*R$R*.fastq.gz; do
                        gzip -dc <"$xf" >"${xf%.gz}" &
                done
                wait
                # concatenate the decompressed files and recompress
                cat "$f"/*R$R*.fastq | gzip -n9 >"$destdir/${fbase}_R$R.fastq.gz"
                rm -f "$f"/*R$R*.fastq
        done
done

The problem with this approach is that the intermediate decompressed files have to be written to disk (or other temporary storage), which is, in general, much slower than not parallelising the decompression at all. Also, you cannot parallelise between R1 and R2 that way.

Another option is this, parallelising only between the Rs and the fs (gut feeling says this should give the best results achievable without bending over backwards too much):

sourcedir=/sourcepath/
destdir=/destinationpath/

for f in $sourcedir/*; do
        fbase=${f##*/}
        echo "Inside $fbase"
        for R in 1 2; do
                # one decompress-recompress pipeline per R, run in the background
                zcat "$f"/*R$R*.fastq.gz | gzip -n9 >"$destdir/${fbase}_R$R.fastq.gz" &
        done
done
wait
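
If you have many sample directories, this launches one pipeline per R per directory all at once, which can oversubscribe your cores. A minimal sketch of the same loop with a simple job cap (assuming bash 4.3 or newer for wait -n; maxjobs=4 is an arbitrary example value):

maxjobs=4
for f in $sourcedir/*; do
        fbase=${f##*/}
        for R in 1 2; do
                # block until a job slot frees up before launching the next pipeline
                while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do
                        wait -n
                done
                zcat "$f"/*R$R*.fastq.gz | gzip -n9 >"$destdir/${fbase}_R$R.fastq.gz" &
        done
done
wait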

Hope this helps!

mirabilos
  • 51770BL1_GATCAG_L001_R1_001.fastq.gz and 51770BL1_GATCAG_L001_R1_002.fastq.gz are two examples of the files I am merging, so I am using the glob *R1*, but I did not understand *R$R* in your case, or R$R in the second code? – Rgeek Dec 23 '13 at 22:43
  • Ok, but what does *R$R*.fastq.gz mean, as my file is 51770BL1_GATCAG_L001_R1_001.fastq.gz? – Rgeek Dec 24 '13 at 03:26
  • Does this for loop run all the commands on the same processor or on different ones? The reason I ask is that I am running this script: sourcedir=/sourcepath/ destdir=/destinationpath/ for fname in *_R1.fastq.gz; do base=${fname%_R1*} bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" > "$destdir/${base}_R1_R2.sam" & done wait. Here the for loop is running out of memory, since it puts all these processes on one processor. – Rgeek Jan 07 '14 at 20:33
  • I have access to the titan cluster, which consists of 464 HP blade systems, two head nodes and a virtualized pool of login (submit) nodes. Each node has eight cores (two quad-core processors) and either 16GB (430 nodes) or 32GB (34 nodes) of memory. This provides 3712 compute cores and 8 TB of total RAM. So, do I need an inner for loop as well in my script? – Rgeek Jan 07 '14 at 20:40
  • Oh wow. But: How are processes scheduled on that? Normally, on Linux at least, separate blades run separate OSes, so this will only ever run on one blade, so only 8 cores. – mirabilos Jan 08 '14 at 10:38
  • I am not really sure how this works, but if we have so many jobs, we should have each of them running on a separate processor, so that one job does not have to wait for another to finish and all jobs can run in parallel. I think an array job is the one that should be able to do this. The for loop did work for zcat, however, as many parallel jobs ran while merging the fastq.gz files. – Rgeek Jan 08 '14 at 15:42
  • Can you elaborate on the actual scheduling of the processes? I did not get that. – Rgeek Jan 08 '14 at 16:02
  • That is system-specific. Get this information from the administrators of your “titan cluster”. – mirabilos Jan 08 '14 at 16:03
  • gzip -n9 – does it parallelise the process? If I don't want to parallelise, can I just remove -n9? – Rgeek Apr 27 '17 at 21:29
  • No, it does **not** parallelise the process, it ensures maximum compression and minimum protocol overhead. [RTFM](http://www.mirbsd.org/man1/gzip) for more information. While you _could_, in theory, remove it, that’d degrade things. – mirabilos Apr 28 '17 at 04:07