12

This is my script:

#!/bin/bash
# Script to loop through directories and merge fastq files
sourcedir=/path/to/source
destdir=/path/to/dest

for f in "$sourcedir"/*
do
    fbase=$(basename "$f")
    echo "Inside $fbase"
    zcat "$f"/*R1*.fastq.gz | gzip > "$destdir/${fbase}_R1.fastq.gz"
    zcat "$f"/*R2*.fastq.gz | gzip > "$destdir/${fbase}_R2.fastq.gz"
done

There are about 30 sub-directories in the 'source' directory. Each sub-directory contains several R1.fastq.gz and R2.fastq.gz files that I want to merge into a single R1.fastq.gz and a single R2.fastq.gz file, and then save the merged files to the destination directory. My code works fine, but I need to speed it up because of the amount of data. Is there any way to implement something like multi-threading here, i.e. how can I run the script so that multiple jobs run in parallel? I'm new to bash scripting, so any help would be appreciated.

Leandro Papasidero
Komal Rathi
  • Since you are clearly dealing with bioinformatics you should read these: http://www.biostars.org/p/81359/ http://www.biostars.org/p/63816/ – Ole Tange Jan 02 '14 at 14:16
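
A hedged sketch of what those links point towards (Ole Tange, who left the comment, is the author of GNU Parallel): assuming GNU parallel is installed, the whole loop could be replaced by a single invocation. The paths below are the same placeholders as in the question, and -j 8 is only an example cap on simultaneous jobs.

# Example only: -j 8 caps the number of simultaneous jobs; adjust to the number of CPU cores
parallel -j 8 'zcat {}/*R1*.fastq.gz | gzip > /path/to/dest/{/}_R1.fastq.gz && zcat {}/*R2*.fastq.gz | gzip > /path/to/dest/{/}_R2.fastq.gz' ::: /path/to/source/*

Here {} expands to each sub-directory path and {/} to its basename, so the output names match the original script.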

2 Answers

9

The simplest way is to execute the commands in the background by adding & to the end of each command:

#!/bin/bash
# Script to loop through directories and merge fastq files
sourcedir=/path/to/source
destdir=/path/to/dest

for f in "$sourcedir"/*
do
    fbase=$(basename "$f")
    echo "Inside $fbase"
    zcat "$f"/*R1*.fastq.gz | gzip > "$destdir/${fbase}_R1.fastq.gz" &
    zcat "$f"/*R2*.fastq.gz | gzip > "$destdir/${fbase}_R2.fastq.gz" &
done

From the bash manual:

If a command is terminated by the control operator ‘&’, the shell executes the command asynchronously in a subshell. This is known as executing the command in the background. The shell does not wait for the command to finish, and the return status is 0 (true). When job control is not active (see Job Control), the standard input for asynchronous commands, in the absence of any explicit redirections, is redirected from /dev/null.
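
As a quick illustration of that paragraph (sleep 5 below is just a stand-in for any long-running command; $! and wait are standard bash features, not specific to this script):

sleep 5 &                                    # runs asynchronously; the shell moves on immediately
echo "started background job with PID $!"    # $! holds the PID of the most recent background job
wait                                         # block until all background jobs have finished
echo "all background jobs are done"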

Zero Piraeus
  • I was actually referring to other stackoverflow questions regarding this and they mention something like pid and wait and sort of things. So is adding an & at the end of the command an efficient way of parallelizing your jobs? – Komal Rathi Aug 22 '13 at 15:33
  • @user2703967 yes ... adding `&` spawns a new subshell which just goes away and does its thing while your script continues. If you need anything more sophisticated than that, you probably shouldn't be using bash in the first place. – Zero Piraeus Aug 22 '13 at 15:36
  • Thanks, one last question. When I use "wait" after "done", what difference does it make? – Komal Rathi Aug 22 '13 at 15:39
  • @user2703967 It'll wait for the background jobs to finish, and then carry on. For your example script, that makes no difference as the script is finished at that point anyway - if you want to do things with the result of the background jobs, you will need it (a sketch follows these comments). – Zero Piraeus Aug 22 '13 at 15:42
  • Thanks!! I will still need to look into an alternative to take advantage of the multiple processors that my computer has. Any ideas? – Komal Rathi Aug 22 '13 at 15:52
  • @user2703967 just let your OS deal with it (which it will). Unless you're doing stuff far too complex to even think about doing in bash, this really isn't a concern. – Zero Piraeus Aug 22 '13 at 16:01
  • Dude this was pure genius. And so OBVIOUS! Wow. Thanks a lot. – Robert Beltran Aug 09 '14 at 00:47
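
Picking up on the wait question from those comments, a minimal sketch (same placeholder paths as the answer above) of the loop with a single wait after done, so the script only continues or exits once every background pipeline has finished:

#!/bin/bash
sourcedir=/path/to/source
destdir=/path/to/dest

for f in "$sourcedir"/*
do
    fbase=$(basename "$f")
    zcat "$f"/*R1*.fastq.gz | gzip > "$destdir/${fbase}_R1.fastq.gz" &
    zcat "$f"/*R2*.fastq.gz | gzip > "$destdir/${fbase}_R2.fastq.gz" &
done
wait    # block here until every background zcat | gzip job has finished
echo "All merges complete"    # anything placed after wait runs only once the merges are done
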
3

I am not sure, but you can try using & at the end of the commands, like this:

zcat "$f"/*R1*.fastq.gz | gzip > "$destdir/${fbase}_R1.fastq.gz" &
zcat "$f"/*R2*.fastq.gz | gzip > "$destdir/${fbase}_R2.fastq.gz" &
tejas