
I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it! I am using a loop that goes through my read files and generates the desired output. The command that is executed for each pair of read files looks something like this:

STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq

As you can see, I specified 8 threads, which is the number of threads my virtual machine has.

My question now is the following: if I use GNU parallel with a command like this:

cat reads | parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq

Can my virtual machine handle the number of threads I specified, if I execute 3 jobs in parallel?

Or do I need 24 threads (3*8 threads) to properly execute this command?
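For example (and this is just my guess at what the alternative would look like), would I have to lower the thread count per job so that the three jobs together stay within my 8 threads, something like this?

# 3 jobs * 2 threads each = 6 threads, which stays below my 8?
cat reads | parallel -j 3 STAR --runThreadN 2 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq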

I'm sorry if this is a basic question, but I am very new to the field and any help is much appreciated!

nhaus
  • If you look at your VM configuration, it most likely has a number of **cores** allocated to it rather than **threads**, which are more of a software construct. You can run more threads than you have CPU cores if your task is more I/O-bound, but if your task is compute-bound there is little point. – Mark Setchell May 05 '20 at 11:35

1 Answer


The best advice is simply: Try different values and measure.

In parallelization there are sooo many factors that can affect the results: Disk I/O, shared CPU cache, and shared RAM bandwidth just to name three.

`top` is your friend when measuring. If you can manage to get all CPUs to have <5% idle, you are unlikely to go any faster - no matter what you do.

top - 14:49:10 up 10 days,  5:48, 123 users,  load average: 2.40, 1.72, 1.67
Tasks: 751 total,   3 running, 616 sleeping,   8 stopped,   4 zombie
%Cpu(s): 17.3 us,  6.2 sy,  0.0 ni, 76.2 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem :   31.239 total,    1.441 free,   21.717 used,    8.081 buff/cache
GiB Swap:  117.233 total,  104.146 free,   13.088 used.    4.706 avail Mem 

This machine is 76.2% idle. If your processes use loads of CPU, then starting more processes in parallel here may help. If they use loads of disk I/O, it may or may not help. The only way to know is to test and measure.

top - 14:51:00 up 10 days,  5:50, 124 users,  load average: 3.41, 2.04, 1.78
Tasks: 759 total,   8 running, 619 sleeping,   8 stopped,   4 zombie
%Cpu(s): 92.8 us,  6.9 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
GiB Mem :   31.239 total,    1.383 free,   21.772 used,    8.083 buff/cache
GiB Swap:  117.233 total,  104.146 free,   13.087 used.    4.649 avail Mem 

This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
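If you do not want to sit in interactive `top` while your jobs run, you can grab the same %Cpu(s) line in batch mode. A minimal sketch, assuming the procps version of `top` (adjust for your system):

# one snapshot of the CPU summary line; the 'id' field is the idle percentage
top -bn1 | grep '%Cpu'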

So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (`--joblog my.log` can be useful to see how long a job takes).
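A rough sketch of such a sweep, assuming your sample prefixes are in a file called reads (file names are just examples, and you would normally benchmark on a small subset of your reads rather than the full data):

# try a few -j values and record per-job runtimes in separate logs
for j in 1 2 3 4; do
  cat reads | parallel -j $j --joblog star_j$j.log \
    STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
done

The 4th column of each joblog (JobRuntime) shows how long every job took, so you can compare the settings and keep the fastest.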

And yes: GNU Parallel is likely to speed up bioinformatics (being written by a fellow bioinformatician).

Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html ; download: https://doi.org/10.5281/zenodo.1146014). Read at least chapters 1+2. It should take you less than 20 minutes. Your command line will love you for it.

Ole Tange
  • Thank you for that answer! I have what I think is a really basic follow-up question: If I run this: ``STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq`` and do ``top`` to check what is happening, it says that ``STAR`` takes up 800% of my CPU. Does that mean that each of my threads is completely busy (8*100%)? Does that in turn mean that, if I use GNU parallel and 3 jobs with the same name (e.g. HISAT2) get started but each of these jobs only takes ~135%, half of my CPU is idle, because 3*135 = 405? Thanks a lot! – nhaus May 07 '20 at 12:21
  • The 3rd line in `top` shows `%Cpu(s): 8.4 us, 3.7 sy, 0.0 ni, 86.9 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st`. Here 86.9 means that 86.9% of my CPU power is idle. If that is <5% then it is unlikely that parallelizing more will help. – Ole Tange May 07 '20 at 12:45