The best advice is simply: Try different values and measure.
In parallelization there are so many factors that can affect the results: disk I/O, shared CPU cache, and shared RAM bandwidth, just to name three.
top is your friend when measuring. If you can manage to get all CPUs to have <5% idle, you are unlikely to go any faster, no matter what you do.
top - 14:49:10 up 10 days, 5:48, 123 users, load average: 2.40, 1.72, 1.67
Tasks: 751 total, 3 running, 616 sleeping, 8 stopped, 4 zombie
%Cpu(s): 17.3 us, 6.2 sy, 0.0 ni, 76.2 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
GiB Mem : 31.239 total, 1.441 free, 21.717 used, 8.081 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.088 used. 4.706 avail Mem
This machine is 76.2% idle. If your processes use loads of CPU, then starting more processes in parallel here may help. If they use loads of disk I/O, it may or may not help. The only way to know is to test and measure.
top - 14:51:00 up 10 days, 5:50, 124 users, load average: 3.41, 2.04, 1.78
Tasks: 759 total, 8 running, 619 sleeping, 8 stopped, 4 zombie
%Cpu(s): 92.8 us, 6.9 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
GiB Mem : 31.239 total, 1.383 free, 21.772 used, 8.083 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.087 used. 4.649 avail Mem
This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
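If you want to read the idle figure in a script rather than by eye, something like the following may work. It is only a sketch: the helper name cpu_idle is made up for this example, and it assumes your top prints a "%Cpu(s):" summary line in the same layout as the output shown here (other top versions and locales format this line differently).

```shell
# Hypothetical helper: print the idle percentage found on the last
# "%Cpu(s):" line read from stdin (layout as in the top output above).
cpu_idle() {
  awk '/^%Cpu\(s\):/ && match($0, /[0-9.]+ *id/) {
         idle = substr($0, RSTART, RLENGTH)   # e.g. "76.2 id"
         sub(/ *id$/, "", idle)               # keep only the number
       }
       END { if (idle != "") print idle }'
}

# Usage (assumes procps-style top with batch mode):
#   top -b -n 2 -d 1 | cpu_idle
# Two iterations are requested and the last one is used, because the first
# sample may not reflect the current load.
```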
So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (--joblog my.log can be useful to see how long a job takes).
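To compare runs, you could average the runtimes recorded in the joblog. The sketch below assumes the usual tab-separated joblog layout with a header line and JobRuntime (in seconds) in the 4th column. The function name joblog_mean_runtime, the command my_command, and the input glob are placeholders for this example, not something GNU Parallel provides.

```shell
# Hypothetical helper: print the mean JobRuntime (4th column, seconds)
# of a GNU Parallel joblog, skipping the header line.
joblog_mean_runtime() {
  awk -F'\t' 'NR > 1 { sum += $4; n++ }
              END { if (n) printf "%.3f\n", sum / n }' "$1"
}

# Usage sketch (my_command and inputs* are placeholders):
#   for j in 1 2 4 8; do
#     parallel -j "$j" --joblog "j$j.log" my_command ::: inputs*
#     printf '%s jobs: ' "$j"; joblog_mean_runtime "j$j.log"
#   done
```

Keep in mind that the mean runtime per job is only one of the two measures mentioned above; total wall-clock time and idle CPU are worth watching too.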
And yes: GNU Parallel is likely to speed up bioinformatics (being written by a fellow bioinformatician).
Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html download: https://doi.org/10.5281/zenodo.1146014). Read at least chapters 1+2. It should take you less than 20 minutes. Your command line will love you for it.