
I have a bash script that looks like this:

#!/bin/bash
wget LINK1 >/dev/null 2>&1
wget LINK2 >/dev/null 2>&1
wget LINK3 >/dev/null 2>&1
wget LINK4 >/dev/null 2>&1
# ..
# ..
wget LINK4000 >/dev/null 2>&1

But processing each line sequentially, waiting for each command to finish before moving on to the next one, is very time consuming. I want to process, for instance, 20 lines at once, and when they're finished, process the next 20 lines.

I thought of using wget LINK1 >/dev/null 2>&1 & to send each command to the background and carry on, but there are 4000 lines here, which means I would have performance issues, not to mention the limit on how many processes I should start at the same time, so this is not a good idea.

One solution I'm thinking of right now is checking whether any of the commands is still running; for instance, after every 20 lines I could add this loop:

while [ $(ps -ef | grep KEYWORD | grep -v grep | wc -l) -gt 0 ]; do
    sleep 1
done

Of course in that case I would need to append & to the end of each line! But I feel this is not the right way to do it.

So how do I actually group each 20 lines together and wait for them to finish before moving on to the next 20 lines? The script is dynamically generated, so I can do whatever math I want on it while it's being generated. It DOES NOT have to use wget; that was just an example, so any wget-specific solution is not going to do me any good.

AL-Kateb (edited by tripleee)
  • `wait` is the right answer here, but your `while [ $(ps …` would be much better written `while pkill -0 $KEYWORD…` – using [proctools](http://sourceforge.net/projects/proctools/)… that is, for legitimate reasons to check if a process with a specific name is still running. – kojiro Oct 23 '13 at 13:46
  • I think this question should be re-opened. The "possible duplicate" QA is all about running a _finite_ number of programs in parallel. Like 2-3 commands. This question, however, is focused on running commands in e.g. a loop. (see "but there are 4000 lines"). – VasiliNovikov Jan 11 '18 at 19:01
  • @VasyaNovikov Have you *read* **all** the answers to both this question and the duplicate? Every single answer to this question here can also be found in the answers to the duplicate question. That is *precisely* the definition of a duplicate question. It makes absolutely no difference whether or not you are running the commands in a loop. – robinCTS Jan 11 '18 at 23:08
  • @robinCTS there are intersections, but questions themselves are different. Also, 6 of the most popular answers on the linked QA deal with 2 processes only. – VasiliNovikov Jan 12 '18 at 04:09
  • @robinCTS to address the "in loop" question. Yes, it does matter whether you run 4000 programs in loop or only 2 programs in parallel. In case of 4000, you can't start them all in parallel, as has been explained in detail on this question. – VasiliNovikov Jan 12 '18 at 04:14
  • @VasyaNovikov There are so many points about your comments that need addressing that I could write up a decent meta question/answer pair. I will, however, try to compress them down to fit in a couple of comments, or so ;) [in general]: 1) You have misunderstood the concept of duplicates on Stack Overflow. It doesn't matter if the questions are "different". If an answer to the target question (TQ) can be *used* to solve the current question (CQ), then the CQ is a duplicate of the TQ. [in this case]: 2) The questions are actually identical! From the TQ's title… – robinCTS Jan 12 '18 at 17:24
  • … *"How do you run multiple programs in parallel from a bash script?"* and first sentence *"I am trying to write a .sh file that runs many programs simultaneously"* we can extract the key words **"bash run many programs in parallel"**. From the CQ's title *"Bash script processing commands in parallel"* and third sentence *"…there are 4000 lines…"* we get **"bash process 4000 commands in parallel"**. So, just a quick look at the questions shows that they must be duplicates of each other! [re your first comment]: 3) **If** you are claiming *"running … programs "* is different to… – robinCTS Jan 12 '18 at 17:24
  • … *"running commands"*, that is incorrect. Running a program *is* a command, and a solution for the former will work for the latter. (There's confusion in your comment, as you use programs vs commands in the comparison and yet interchange programs and commands in the TQ part.) 4) Claiming TQ says *"finite … Like 2-3 commands"* and thus is different to CQ's *"there are 4000"* is incorrect as I have already shown in point 2 above. TQ actually says *"many"*. The example only shows 2 programs for simplicity's (MCVE) sake. 5) Claiming TQ is about *"in parallel"* whilst CQ is about… – robinCTS Jan 12 '18 at 17:24
  • … *"in e.g. a loop"* is incorrect as I've partly shown in point 2. Both are about running in parallel. As for loops, **the CQ has nothing to do with loops**! The only mention of loops is when the OP suggests a possible solution using a *sleep* loop. Even if the CQ/example script used loops, every loop can be unrolled, and every flat set of commands can be rolled up into a loop (see TQ's [A#7](//stackoverflow.com/a/33106658)). [re your second comment]: 6) The questions are not just similar, but identical, as I've already shown. 7) It is irrelevant that six of the TQ answers only deal with… – robinCTS Jan 12 '18 at 17:24
  • …two "commands". It is trivial to extend them for more commands. [re your third comment]: 8) The comparison between *"4000"* & *"2"*, and between *"loop"* & *"parallel"* is invalid as I have previously shown. 9) Finally we get to the only valid *technical* difference between the two questions. The CQ *explicitly* states the performance/limited process issues, whilst for the TQ, these issues are not explicitly stated in the question itself, but are addressed in **some of the answers**. Wait a minute… Let me repeat that so we are all clear:… – robinCTS Jan 12 '18 at 17:25
  • … **the only (technical) extra requirement you can justify is actually addressed in the TQ answers**! There's at least three different solutions to the performance/process issue in the TQ: `wait`, `parallel`, and `xargs`. For `wait`, the CQ's [accepted answer](//stackoverflow.com/a/19543185) and [A#3](//stackoverflow.com/a/19543339) can be derived from TQ's [A#7](//stackoverflow.com/a/33106658), [A#8](//stackoverflow.com/a/41762802) & [A#9](//stackoverflow.com/a/42098494); for `parallel`, CQ's [A#2](//stackoverflow.com/a/19543286) is the same but less useful than… – robinCTS Jan 12 '18 at 17:25
  • …TQ's [A#5](//stackoverflow.com/a/3018124); and for `xargs`, CQ's [A#4](//stackoverflow.com/a/38047372) is no better than TQ's [A#4](//stackoverflow.com/a/3014583) & [A#10](//stackoverflow.com/a/44124618). In other words, duplicate questions *and* duplicate answers! – robinCTS Jan 12 '18 at 17:25
  • @robinCTS I still think that running 2 commands and 4000 are two completely different tasks, because most solutions for one problem cannot be used on the other problem. I have no other justification than I already wrote, explaining further would not help. – VasiliNovikov Jan 12 '18 at 21:42
  • @VasyaNovikov You might be right that, in general, a 2-command solution cannot be used for a 4000-command one, or vice versa. Without a lot of data on a lot of questions I wouldn't be able to hazard a guess as to whether it is or is not. However, duplicates are resolved on the merits of the two actual questions, ***not*** for the general case. On top of which, as I have already proven, in this case it is not even true that one question is about 2 commands and the other 4000. Whilst most of the TQ's answers *are* demonstrated by using two commands, #4 uses the term *"a batch of"*, #7 states… – robinCTS Jan 13 '18 at 00:03
  • …that it can be used for more, #9 demonstrates the case for 9 in total, with a maximum of 4 in parallel, and implies that the 4 can be increased, and #10 demonstrates 3 commands, also implying that this can be increased. *All* of the TQ's answers can easily be extended for a lot more commands. If you look at the CQ's answers, only one has an actual example (for 8 commands). By your definition, none of these actually answer the 4000-command CQ! – robinCTS Jan 13 '18 at 00:03
  • @VasyaNovikov Let me put it another way. The CQ has been summarised in this sentence at the end: *"how do I actually group each 20 lines together and wait for them to finish before going to the next 20 lines"*. Three of the TQ's answers say to use `wait`, and demonstrate three different ways how to do so. Stack Overflow is not a hand-holding, write-every-last-scrap-of-code, site. Telling the OP what to use, demonstrating it for the case of two commands, and letting the OP trivially extend that for more cases, is a perfectly valid answer. – robinCTS Jan 13 '18 at 00:04
  • @VasyaNovikov I have presented a reasonable, logically argued, case proving that all your assertions (bar one) are false. Thus any conclusions based on those cannot be claimed to be true. For the *technically* true assertion, I have shown that multiple TQ answers address that detail. You have ignored all my points and simply re-stated your position. All that indicates is that what you are stating is a (false) irrational *belief*, or that you are trolling, not that you have a well thought-out position based on any actual facts. You are right that explaining further would not help… – robinCTS Jan 13 '18 at 00:04
  • …What *would* help would be if you actually read through all my points and showed me where I have erred. Unfortunately, as the old saying goes, *"One can only lead a horse to water. One cannot make him drink it."*. – robinCTS Jan 13 '18 at 00:04
  • I recommend reopening this question because its answer is clearer, cleaner, better, and much more highly upvoted than the answer at the linked question, though it is three years more recent. – Dan Nissenbaum Apr 20 '18 at 15:35
  • I have removed the duplicate link. For the record, this was earlier marked as a duplicate of https://stackoverflow.com/questions/3004811/how-do-you-run-multiple-programs-in-parallel-from-a-bash-script – tripleee Jun 07 '18 at 16:01

4 Answers


Use the wait built-in:

process1 &
process2 &
process3 &
process4 &
wait
process5 &
process6 &
process7 &
process8 &
wait

For the above example, 4 processes process1 ... process4 would be started in the background, and the shell would wait until those are completed before starting the next set.

From the GNU manual:

wait [jobspec or pid ...]

Wait until the child process specified by each process ID pid or job specification jobspec exits and return the exit status of the last command waited for. If a job spec is given, all processes in the job are waited for. If no arguments are given, all currently active child processes are waited for, and the return status is zero. If neither jobspec nor pid specifies an active child process of the shell, the return status is 127.
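
For the batch-of-20 case from the question, here is a minimal sketch of the same idea in a loop (the links array and its contents are illustrative placeholders, not part of the original script):

#!/bin/bash
# Illustrative: links holds the 4000 URLs.
links=(LINK1 LINK2 LINK3 LINK4)   # ... up to LINK4000
batch=20
i=0
for link in "${links[@]}"; do
    wget "$link" >/dev/null 2>&1 &
    (( ++i % batch == 0 )) && wait   # after every 20 jobs, wait for the whole batch
done
wait   # wait for the final, possibly partial, batch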

devnull (edited by Augustin)
  • So basically `i=0; waitevery=4; for link in "${links[@]}"; do wget "$link" & (( i++%waitevery==0 )) && wait; done >/dev/null 2>&1` – kojiro Oct 23 '13 at 13:48
  • Unless you're sure that each process will finish at the exact same time, this is a bad idea. You need to start up new jobs to keep the current total jobs at a certain cap. [parallel](http://stackoverflow.com/a/19543286/406281) is the answer. – rsaw Jul 18 '14 at 17:26
  • Is there a way to do this in a loop? – DomainsFeatured Sep 13 '16 at 22:55
  • I've tried this but it seems that variable assignments done in one block are not available in the next block. Is this because they are separate processes? Is there a way to communicate the variables back to the main process? – Bobby Apr 27 '17 at 07:55

See parallel. Its syntax is similar to xargs, but it runs the commands in parallel.
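
For instance, a sketch assuming GNU parallel and that the links are listed one per line in a hypothetical links.txt:

# Keep at most 20 wget jobs running; parallel starts a new job
# as soon as one finishes, rather than waiting out a whole batch.
parallel -j 20 wget {} < links.txt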

choroba
  • This is better than using `wait`, since it takes care of starting new jobs as old ones complete, instead of waiting for an entire batch to finish before starting the next. – chepner Oct 23 '13 at 14:35
  • For example, if you have the list of links in a file, you can do `cat list_of_links.txt | parallel -j 4 wget {}` which will keep four `wget`s running at a time. – Mr. Llama Aug 13 '15 at 19:30
  • There is a new kid in town called [pexec](https://www.gnu.org/software/pexec/) which is a replacement for `parallel`. – slashsbin Nov 02 '15 at 21:42
  • @Mr.Llama, ...`xargs -P 4 -n 1 wget` would do the same, and without bringing in a big mess of perl into your dependency chain. (Why yes, I *am* a language chauvinist). – Charles Duffy Jul 28 '18 at 01:56
  • Providing an example would be more helpful – jterm Jan 11 '19 at 21:04
  • `parallel --jobs 4 < list_of_commands.sh`, where list_of_commands.sh is a file with a single command (e.g. `wget LINK1`, note without the `&`) on every line. May need to do `CTRL+Z` and `bg` after to leave it running in the background. – weiji14 Mar 02 '20 at 23:01

In fact, xargs can run commands in parallel for you. There is a special -P max_procs command-line option for that. See man xargs.
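
A sketch along those lines (again assuming a hypothetical links.txt with one URL per line):

# -P 20: up to 20 wget processes at a time; -n 1: one URL per invocation.
xargs -P 20 -n 1 wget < links.txt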

Vader B
  • +100 this is great since it is built in, very simple to use, and can be done in a one-liner – Clay Jan 23 '19 at 19:44
  • Great to use for small containers, as no extra packages/dependencies are needed! – Marco Roy Sep 05 '19 at 00:36
  • See this question for examples: https://stackoverflow.com/questions/28357997/running-programs-in-parallel-using-xargs – Marco Roy Sep 05 '19 at 00:37

You can start 20 processes in the background and then use the command:

wait

Your script will wait until all your background jobs are finished, then continue.
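
Since the script is generated dynamically, another option is to have the generator itself emit a wait after every 20 lines. A sketch of such a generator (the loop and the LINK placeholders are illustrative):

#!/bin/bash
# Emit 'wget LINKn ... &' lines, inserting 'wait' after every 20 jobs.
for n in $(seq 1 4000); do
    echo "wget LINK$n >/dev/null 2>&1 &"
    (( n % 20 == 0 )) && echo "wait"
done
echo "wait"   # harmless if the total is already a multiple of 20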

Binpix