Try the following xargs -P solution:
for h in "${BACKUP_HOSTS[@]}"; do
  # Retrieve the database names on this host, stripping quotes and commas.
  dbs=$(mongo --eval "db.getMongo().getDBNames()" --host "$h" | grep '"' | tr -d '",')
  for db in $dbs; do
    # Pipe the collection names to xargs, which appends each one after -c.
    mongo "$db" --host "$h" --quiet --eval "db.getCollectionNames()" | tr -d ',"[]' |
      xargs -P 0 -n 1 mongodump --host "$h" -q "{_id:{\$gt:$oid}}" -d "$db" --out "/data/daily/$TODAY/$h" -c
  done
done
xargs operates on stdin input only, whereas your solution attempt doesn't provide any stdin input; the solution above pipes the result of the collection name-retrieving mongo command directly to xargs.
- Note that this assumes that the collection names have neither embedded whitespace nor embedded \ chars.
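If names with whitespace or backslashes are a possibility, one option is to feed xargs NUL-delimited input instead. The following is only a sketch: it assumes the names arrive one per line and that your xargs supports -0 (GNU and BSD xargs do, but the option is not part of POSIX). list_names is a hypothetical stand-in for the mongo pipeline, and printf stands in for mongodump.

```shell
# Hypothetical stand-in for the mongo pipeline: emits one name per line.
list_names() {
  printf '%s\n' 'plain' 'has space' 'back\slash'
}

# Convert newlines to NULs so that xargs -0 takes each line verbatim,
# without interpreting whitespace, quotes, or backslashes.
list_names | tr '\n' '\0' | xargs -0 -n 1 printf '[%s]\n'
```

Each name is passed as exactly one argument, so the three lines printed are [plain], [has space] and [back\slash].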
-P 0 only works with GNU xargs, which interprets 0 as: "run as many processes as possible simultaneously" (I'm unclear on how that is defined).
Looping over command output with for is generally brittle.
- It will only work reliably if (a) each whitespace-separated word should be treated as its own argument and (b) these words do not contain globbing characters such as * - see this Bash FAQ entry.
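A more robust pattern is to read the command output line by line with while read. This is a minimal sketch that assumes one item per line; list_items and print_items are hypothetical names used for illustration only.

```shell
# Hypothetical stand-in for command output: one item per line,
# including an item with a space and one that is a glob character.
list_items() {
  printf '%s\n' 'alpha' 'two words' '*'
}

# Read line by line: IFS= preserves leading and trailing whitespace,
# -r keeps backslashes literal, and the quoted "$item" is never
# subjected to word splitting or globbing.
print_items() {
  list_items | while IFS= read -r item; do
    printf '[%s]\n' "$item"
  done
}

print_items
```

Unlike a for loop over unquoted command output, this keeps "two words" as a single item and leaves * unexpanded.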
Note how all variable references (except those known to contain numbers only) are double-quoted for robustness.
The modern command substitution syntax $(...), which is preferable, is used instead of the legacy syntax `...`.
As for the GNU parallel command, try the following variation, with stdin input from the collection name-retrieving mongo command, as above:
... | parallel -P 0 -N 1 -q mongodump --host "$h" -q "{_id:{\$gt:$oid}}" -d "$db" -c {1} --out "/data/daily/$TODAY/$h"
-N 1 rather than -n 1 allows you to control where the read-from-stdin argument is placed on the command line, using placeholder {1}.
-q ensures that complex commands containing double quotes are passed through to the shell correctly.
Shell variable references are double-quoted to ensure their use as-is.
Troubleshooting xargs / GNU parallel invocations:
Both xargs and GNU parallel support -t (GNU parallel: alias --verbose):
-t prints each command line, to stderr, just before it is launched.
- Caveat: With xargs, input quoting will not be reflected in the command printed, so you won't be able to verify argument boundaries as specified.
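One way to work around this caveat, offered as a sketch rather than a built-in xargs feature, is to temporarily substitute a command that brackets each argument it receives. Here printf plays that role, relying on the fact that it repeats its format string once per argument.

```shell
# xargs honors the double quotes in its input, so the target command
# receives two arguments here, not four whitespace-separated words.
echo '"echo 1; sleep 1" plain' | xargs printf '<%s>\n'
```

This prints <echo 1; sleep 1> and <plain> on separate lines, making the argument boundaries visible even though -t would not show them.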
xargs -t example:
$ time echo '"echo 1; sleep 1" "echo 2; sleep 2" "echo 3; sleep 1.5"' |
xargs -t -P 2 -n 1 sh -c 'eval "$1"' -
This yields something like:
sh -c eval "$1" - echo 1; sleep 1
sh -c eval "$1" - echo 2; sleep 2
2
1
sh -c eval "$1" - echo 3; sleep 1.5
3
real 0m2.535s
user 0m0.013s
sys 0m0.013s
Note:
The command lines lack the original quoting around the arguments, but invocation is still performed as originally specified.
They are printed (to stderr) immediately before the command is launched.
As discussed, output from the commands can arrive out of order and unpredictably interleaved.
Overall execution took about 2.5 seconds, which breaks down as follows:
Due to -P 2, the echo 1; ... and echo 2; ... commands ran in parallel, while the echo 3; ... command, being the 3rd one, was initially held back, because no more than 2 commands are permitted to run at a time.
After 1 second, the echo 1; ... command finished, dropping the count of running parallel processes down to 1, which triggered execution of the remaining echo 3; ... command.
Therefore, because the last command was started 1 second later and ran for 1.5 seconds, it finished after ca. 2.5 seconds (whereas the first 2 commands had already finished after 1 second and 2 seconds, respectively).
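The arithmetic behind that breakdown can be reproduced with a small simulation. This is only a sketch of the scheduling idea (each job starts, in input order, as soon as a slot is free), not a description of how xargs is implemented; finish_times is a hypothetical helper name.

```shell
# Usage: finish_times MAX_PROCS DURATION_SECONDS...
# Assigns each job, in order, to the slot that becomes free earliest.
finish_times() {
  max_procs=$1; shift
  printf '%s\n' "$@" | LC_ALL=C awk -v p="$max_procs" '
    BEGIN { for (i = 1; i <= p; i++) free[i] = 0 }
    {
      # Find the slot that becomes free earliest.
      best = 1
      for (i = 2; i <= p; i++) if (free[i] < free[best]) best = i
      start = free[best]
      free[best] = start + $1
      printf "job %d: start %.1fs, finish %.1fs\n", NR, start, free[best]
    }'
}

# Same setup as the example above: 2 slots, jobs of 1, 2 and 1.5 seconds.
finish_times 2 1 2 1.5
```

For these inputs, the third job is reported as starting at 1.0s and finishing at 2.5s, which matches the observed overall run time.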
GNU parallel -t example:
$ time echo $'echo 1; sleep 1\necho 2; sleep 2\necho 3; sleep 1.5' |
parallel -P 2 -q -t sh -c 'eval "$1"' -
sh -c eval\ \"\$1\" - echo\ 1\;\ sleep\ 1
sh -c eval\ \"\$1\" - echo\ 2\;\ sleep\ 2
1
sh -c eval\ \"\$1\" - echo\ 3\;\ sleep\ 1.5
2
3
real 0m2.768s
user 0m0.165s
sys 0m0.094s
Note:
Because the command uses quotes to demarcate arguments and that demarcation must be passed through to sh, -q must also be specified.
The \-quoting may look unusual, but it is correct shell quoting and reflects the command exactly as invoked behind the scenes.
GNU parallel expects arguments to be line-separated by default, so the shell commands are passed on individual lines using an ANSI C-quoted string ($'...') with \n escape sequences.
Overall processing takes longer than with xargs, which is the (probably negligible) price you pay for GNU parallel's additional features and technological underpinning (Perl script).
One of these additional features is the aforementioned output serialization (grouping): the 1st command's output came predictably first, even though it and the 2nd command launched at the same time; the 2nd command's output wasn't printed until after it finished (which is why the diagnostic printing of the 3rd command line showed first).
GNU parallel additionally supports --dry-run, which only prints the commands, to stdout, without actually running them.
$ echo $'echo 1; sleep 1\necho 2; sleep 2\necho 3; sleep 1.5' |
parallel -P 2 -q --dry-run sh -c 'eval "$1"' -
sh -c eval\ \"\$1\" - echo\ 1\;\ sleep\ 1
sh -c eval\ \"\$1\" - echo\ 2\;\ sleep\ 2
sh -c eval\ \"\$1\" - echo\ 3\;\ sleep\ 1.5