
Editor's note: This is a follow-up question to a more general one about running a specified number of commands in parallel.

I'm trying to run this mongodb backup script for 20 mongodb servers.

#!/bin/bash
#daily backup for mongo db's

BACKUP_HOSTS=(example.com staging.example.com prod.example.com example1.com)
#d=$(date +%Y-%m-%dT%H:%M:%S --date "-65 days")
d=$(date +%Y-%m-%dT%H:%M:%S --date "-5 days")
oid=$(mongo --quiet --eval "ObjectId.fromDate(ISODate('$d'))")

cd /data/daily/
rm -r /data/daily/*

TODAY=$(date +"%y-%m-%d")
mkdir "$TODAY"

cd $TODAY

#create subfolders

for HOST in ${BACKUP_HOSTS[@]}
do
        mkdir $HOST
done

#extract mongo dumps

echo "$(date '+%Y-%m-%d %H:%M:%S') start retrieving Mongodb backups"

for h in ${BACKUP_HOSTS[@]};do
    dbs=`mongo --eval "db.getMongo().getDBNames()" --host $h | grep '"' | tr -d '",' `
    for db in $dbs; do
       col=`mongo  $db --host $h --quiet --eval "db.getCollectionNames()" | tr -d ',"[]' `
       for collection in $col; do
            xargs -P 0 -n 1 mongodump --host $h -q "{_id:{\$gt:$oid}}" -d $db -c $collection --out /data/daily/$TODAY/$h
       done
    done
done

But it is not working.

I also tried:

parallel -P 0 -n 1 mongodump --host $h "{_id:{\$gt:$oid}}" -d $db -c $collection --out /data/daily/$TODAY/$h

but I get:

bin/bash: -c: line 0: syntax error near unexpected token `('
mklement0
basante
  • General hint to debug similar syntax errors in the future: place `set -x` near the top of the script and you'll immediately see where the extra `(` comes from. – Jens May 23 '17 at 13:12
  • @Jens: That's good advice in general, but doesn't help in this case, because the error message is being issued by GNU Parallel, which invokes `bash` _behind the scenes_. – mklement0 May 23 '17 at 20:00
  • @mklement0 There's no GNU parallel used in the first script and it is not clear the error message doesn't happen in both. My suspicion is, the `$collection` has unquoted parentheses. – Jens May 24 '17 at 06:33
  • 1
    @Jens: Good guess re `$collection`, but the `-c` and `line 0` in the error message tell us that it's not the first script that produced the error. A simple way to provoke the error: `parallel echo 'a(' ::: 1`. Note how the shell command itself _is_ syntactically valid, but the way Parallel invokes it by default causes it to break. Option `-q` fixes that. – mklement0 May 24 '17 at 12:05

3 Answers

1

Try

mongodump --host $h -q "{_id:{\$gt:$oid}}" -d $db -c $collection --out /data/daily/$TODAY/$h &

The & at the end runs the command in the background, so each command started by the loop runs in parallel with the previous ones. Take a look at this too.
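As a minimal sketch of the `&` approach, the following uses a hypothetical `fake_dump` function as a stand-in for `mongodump` (the collection names are made up too). It also adds `wait`, which the script needs if anything afterwards depends on the dumps being complete:

```shell
#!/bin/bash
# fake_dump is a hypothetical stand-in for mongodump: it sleeps, then reports.
fake_dump() { sleep 1; echo "dumped $1"; }

start=$SECONDS
for collection in users orders logs; do
    fake_dump "$collection" &   # each call is started in the background
done
wait                            # block until all background jobs have finished
elapsed=$((SECONDS - start))
echo "elapsed: ${elapsed}s"     # far less than the 3s a sequential run would need
```

Note that this starts one job per collection with no upper limit, so with many collections it can overload the client or the server.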

That said, I would advise you to always enclose variable references in double quotes, like "$var"; otherwise the shell word-splits and glob-expands the value, which can break your command in surprising ways. For example, this error:

bin/bash: -c: line 0: syntax error near unexpected token `('

seems to be caused by special characters in the value of $collection.

Therefore, a safer version would be:

mongodump --host "$h" -q "{_id:{\$gt:$oid}}" -d "$db" -c "$collection" --out /data/daily/"$TODAY"/"$h" &

You can check out here why and when to use double quotes for more details.
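To illustrate the quoting advice, here is a small sketch that counts how many arguments a command receives with and without double quotes. The collection name and the `count_args` helper are hypothetical, chosen only to show the effect:

```shell
#!/bin/bash
# Hypothetical collection name containing a space and parentheses.
collection='my coll(1)'

# Helper (illustration only): prints the number of arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args $collection)     # value is word-split on the space
quoted=$(count_args "$collection")     # value is passed as a single argument

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

Unquoted, the single name is split into two arguments, so a command such as mongodump would receive a wrong collection name plus a stray extra argument.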

Leajian
  • It's not explicit in the question, but the OP wants to have an _automated_ way of running _as many jobs as possible in parallel_, without having to manually manage the partitioning into batches that can run simultaneously and waiting for jobs to end so that new ones can be started (which is what you'd have to do with `&`). Good advice re double-quoting. – mklement0 May 22 '17 at 21:25
0

Try the following xargs -P solution:

for h in "${BACKUP_HOSTS[@]}";do
  dbs=$(mongo --eval "db.getMongo().getDBNames()" --host "$h" | grep '"' | tr -d '",')
  for db in $dbs; do
    mongo "$db" --host "$h" --quiet --eval "db.getCollectionNames()" | tr -d ',"[]' |
      xargs -P 0 -n 1 mongodump --host "$h" -q "{_id:{\$gt:$oid}}" -d "$db" --out "/data/daily/$TODAY/$h" -c 
  done
done
  • xargs operates on stdin input only, whereas your solution attempt doesn't provide any stdin input; the solution above pipes the result of the collection name-retrieving mongo directly to xargs.

    • Note that this assumes that the collection names have neither embedded whitespace nor embedded \ chars.
  • -P 0 only works with GNU xargs, which interprets 0 as: "run as many processes as possible simultaneously" (I'm unclear on how that is defined).

  • Looping over command output with for is generally brittle.

    • It will only work reliably if (a) each whitespace-separated word should be treated as its own argument and (b) these words do not contain globbing characters such as * - see this Bash FAQ entry.
  • Note how all variable references (except those known to contain numbers only) are double-quoted for robustness.

  • The modern command substitution syntax $(...) is used in preference to the legacy backtick syntax `...`.
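The stdin-driven pattern described above can be tried without a MongoDB installation. In this sketch, `printf` stands in for the collection-listing mongo command, and `echo` is placed in front of mongodump so that the command lines are only printed, not executed; the database name `mydb` and the collection names are made up:

```shell
#!/bin/bash
# Collection names arrive on stdin, one per line (printf is a stand-in for
# the mongo command that lists collections).
# xargs appends each name to the end of the command, which is why -c is
# placed last. -P 2 runs up to two commands at a time.
result=$(printf '%s\n' users orders logs |
    xargs -P 2 -n 1 echo mongodump -d mydb -c |
    sort)    # sort, because the completion order of parallel jobs is not fixed

echo "$result"
```

Once the printed command lines look right, removing `echo` runs the real commands.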


As for the GNU parallel command, try the following variation, with stdin input from the collection name-retrieving mongo command, as above:

... | parallel -P 0 -N 1 -q mongodump --host "$h" -q "{_id:{\$gt:$oid}}" -d "$db" -c {1} --out "/data/daily/$TODAY/$h"
  • -N 1 rather than -n 1 allows you to control where the read-from-stdin argument is placed on the command line, using placeholder {1}

  • -q ensures that complex commands containing double quotes are passed through to the shell correctly.

  • Shell variable references are double-quoted to ensure their use as-is.


Troubleshooting xargs / GNU parallel invocations:

Both xargs and GNU parallel support -t (GNU parallel: alias --verbose):

  • -t prints each command line, to stderr, just before it is launched.
  • Caveat: With xargs, input quoting will not be reflected in the command printed, so you won't be able to verify argument boundaries as specified.

xargs -t example:

$ time echo '"echo 1; sleep 1" "echo 2; sleep 2" "echo 3; sleep 1.5"' |
    xargs -t -P 2 -n 1 sh -c 'eval "$1"' -

This yields something like:

sh -c eval "$1" - echo 1; sleep 1
sh -c eval "$1" - echo 2; sleep 2
2
1
sh -c eval "$1" - echo 3; sleep 1.5
3

real    0m2.535s
user    0m0.013s
sys 0m0.013s

Note:

  • The command lines lack the original quoting around the arguments, but invocation is still performed as originally specified.

  • They are printed (to stderr) immediately before the command is launched.

  • As discussed, output from the commands can arrive out of order and unpredictably interleaved.

  • Overall execution took about 2.5 seconds, which breaks down as follows:

    • Due to -P 2, the echo 1; ... and echo 2; ... commands ran in parallel, while the echo 3; ... command, as the 3rd one, was initially held back, because no more than 2 commands are permitted to run at a time.

    • After 1 second, the echo 1; ... command finished, dropping the count of running parallel processes down to 1, triggering execution of the remaining echo 3; ... command.

    • Therefore, because the last command was started 1 second later and ran for 1.5 seconds, that last command finished after ca. 2.5 seconds (whereas the first 2 commands already finished after 1 second and 2 seconds, respectively).
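Because -t writes its trace to stderr, the trace can be separated from the commands' own output with an ordinary redirection. A minimal sketch, using `echo item` as a placeholder command and a temporary file for the trace (no -P here, so the commands run one after another and the order is predictable):

```shell
#!/bin/bash
trace=$(mktemp)   # temporary file that receives the -t trace

# stdout (the commands' output) is captured in $output;
# stderr (the -t trace) is redirected to the temporary file.
output=$(printf '%s\n' a b c | xargs -t -n 1 echo item 2>"$trace")

echo "--- command output (stdout) ---"
echo "$output"
echo "--- -t trace (stderr) ---"
cat "$trace"

trace_lines=$(grep -c 'echo item' "$trace")   # one trace line per invocation
rm -f "$trace"
```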

GNU parallel -t example:

$ time echo $'echo 1; sleep 1\necho 2; sleep 2\necho 3; sleep 1.5' | 
    parallel -P 2 -q -t sh -c 'eval "$1"' -

sh -c eval\ \"\$1\" - echo\ 1\;\ sleep\ 1
sh -c eval\ \"\$1\" - echo\ 2\;\ sleep\ 2
1
sh -c eval\ \"\$1\" - echo\ 3\;\ sleep\ 1.5
2
3

real    0m2.768s
user    0m0.165s
sys 0m0.094s

Note:

  • Because the command uses quotes to demarcate arguments and that demarcation must be passed through to sh, -q must also be specified.

  • The \-quoting may look unusual, but it is correct shell quoting and reflects the command exactly as invoked behind the scenes.

  • GNU parallel expects arguments to be line-separated by default, so the shell commands are passed on individual lines using an ANSI C-quoted string ($'...') with \n escape sequences.

  • Overall processing takes longer than with xargs, which is the - probably negligible - price you pay for GNU parallel's additional features and technological underpinning (Perl script).

  • One of these additional features is the aforementioned output serialization (grouping): the 1st command's output came predictably first, even though it and the 2nd command launched at the same time; the 2nd command's output wasn't printed until after it finished (which is why the diagnostic printing of the 3rd command line showed first).

GNU parallel additionally supports --dry-run, which only prints the commands - to stdout - without actually running them.

$ echo $'echo 1; sleep 1\necho 2; sleep 2\necho 3; sleep 1.5' |
    parallel -P 2 -q --dry-run sh -c 'eval "$1"' -

sh -c eval\ \"\$1\" - echo\ 1\;\ sleep\ 1
sh -c eval\ \"\$1\" - echo\ 2\;\ sleep\ 2
sh -c eval\ \"\$1\" - echo\ 3\;\ sleep\ 1.5
mklement0
  • @basante: That's not exactly an [MCVE (Minimal, Complete, and Verifiable Example)](http://stackoverflow.com/help/mcve) - there's unrelated code and one would have to have MongoDB installed, but your problem is in how you invoke `xargs` / `parallel` - most likely unrelated to `mongo`. I've added a troubleshooting section to my answer. See if this helps you solve the problem; if not, narrow your question down to an MCVE. – mklement0 May 23 '17 at 16:49
0

You just need to add an & at the end of the loop:

for collection in $cols; do
        mongodump --host "$h" -q "{_id:{\$gt:$oid}}" -d "$db" -c "$collection" --out /data/daily/$TODAY/$h
    done &
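Note that `done &` sends the loop as a whole to the background: the iterations inside it still run one after another, and the parallelism comes from several such loops (one per database) running side by side. The sketch below contrasts the two placements of `&`, with a hypothetical `fake_dump` function standing in for mongodump:

```shell
#!/bin/bash
# fake_dump is a hypothetical stand-in for mongodump that takes one second.
fake_dump() { sleep 1; }

start=$SECONDS
for c in a b c; do fake_dump "$c"; done &    # the whole loop is ONE background job
wait
loop_bg=$((SECONDS - start))                 # iterations ran sequentially

start=$SECONDS
for c in a b c; do fake_dump "$c" & done     # each iteration is its own job
wait
each_bg=$((SECONDS - start))                 # iterations ran concurrently

echo "done &: ${loop_bg}s, command &: ${each_bg}s"
```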
basante