I'm trying to load a large amount of raw data into some Hive tables using a parameterized HQL script (foo_bar.hql). The raw data is directory-partitioned by /yyyy/mm/dd, so I wrote a shell script that prints the individual hive commands with date parameters to stdout, one per line. The shell script output looks like this:

nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=01 >/dev/null 2>1& &
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=02 >/dev/null 2>1& &
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=03 >/dev/null 2>1& &
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=04 >/dev/null 2>1& &
...

(The >/dev/null 2>1& & part passes the nohup.out output into oblivion so that it doesn't clog things up and also starts the hive command in the background)

If run on their own, each one of these commands takes a decent amount of time to complete. I've got quite a few that need to be run, so I'm trying to parallelize this entire thing by running a subprocess pool using xargs. My usage is as follows:

bash bar_baz.sh | xargs -n 1 -I CMD -P 5 bash -c CMD

For a reason I cannot ascertain, xargs -P 5 doesn't limit the number of concurrent subprocesses to 5. ALL of the commands printed to stdout by the shell script get executed simultaneously, and Hive subsequently crashes. I feel like it has something to do with nohup, but after looking through the man pages for both xargs and nohup and scouring the interwebs for similar usage examples, I still can't figure out what's happening.

Any help would be greatly appreciated! Thanks!

yungblud

1 Answer

Explanation

For a reason I cannot ascertain, xargs -P 5 doesn't limit the number of concurrent subprocesses to 5, ALL of the commands printed to stdout by the shell script get executed simultaneously,

Actually, they are limited to 5. But since each command is sent to the background immediately (because of the trailing & in the output of your shell script), the bash that xargs starts exits immediately, too. So xargs really does run at most 5 processes at once; it just works through the whole list within a very short time, because each bash it launches finishes almost instantly while the hive job it spawned keeps running in the background.
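To see the effect without Hive, here is a minimal sketch that uses sleep 1 as a hypothetical stand-in for each hive job (the function names and the pool size of 2 are made up for the demo). It feeds the same four jobs to xargs twice, once with a trailing & and once without, and times both runs:

```shell
#!/usr/bin/env bash
# Assumption: `sleep 1` stands in for one long-running hive command.

# Four job lines that background themselves, like the output in the question.
background_jobs() {
  for i in 1 2 3 4; do echo "sleep 1 >/dev/null 2>&1 &"; done
}

# The same four jobs, kept in the foreground.
foreground_jobs() {
  for i in 1 2 3 4; do echo "sleep 1"; done
}

# With the trailing &, each `bash -c` returns at once, so xargs never has
# to wait for a free slot and all four sleeps end up running together.
start=$(date +%s)
background_jobs | xargs -I CMD -P 2 bash -c CMD
background_elapsed=$(( $(date +%s) - start ))

# Without the &, xargs holds each slot until the job itself finishes,
# so four one-second jobs in a pool of 2 need about two seconds.
start=$(date +%s)
foreground_jobs | xargs -I CMD -P 2 bash -c CMD
foreground_elapsed=$(( $(date +%s) - start ))

echo "backgrounded: ${background_elapsed}s, foreground: ${foreground_elapsed}s"
```

The backgrounded run should return almost immediately, while the foreground run takes roughly (number of jobs / pool size) seconds, which is the throttling you were expecting from -P.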

Solution

I suggest that you:

  • remove the & – xargs relies on the processes not being put into the background
  • either:
    1. move the xargs into bar_baz.sh if possible, or
    2. put bash bar_baz.sh | xargs … into another script
  • remove nohup from the single commands
  • run bar_baz.sh (1) or the other script (2) with nohup instead
  • optional: you may also get rid of the output redirections of the single commands, as you can redirect the output of the whole script at once
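Put together, the steps above could look roughly like the sketch below. Treat it as an illustration under assumptions, not a drop-in script: run_all.sh, print_commands, run_all, MAX_JOBS and the HIVE override variable are hypothetical names, and the list of days is a placeholder for whatever bar_baz.sh currently generates.

```shell
#!/usr/bin/env bash
# run_all.sh (hypothetical name): emit one foreground command per day and
# let xargs keep at most MAX_JOBS of them running at a time.

MAX_JOBS=5

# Print one hive command per day: no nohup, no redirections, no trailing &.
# HIVE is a hypothetical override so the script can be dry-run with echo;
# it defaults to the real hive CLI. The days listed are placeholders.
print_commands() {
  for day in 01 02 03 04; do
    echo "${HIVE:-hive} -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=${day}"
  done
}

# Run the commands with bounded parallelism. Each `bash -c` now blocks
# until its hive job finishes, so -P can actually throttle.
run_all() {
  print_commands | xargs -I CMD -P "$MAX_JOBS" bash -c CMD
}
```

With a final run_all line appended to the script, the whole thing can be started once with nohup bash run_all.sh >/dev/null 2>&1 & so that nohup and the redirection apply to the entire batch. Setting HIVE=echo first is a cheap way to check the generated arguments before running the real jobs.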

Side note

Unrelated, but this is wrong, too: the output redirection from STDERR to STDOUT is not 2>1& – it has to be 2>&1.
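A small sketch of the difference, run in a throwaway directory (warn is a hypothetical helper that just writes one line to STDERR):

```shell
#!/usr/bin/env bash
# Work in a temporary directory so the stray file is easy to spot.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper that writes a single line to stderr.
warn() { echo "something went wrong" >&2; }

# 2>1 sends stderr to a regular file literally named "1".
warn >/dev/null 2>1
stray_file_contents=$(cat 1)

# 2>&1 duplicates stderr onto wherever stdout currently points,
# so here the message lands in the command substitution.
captured=$(warn 2>&1)

echo "file named 1 contains: $stray_file_contents"
echo "captured via 2>&1:     $captured"
```

So with 2>1 the error output is not discarded along with STDOUT; it quietly accumulates in a file called 1 in the current directory.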

Stefan Moch
  • Absolutely awesome response, I hugely appreciate the explanations! Also thanks for pointing out that typo - it explains why I have a file named 1 that looks suspiciously close to previous nohup.out files... – yungblud Jun 22 '17 at 21:02