I'm trying to load a large amount of raw data into some Hive tables using a parameterized HQL script (foo_bar.hql). The raw data is directory-partitioned by /yyyy/mm/dd, so I wrote a shell script that prints the individual hive commands, one per line with the date parameters filled in, to stdout. The shell script output looks like this:
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=01 >/dev/null 2>&1 &
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=02 >/dev/null 2>&1 &
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=03 >/dev/null 2>&1 &
nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=04 >/dev/null 2>&1 &
...
(The >/dev/null 2>&1 & part sends the nohup.out output into oblivion so that it doesn't clog things up, and also starts the hive command in the background.)
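For reference, bar_baz.sh is roughly the following sketch (simplified: the real script derives the date range from the /yyyy/mm/dd directory layout, and the days here are illustrative):

```shell
#!/usr/bin/env bash
# Simplified sketch of bar_baz.sh -- the real script also loops over
# months/years; the date range here is illustrative only.
gen_cmds() {
  for day in $(seq -f '%02g' 1 4); do
    echo "nohup hive -f foo_bar.hql -hiveconf MONTH=06 -hiveconf DAY=${day} >/dev/null 2>&1 &"
  done
}

gen_cmds
```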
Run on its own, each of these commands takes a decent amount of time to complete. I've got quite a few to run, so I'm trying to parallelize the whole thing with a subprocess pool via xargs. My usage is as follows:
bash bar_baz.sh | xargs -n 1 -I CMD -P 5 bash -c CMD
For a reason I cannot ascertain, xargs -P 5 doesn't limit the number of concurrent subprocesses to 5: ALL of the commands printed to stdout by the shell script get executed simultaneously, and Hive subsequently crashes. I feel like it has something to do with nohup, but after reading the man pages for both xargs and nohup and scouring the interwebs for similar usage examples, I still can't figure out what's happening.
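To check whether the backgrounding is the culprit, I put together a toy reproduction with sleep in place of hive (no Hive involved, so the timings are easy to verify):

```shell
#!/usr/bin/env bash
# Each generated line backgrounds its command, so `bash -c` returns
# immediately and xargs starts the next line right away: -P 2 never
# actually throttles anything.
gen_bg() { for i in 1 2 3 4 5 6; do echo "sleep 1 &"; done; }
start=$(date +%s)
gen_bg | xargs -n 1 -I CMD -P 2 bash -c CMD
backgrounded=$(( $(date +%s) - start ))   # finishes almost instantly

# Without the trailing '&', each bash -c blocks until its sleep is done,
# and -P 2 really does cap concurrency at two (6 sleeps -> roughly 3s).
gen_fg() { for i in 1 2 3 4 5 6; do echo "sleep 1"; done; }
start=$(date +%s)
gen_fg | xargs -n 1 -I CMD -P 2 bash -c CMD
foreground=$(( $(date +%s) - start ))

echo "backgrounded=${backgrounded}s foreground=${foreground}s"
```

The backgrounded version blows straight past the -P limit, which matches what I'm seeing with the hive commands, but I'd still like to understand exactly why, and what the right way to structure this is.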
Any help would be greatly appreciated! Thanks!