I need to write a shell (bash) script that runs several Hive queries.
Each of the queries will produce a directory with a lot of files.
After all queries are finished I need to process all these files in a specific order.
I want to run the Hive queries in parallel as background processes, since each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some complications that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as each of them finishes, but for the third I need to hold off until the first two processors are done. The same applies to the fourth and fifth.
I wouldn't have any problem writing such a program in Java, but how to do it in shell beats me.
If someone can give me a hint on how I can monitor the execution of these components in a shell script, I would appreciate it greatly.
Gary Greenberg
- Don't reinvent the wheel. Just use GNU parallel or `xargs -P` – that other guy Jul 09 '18 at 20:14
- Use [wait](http://man7.org/linux/man-pages/man1/wait.1p.html) for general do-when-all-are-done tasks. You can list specific PIDs to wait for. `wait` will halt the script in a friendly manner until it's time to do more stuff. – Paul Hodges Jul 09 '18 at 20:42
- Check https://stackoverflow.com/questions/356100/how-to-wait-in-bash-for-several-subprocesses-to-finish-and-return-exit-code-0 – lojza Jul 12 '18 at 07:47
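
To make the `wait` suggestion concrete, here is a minimal sketch of the dependency described in the question, limited to three queries. It rests on assumptions: `run_query` and `process_results` are hypothetical stand-ins (a real `run_query` would call something like `hive -f` on a query file), the `sleep` durations only simulate long queries, and the third processor is assumed to need its own query plus the first two processors to have succeeded.

```shell
#!/usr/bin/env bash
# Sketch only: run_query and process_results are hypothetical stand-ins.
# A real run_query might call something like: hive -f "query$1.hql"

order_log=$(mktemp)   # records the order in which steps finish

run_query() {
    sleep "$2"                            # simulate a long-running query
    echo "query $1 done" >> "$order_log"
}

process_results() {
    echo "processed $1" >> "$order_log"
}

# Queries 1 and 2: each subshell processes its results as soon as
# its own query succeeds, independently of the other.
( run_query 1 0.3 && process_results 1 ) & p1=$!
( run_query 2 0.1 && process_results 2 ) & p2=$!

# Query 3 starts immediately too, but its processing is deferred.
run_query 3 0.2 & q3=$!

# "wait PID" blocks until that child exits and returns its exit status,
# so waiting one PID at a time lets us notice any failure.
failed=0
for pid in "$p1" "$p2" "$q3"; do
    wait "$pid" || failed=1
done

if [ "$failed" -eq 0 ]; then
    process_results 3
else
    echo "an earlier step failed; skipping processor 3" >&2
fi
```

The PIDs are collected with `$!` in the main shell because `wait` can only wait for children of the shell that calls it. The same pattern should extend to the fourth and fifth queries: start them in the background up front, then `wait` on whichever earlier PIDs they depend on before launching their processors.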