I need to write a bash script that runs several Hive queries. Each query produces a directory containing many files.
After all queries have finished, I need to process these files in a specific order.
I want to run the Hive queries in parallel as background processes, since each one might take a couple of hours.
I would also like to parallelize the processing of the resulting files, but there are some complications that I don't know how to handle. For example, I can start processing the results of the first and second queries as soon as each one finishes, but for the third I need to wait until the first two processors are done. The same applies to the fourth and fifth.
I would have no trouble writing such a program in Java, but I don't know how to do it in a shell script.
If someone can give me a hint on how to monitor the execution of these components in a shell script, I would greatly appreciate it.

Gary Greenberg
  • Don't reinvent the wheel. Just use GNU parallel or `xargs -P` – that other guy Jul 09 '18 at 20:14
  • Use [wait](http://man7.org/linux/man-pages/man1/wait.1p.html) for general do-when-all-are-done tasks. You can list specific PIDs to wait for. `wait` will block the script in a friendly manner until it's time to do more stuff. – Paul Hodges Jul 09 '18 at 20:42
  • Check https://stackoverflow.com/questions/356100/how-to-wait-in-bash-for-several-subprocesses-to-finish-and-return-exit-code-0 – lojza Jul 12 '18 at 07:47
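
The `wait` suggestion from the comments could be sketched roughly as below. This is only a minimal illustration for three queries, not a tested Hive workflow: `run_query` and `process_results` are hypothetical placeholders (here they just sleep and write to a temporary log) standing in for the real Hive call and the real file processor. Note that `wait` only works on children of the current shell, so the final wait is done from the main script rather than from a background subshell.

```shell
#!/usr/bin/env bash
# Sketch only: run_query and process_results are hypothetical stand-ins
# for the real Hive invocation and the real result-file processor.

log=$(mktemp)   # temporary log used only to make the ordering visible

run_query() {
    sleep 1                          # placeholder for a long-running Hive query
    echo "query $1 done" >> "$log"
}

process_results() {
    echo "processed $1" >> "$log"    # placeholder for the real processing step
}

# Queries 1 and 2: each is processed as soon as its own query succeeds.
{ run_query 1 && process_results 1; } & p1=$!
{ run_query 2 && process_results 2; } & p2=$!

# Query 3 runs in parallel too, but its processing is held back.
run_query 3 & q3=$!

# "wait PID" blocks until that job ends and returns its exit status,
# so processor 3 starts only if both processors and query 3 succeeded.
if wait "$p1" && wait "$p2" && wait "$q3"; then
    process_results 3
else
    echo "an earlier step failed; skipping processor 3" >&2
fi

cat "$log"
```

Under the same assumptions, a fourth and fifth query would follow the same pattern: start the query in the background, remember its PID with `$!`, and `wait` on whichever PIDs it depends on before starting its processor.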
