I would like your input on the following:
A short while ago I started a job at a new company, which runs processes on a cluster. There is an existing pipeline, already implemented, that roughly does the following:
- Big files are stored roughly 200 per hard disk (~130 GB per file).
- Because there is a disk quota on the cluster and copying is very I/O intensive, I have to limit myself to copying only one file over at a time.
- A manager Java program creates a pull script to pull the big files across the network (from the NAS to the cluster).
- After pulling, the analysis pipeline runs on the cluster (a black-box process to me).
- Next, an 'am I finished' script checks whether the process has completed on the cluster. If it hasn't, the script sleeps for 10 minutes and checks again; once it has finished, the big file is removed (also a black-box script to me, but conceptually something like the sketch below).
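
I don't have that last script's internals, but as far as I can tell it boils down to a check-and-sleep loop like this (a minimal sketch; `is_job_finished` is a hypothetical stand-in for whatever the black-box check actually does):

```python
import os
import time

def wait_and_clean(big_file, check_interval=600):
    """Poll until the cluster job for big_file is done, then free the quota."""
    while not is_job_finished(big_file):  # hypothetical black-box check
        time.sleep(check_interval)        # sleep 10 minutes between checks
    os.remove(big_file)                   # remove the big file once finished
```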
At the moment I've written a very simple manager program in Python that does exactly this, one file at a time: once a file's pipeline is done, it copies over the next file in the job list and repeats, roughly like the sketch below.
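
In outline (`pull_file` and `submit_to_cluster` are hypothetical wrappers around the existing pull script and the cluster submission; `wait_and_clean` is the polling loop from above):

```python
def process_file(big_file):
    pull_file(big_file)          # run the generated pull script (NAS -> cluster)
    submit_to_cluster(big_file)  # kick off the black-box analysis pipeline
    wait_and_clean(big_file)     # poll every 10 min, remove the file when done

for big_file in job_list:        # strictly sequential: one big file at a time
    process_file(big_file)
```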
I would now like to expand this program so that it copies over 5 big files at once (maybe more later), submits them to the cluster, and only removes a file and frees its slot once its process has finished running.
While looking for a solution, I've seen people mention multithreading or multiprocessing, specifically using a pool of workers. I have no experience with this yet (but one can learn, right?), and I think it would be a viable option in this case. My question is: how do I set up a pool of 5 workers so that each worker performs the series of tasks above and, once completed, takes a new big file from the queue and iterates? Something like the sketch below is what I have in mind.
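
A sketch of what I imagine, assuming each big file can be handled independently and reusing the hypothetical `process_file` from above (I'm not sure this is the right approach):

```python
from multiprocessing import Pool

if __name__ == "__main__":
    job_list = [f"bigfile_{i:03d}.dat" for i in range(200)]  # illustrative names
    # 5 workers; each one runs process_file (pull -> submit -> poll -> remove)
    # on one big file, then immediately picks up the next file from the queue.
    with Pool(processes=5) as pool:
        for finished in pool.imap_unordered(process_file, job_list):
            print("finished:", finished)
```

Since the work is almost entirely I/O and sleeping rather than computation, would a `concurrent.futures.ThreadPoolExecutor` with `max_workers=5` be the simpler choice here, or is multiprocessing preferred?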