I have a script that performs an independent task on each of about 1200 files. It loops through the files and checks whether each one is already complete or in progress; if a file hasn't been done and isn't being actively worked on (which it never would be when only a single instance is running), it performs the task on that file. This follows the general outline below:
myScript.py:

    for file in directory:
        fileStatus = getFileStatus(file)
        # only pick up files that are neither finished nor claimed by another instance
        if fileStatus != 'Complete' and fileStatus != 'inProgress':
            setFileStatus(file, 'inProgress')
            doTask(file)
            setFileStatus(file, 'Complete')
doTask() takes 20-40 minutes per file on my machine, and its RAM usage arcs from minimal at the start, up to about 8GB toward the middle, and back down to minimal at the end; how long that arc takes varies from file to file.
I would like to run this script in parallel with itself so that all tasks are completed in the least amount of time possible, using as much of my machine's resources as I can. Assuming (in my ignorance) that the limiting resource is RAM (of which my machine has 64GB), and that every instance hits its peak RAM consumption at the same time, I could mimic the response to this question like so:
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
python myScript.py &
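(Or, rather than backgrounding eight copies by hand, a small launcher could do the same thing; this is just a sketch of the equivalent, assuming myScript.py sits in the current directory:)

    import subprocess

    N_INSTANCES = 8   # 64GB total / ~8GB peak per instance

    # start N copies of the unmodified script and wait for all of them to finish
    procs = [subprocess.Popen(['python', 'myScript.py']) for _ in range(N_INSTANCES)]
    for p in procs:
        p.wait()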
However, I imagine I could fit more instances in, since at any given moment each process will be at a different point in that RAM arc.
Is there a way to dynamically determine how much of each resource is available and accordingly create, destroy, or pause instances of this script so that the machine is working at maximum efficiency with respect to time? I would like to avoid making changes to myScript.py itself, and instead call it from another script that would handle the creating, destroying, and pausing.
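To illustrate the kind of wrapper I have in mind (a rough, untested sketch, assuming the psutil library, SIGSTOP/SIGCONT on Linux, and a guessed worst-case of 8GB per instance):

    import signal
    import subprocess
    import time

    import psutil   # assumed third-party dependency for memory stats

    HEADROOM = 8 * 1024**3      # guessed worst-case RAM per instance (~8GB)
    MAX_INSTANCES = 16          # hypothetical cap on how many copies to ever start
    CHECK_EVERY = 60            # seconds between resource checks

    running, paused = [], []
    total_started = 0

    while True:
        # forget instances that have exited
        running = [p for p in running if p.poll() is None]
        avail = psutil.virtual_memory().available

        if avail < HEADROOM // 2 and running:
            # memory is getting tight: pause the most recently started instance
            p = running.pop()
            p.send_signal(signal.SIGSTOP)
            paused.append(p)
        elif avail > HEADROOM:
            if paused:
                # plenty of room again: resume a paused instance first
                p = paused.pop()
                p.send_signal(signal.SIGCONT)
                running.append(p)
            elif total_started < MAX_INSTANCES:
                # room for another worst-case instance: start one
                running.append(subprocess.Popen(['python', 'myScript.py']))
                total_started += 1

        if total_started and not running and not paused:
            break               # everything that was started has finished
        time.sleep(CHECK_EVERY)

This is only meant to show the create/pause/resume mechanics; deciding when it is actually safe to start or resume an instance is the part I don't know how to do well, which is really what I'm asking about.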