So, I have a Python program that runs as desired on Linux. I have started porting it to Windows; however, there is one part that I cannot figure out how to make work.
The program is structured to be highly parallelizable: there is one parent process and many child processes, and I launch the whole thing with a bash script, e.g.:
# set up some necessary folder stuff
...
python -u parent.py --exp_name $arg --args_file $fname > ./../results_$arg/logs/master.log 2>&1 &
sleep 1
for((i=0; i<10; i++)); do
    OMP_NUM_THREADS=1 python -u child.py --id $i --exp_name $arg --args_file $fname > ./../results_$arg/logs/w$i.log 2>&1 &
    sleep 1
done
wait
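For comparison, here is a minimal sketch of how the same launch sequence might look as a Python launcher instead of bash, which would avoid maintaining a separate .cmd version. The helper names (build_child_command, launch_logged, launch_all) are hypothetical and not part of my actual code; the paths and arguments are copied from the script above, with exp_name and args_file standing in for $arg and $fname.

```python
import os
import subprocess
import sys
import time
from pathlib import Path


def build_child_command(worker_id, exp_name, args_file, script="child.py"):
    """Return the argv list equivalent to the `python -u child.py ...` line."""
    return [
        sys.executable, "-u", script,
        "--id", str(worker_id),
        "--exp_name", exp_name,
        "--args_file", args_file,
    ]


def launch_logged(cmd, log_path, extra_env=None):
    """Start cmd in the background with stdout and stderr sent to log_path."""
    env = os.environ.copy()
    env.update(extra_env or {})
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log:
        # The child gets its own copy of the file handle, so closing ours is safe.
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT, env=env)


def launch_all(exp_name, args_file, n_workers=10):
    """Mirror the bash script: start the parent, then the workers, then wait."""
    log_dir = Path("..") / f"results_{exp_name}" / "logs"
    parent_cmd = [sys.executable, "-u", "parent.py",
                  "--exp_name", exp_name, "--args_file", args_file]
    procs = [launch_logged(parent_cmd, log_dir / "master.log")]
    time.sleep(1)
    for i in range(n_workers):
        cmd = build_child_command(i, exp_name, args_file)
        # OMP_NUM_THREADS=1 is only set for the workers, as in the bash loop.
        procs.append(launch_logged(cmd, log_dir / f"w{i}.log", {"OMP_NUM_THREADS": "1"}))
        time.sleep(1)
    for proc in procs:
        proc.wait()
```

Calling launch_all with the experiment name and args file would stand in for the bash script. Using sys.executable rather than a bare python means the children run under the same interpreter as the launcher, which matters if python is not on PATH on the Windows machine.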
Now for the weird part. The parallel child.py scripts will, over time, consume quite a lot of the system's resources, so I have things set up such that every so often the parent program signals the child processes to die and restart. I do this in the following way:
while not done:
    try:  # This try/except block handles a manual Ctrl-C kill.
        while not child.hasTask():
            # If you're waiting for a task and
            # your alive flag is removed by the parent,
            # kill yourself.
            if not os.path.exists(child.alive) or os.path.exists(child.alive + '.cycle'):
                done = True
                break
        ...
if os.path.exists(child.alive + '.done'):
    print("task completely finished")
else:
    if os.path.exists(child.alive + '.cycle'):
        os.remove(child.alive + '.cycle')
    print(f"refreshing worker {child.id}")
    os.system(f'bash refreshWorker.sh {line_args.exp_name} {line_args.args_file} {line_args.id}')
# END OF SCRIPT
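To make the flag-file check above easier to test in isolation, here is a minimal, self-contained sketch of just the waiting logic. The names stop_requested and wait_for_task are hypothetical, has_task stands in for child.hasTask, and the poll_seconds sleep is an assumption on my part (the elided part of the real loop may pace itself differently).

```python
import os
import time


def stop_requested(alive_path):
    """True if the parent removed the alive flag or dropped a '.cycle' flag."""
    return (not os.path.exists(alive_path)) or os.path.exists(alive_path + ".cycle")


def wait_for_task(has_task, alive_path, poll_seconds=0.5):
    """Wait until has_task() is true; give up early if a stop is requested.

    Returns True when a task is available and False when the worker
    should exit instead (alive flag gone, or a cycle was requested).
    """
    while not has_task():
        if stop_requested(alive_path):
            return False
        time.sleep(poll_seconds)
    return True
```

Nothing here is platform-specific: it only relies on os.path.exists, so this part of the worker should behave the same on Windows as on Linux.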
The refreshWorker bash script simply launches a new python process with the same parameters as the one that just finished:
OMP_NUM_THREADS=1 python -u child.py --id $i --exp_name $arg --args_file $fname > ./../results_$arg/logs/w$i.log 2>&1 &
This all works.
I've been playing around with Windows, and am finding that I cannot replicate this structure easily. For example, simply changing the bash scripts to command scripts via:
START /B "" python -u child.py {...insert args}
and then having the child script call os.system("cmd refreshWorkers.cmd {...insert args}") is not working in the small test cases I have (i.e. foo.cmd --> bar.py (die) --> baz.cmd (get revived) --> bar.py (die for real)).
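One alternative I could try is to skip the refresh script entirely and have the dying worker start its replacement directly from Python. Below is a minimal sketch of that idea, not something I have verified on my Windows setup. The names detach_kwargs and respawn are hypothetical; the creation flags are the standard ones from the subprocess module, and I am assuming they are enough for the new process to outlive the old one.

```python
import os
import subprocess
import sys


def detach_kwargs():
    """Popen keyword arguments intended to let the new process outlive this one."""
    if os.name == "nt":
        # These constants only exist on Windows, so they are looked up here.
        flags = subprocess.CREATE_NEW_PROCESS_GROUP | subprocess.DETACHED_PROCESS
        return {"creationflags": flags}
    # POSIX: put the child in its own session instead.
    return {"start_new_session": True}


def respawn(cmd, log_path, extra_env=None):
    """Start cmd as a detached background process that logs to log_path."""
    env = os.environ.copy()
    env.update(extra_env or {})
    # Append so the previous worker's output is kept; use "w" to mirror bash `>`.
    with open(log_path, "a") as log:
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            env=env,
            **detach_kwargs(),
        )
```

In the worker, the os.system call would then become something like respawn([sys.executable, "-u", "child.py", "--id", str(line_args.id), "--exp_name", line_args.exp_name, "--args_file", line_args.args_file], log_path, {"OMP_NUM_THREADS": "1"}), where log_path is the same w$i.log file the shell script writes to.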
If the actual files would be helpful:
This is the launch point.
This is the child/worker program where it calls the refresh script and then dies.
This is the bash script that launches a just-killed worker script.