
I want to start two different Python scripts (the TensorFlow Object Detection train.py and eval.py) in parallel on different GPUs, and kill eval.py once train.py has completed.

I have the following code to start the two subprocesses in parallel (based on How to terminate a python subprocess launched with shell=True), but both subprocesses end up on the same device. (I can guess why; I just don't know how to start them on different devices.)

start_train = "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 train.py ..."

start_eval = "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 eval.py ..."

commands = [start_train, start_eval]

procs = [subprocess.Popen(i, shell=True, stdout=subprocess.PIPE, preexec_fn=os.setsid) for i in commands]
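(For reference, if I read the subprocess docs correctly, I could also pin each script to a GPU by passing the variables through Popen's env argument instead of prefixing the shell command; just a sketch, not tested:)

import os
import subprocess

# Same commands as above, minus the CUDA_* prefix; the device is chosen
# through the environment passed to each child instead.
train_cmd = "train.py ..."
eval_cmd = "eval.py ..."

base_env = os.environ.copy()
base_env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

procs = [
    subprocess.Popen(train_cmd, shell=True, stdout=subprocess.PIPE,
                     preexec_fn=os.setsid,
                     env=dict(base_env, CUDA_VISIBLE_DEVICES="0")),
    subprocess.Popen(eval_cmd, shell=True, stdout=subprocess.PIPE,
                     preexec_fn=os.setsid,
                     env=dict(base_env, CUDA_VISIBLE_DEVICES="1")),
]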

After this point I don't know how to proceed. Do I need something like the loop below? Should I use p.communicate() instead to avoid deadlocks? Or is it enough to call wait() or communicate() only on train.py, since its completion is all I need?

for p in procs:
    p.wait() # I assume waiting here doesn't stop the processes from running in parallel
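
(As an aside on the deadlock worry: my understanding is that with stdout=subprocess.PIPE, a child that prints a lot can fill the pipe buffer and block if the pipe is never read, so wait() would never return. A sketch of how I could avoid that, assuming I don't actually need the captured output; subprocess.DEVNULL is just one option, a log file handle would work too:)

train_proc = subprocess.Popen(start_train, shell=True, preexec_fn=os.setsid,
                              stdout=subprocess.DEVNULL)  # don't capture output I never read
eval_proc = subprocess.Popen(start_eval, shell=True, preexec_fn=os.setsid,
                             stdout=subprocess.DEVNULL)

train_proc.wait()  # blocks only on train.py; eval.py keeps running

# Alternatively, communicate() keeps reading the pipes while it waits,
# so it cannot deadlock on a full pipe:
# out, err = train_proc.communicate()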

Then I need to use the return code somehow. I don't need a return value from train.py itself, only the return code of its subprocess. From the Popen.returncode documentation, wait() and communicate() look like they need the return code to be set, and I don't understand how to set it. I would prefer something like

if train is done without any error:
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM) 
else:
    write the error to the console, or to a file (but how?)

OR?

train_return = procs[0].wait()
if train_return == 0:
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM) 
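
(Re-reading the Popen.returncode docs, I think wait() both sets returncode and returns the exit code, so there is nothing for me to set myself. A rough sketch of what I mean, with a made-up error-file name:)

import signal

# wait() blocks until train.py exits; it sets procs[0].returncode and also
# returns the exit code, so nothing has to be set manually.
train_return = procs[0].wait()

if train_return == 0:  # train.py finished without error
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM)  # stop eval.py's process group
else:
    # hypothetical file name; the actual traceback could be captured by
    # passing stderr=open(...) to the train.py Popen instead
    with open("train_errors.txt", "w") as f:
        f.write("train.py exited with code %d\n" % train_return)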

UPDATE AFTER SOLVING THE PROBLEM:

This is my main:

if __name__ == "__main__":
    exp = 1
    go = True
    while go:


        create_dir(os.path.join(MAIN_PATH,'kitti',str(exp),'train'))
        create_dir(os.path.join(MAIN_PATH,'kitti',str(exp),'eval'))


        copy_tree(os.path.join(MAIN_PATH,"kitti/eval_after_COCO"), os.path.join(MAIN_PATH,"kitti",str(exp),"eval"))
        copy_tree(os.path.join(MAIN_PATH,"kitti/train_after_COCO"), os.path.join(MAIN_PATH,"kitti",str(exp),"train"))

        err_log = open('./kitti/'+str(exp)+'/error_log' + str(exp) + '.txt', 'w')

        train_command = (CUDA_COMMAND_PREFIX + "0 python3 " + str(MAIN_PATH) + "legacy/train.py "
                         + "--logtostderr --train_dir " + str(MAIN_PATH) + "kitti/" + str(exp) + "/train/ "
                         + "--pipeline_config_path " + str(MAIN_PATH) + "kitti/faster_rcnn_resnet101_coco.config")
        eval_command = (CUDA_COMMAND_PREFIX + "1 python3 " + str(MAIN_PATH) + "legacy/eval.py "
                        + "--logtostderr --eval_dir " + str(MAIN_PATH) + "kitti/" + str(exp) + "/eval/ "
                        + "--pipeline_config_path " + str(MAIN_PATH) + "kitti/faster_rcnn_resnet101_coco.config "
                        + "--checkpoint_dir " + str(MAIN_PATH) + "kitti/" + str(exp) + "/train/")

        os.system("python3 dataset_tools/random_sampler_with_replacement.py --random_set_id " + str(exp))
        time.sleep(20)
        update_train_set(exp)



        train_proc = subprocess.Popen(train_command,
                                  stdout=subprocess.PIPE,
                                  stderr=err_log, # write errors to a file
                                  shell=True)
        time.sleep(20)      
        eval_proc = subprocess.Popen(eval_command,
                                     stdout=subprocess.PIPE,
                                     preexec_fn=os.setsid,  # own process group, so the killpg below only hits eval.py
                                     shell=True)
        time.sleep(20)

        if train_proc.wait() == 0: # successful termination
            os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)

        clean_train_set(exp)
        time.sleep(20)
        exp += 1
        if exp == 51:
            go = False
kneazle

1 Answer


By default, TensorFlow assigns operations to "/gpu:0" (or "/cpu:0") even if you have multiple GPUs. The only way to solve it is to assign each operation in one of your scripts to the second GPU manually, using a context manager:

with tf.device("/gpu:1"):
    # your ops here
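
A slightly fuller sketch of the same idea (TF1-style, with placeholder ops; log_device_placement just prints the device each op was actually assigned to):

import tensorflow as tf

with tf.device("/gpu:1"):
    # placeholder ops; in practice this wraps the model-building code
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    c = a + b

# allow_soft_placement falls back to another device if /gpu:1 is unavailable;
# log_device_placement logs where each op was placed
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))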

UPDATE

If I understand you correctly, what you need is the following:

import subprocess
import os
err_log = open('error_log.txt', 'w')
train_proc = subprocess.Popen(start_train,
                              stdout=subprocess.PIPE,
                              stderr=err_log, # write errors to a file
                              shell=True)
eval_proc = subprocess.Popen(start_eval,
                             stdout=subprocess.PIPE,
                             shell=True)

if train_proc.wait() == 0: # successful termination
    os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)
# else, errors will be written to the 'err_log.txt' file
Vlad
  • I can start the scripts on the preferred GPUs with that command. However, my question was different. – kneazle Nov 21 '18 at 16:10
  • Thanks! It is working. I have one question though: I want to use this setup in a loop (say 100 times). In other words, after train is done and eval is killed, I want to start another pair of subprocesses just like those. However, after eval is killed, nothing starts again and “Terminated” is printed on my console. How can I prevent this from happening and start the next pair of processes after eval is killed? – kneazle Nov 24 '18 at 21:52
  • There is no error in the log file, it says that train is finished and the model is saved to disk. On the console, there is no error again, it only says “Terminated”. I am adding my main method to my first question since it’s too long for a comment – kneazle Nov 25 '18 at 13:22
  • I think `os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)` is killing the main process hence it can't go beyond that line. I tried with try/excepts, however I couldn't catch an error yet – kneazle Nov 26 '18 at 10:31
  • 1
    instead of `os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)` use `eval_proc.kill()`. – Vlad Nov 26 '18 at 11:47
  • I think `eval_proc.kill()` can't kill the process; the next eval process starts while the previous one is still running, which causes a memory error – kneazle Nov 26 '18 at 13:40
  • this should help `eval_proc = subprocess.Popen('exec ' + start_eval, stdout=subprocess.PIPE, shell=True)` and `eval_proc.kill()` – Vlad Nov 26 '18 at 13:51
  • adding `preexec_fn=os.setsid` and using `os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)` is working. Thanks a lot! – kneazle Nov 26 '18 at 13:52