I have a Python 3 script that uses subprocess.call to run a program on about 2,300 input files in a directory. Each input file produces two output files, which go into two different directories. I would like to learn how to parallelize my script so that several files are processed at the same time. I have been reading about Python's multiprocessing library, but it might be too advanced for me to understand. The script is below; any input is appreciated. Thanks so much!

Script:

import os
import subprocess
import argparse


parser = argparse.ArgumentParser(description="This script aligns DNA sequences in files in a given directory.")
parser.add_argument('--root', default="/shared/testing_macse/", help="PATH to the input directory containing CDS orthogroup files.")
parser.add_argument('--align_NT_dir', default="/shared/testing_macse/NT_aligned/", help="PATH to the output directory for NT aligned CDS orthogroup files.")
parser.add_argument('--align_AA_dir', default="/shared/testing_macse/AA_aligned/", help="PATH to the output directory for AA aligned CDS orthogroup files.")
args = parser.parse_args()


def runMACSE(input_file, NT_output_file, AA_output_file):
    MACSE_command = "java -jar ~/bin/MACSE/macse_v1.01b.jar "
    MACSE_command += "-prog alignSequences "
    MACSE_command += "-seq {0} -out_NT {1} -out_AA {2}".format(input_file, NT_output_file, AA_output_file)
    # print(MACSE_command)
    subprocess.call(MACSE_command, shell=True)

Orig_file_dir = args.root
NT_align_file_dir = args.align_NT_dir
AA_align_file_dir = args.align_AA_dir

# exist_ok=True creates both directories even if one of them already exists
os.makedirs(NT_align_file_dir, exist_ok=True)
os.makedirs(AA_align_file_dir, exist_ok=True)

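for currentFile in os.listdir(Orig_file_dir):
    if currentFile.endswith(".fa"):
        base = currentFile[:-3]  # strip the ".fa" extension
        # os.path.join works whether or not the directory ends with a slash
        runMACSE(os.path.join(Orig_file_dir, currentFile),
                 os.path.join(NT_align_file_dir, base + "_NT_aligned.fa"),
                 os.path.join(AA_align_file_dir, base + "_AA_aligned.fa"))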
tslb14
  • related: [Python threading multiple bash subprocesses?](http://stackoverflow.com/q/14533458/4279) – jfs Jan 30 '16 at 12:41

1 Answer


The subprocess functions run any command-line executable in a separate process; here, you are running Java. multiprocessing runs Python code in separate processes, just as threading runs Python code in separate threads, and the APIs of those two modules are intentionally similar. So multiprocessing is not a substitute for subprocess calls to non-Python programs.

It would be a waste of processes to use multiple Python processes to launch multiple Java processes. You could just as well use multiple threads to make the subprocess calls, or use the asyncio module.
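A minimal sketch of the thread-based option, assuming a worker count of 4 is reasonable for your machine. The names run_commands and macse_command are made up for illustration; the java invocation is copied from the question, rewritten as an argument list so that shell=True is not needed.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_commands(commands, max_workers=4):
    """Run each command (a list of arguments) in its own subprocess.

    At most max_workers subprocesses are alive at once. Returns the exit
    codes in the same order as the commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each thread simply blocks in subprocess.call until its child exits,
        # so the threads cost very little; the real work happens in the children.
        return list(pool.map(subprocess.call, commands))


def macse_command(input_file, nt_output_file, aa_output_file):
    """Build the MACSE command from the question as an argument list."""
    # Without a shell, "~" is not expanded automatically, so expand it here.
    jar = os.path.expanduser("~/bin/MACSE/macse_v1.01b.jar")
    return ["java", "-jar", jar, "-prog", "alignSequences",
            "-seq", input_file,
            "-out_NT", nt_output_file,
            "-out_AA", aa_output_file]
```

To use it, build one macse_command(...) per .fa file in a list and pass that list to run_commands. A nonzero entry in the returned list means the corresponding java process exited with an error.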

Or make your own scheduler. Wrap your for-if in a generator function.

def fa_file(path):
    for currentFile in os.listdir(path):
        if currentFile.endswith(".fa"):
            yield currentFile
fafiles = fa_file(args.root)

Make a list of, say, 10 Popen objects. Sleep for some appropriate interval. Upon waking, loop through the list and replace each finished subprocess (one whose .poll() returns something other than None) with a new one, for as long as next(fafiles) yields another file.
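The scheduler described above could look roughly like the following. It is only a sketch: the function name run_with_limit and the 0.5 second polling interval are assumptions, and it takes ready-made commands (argument lists) rather than file names, so you would build one command per file from the generator first.

```python
import subprocess
import time


def run_with_limit(commands, max_running=10, poll_interval=0.5):
    """Keep up to max_running subprocesses alive until commands runs out.

    commands is any iterable of argument lists (a generator works too).
    Returns a list of (command, exit code) pairs in completion order.
    """
    pending = iter(commands)
    running = []       # (command, Popen) pairs that have not finished yet
    finished = []
    exhausted = False  # becomes True once pending has no more commands

    while running or not exhausted:
        # Fill every free slot with a newly started subprocess.
        while not exhausted and len(running) < max_running:
            try:
                cmd = next(pending)
            except StopIteration:
                exhausted = True
            else:
                running.append((cmd, subprocess.Popen(cmd)))

        # Keep the subprocesses that are still running, record the rest.
        still_running = []
        for cmd, proc in running:
            code = proc.poll()  # None while the child is still running
            if code is None:
                still_running.append((cmd, proc))
            else:
                finished.append((cmd, code))
        running = still_running

        if running:
            time.sleep(poll_interval)

    return finished
```

Because the commands are pulled lazily with next(), only max_running Popen objects exist at any moment, no matter how many files are in the directory.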

EDIT: If you did the image processing in Python code that calls compiled C code (pillow, for instance), then you could use multiprocessing and a Queue loaded with the files to process.

Terry Jan Reedy