14

Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, each 15MB in size. I loop through the files sequentially in Python: each file is loaded as a list, some operations are run on it, and an output file is written to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.

Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the processing in parallel, since the files are completely independent of each other (no inter-dependencies). My machine has 12 cores.

The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python via related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.

Pseudo Script

    from os import listdir 
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for filename in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write a corresponding csv file to dirB, e.g. input file file99.txt is written as out99.csv
user5199564
  • You won't see a performance increase using threads due to the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock). You'll have to resort to the [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) library. You'll have to start up a Pool and send the file data to the various processes. – rlbond Aug 06 '15 at 20:01
  • Tangential suggestion: instead of using a regexp to pick out the number from `file1.txt` to create `out1.txt`, how about simply `filename.replace('file','out')` (a sketch follows these comments) – MattH Aug 06 '15 at 20:49
  • Thank you rlbond for suggesting multiprocessing. It was good to understand GIL. – user5199564 Aug 07 '15 at 03:05
  • Thank you MattH. That is definitely a much easier solution. Also, I love tangents :) – user5199564 Aug 07 '15 at 03:06
  • user5199564 - maybe you can share the code you have used ? – Epligam Dec 27 '20 at 11:14
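
A minimal sketch of MattH's suggestion above, assuming the output should keep the .csv extension described in the question:

    # Derive the output name from the input name without a regex,
    # e.g. 'file99.txt' becomes 'out99.csv'.
    filename = 'file99.txt'
    outname = filename.replace('file', 'out').replace('.txt', '.csv')
    print(outname)  # out99.csv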

1 Answer

16

To fully utilize your hardware cores, it's better to use the multiprocessing library.

    from multiprocessing import Pool
    from os import listdir
    import csv

    def process_file(filename):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write a corresponding csv file to dirB, e.g. input file file99.txt is written as out99.csv
        pass

    if __name__ == '__main__':
        mypath = "some/path/"

        inputDir = mypath + 'dirA/'
        outputDir = mypath + 'dirB/'

        p = Pool(12)
        p.map(process_file, listdir(inputDir))

Documentation for multiprocessing: https://docs.python.org/2/library/multiprocessing.html
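
For readers who want the placeholder comments filled in, here is a minimal sketch of what process_file could look like. The per-row operations are a placeholder and the use of Python 3 (newline='' in open) is an assumption; the paths, the file/out naming, and the regex idea come from the question.

    import csv
    import re
    from multiprocessing import Pool
    from os import listdir

    mypath = "some/path/"
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    def process_file(filename):
        # Load the input text file as a list of rows using the csv module.
        with open(inputDir + filename, newline='') as f:
            rows = list(csv.reader(f))

        # Placeholder for the real per-row operations from the question.
        processed = rows

        # Pull the integer out of the filename, e.g. 'file42.txt' -> '42'.
        number = re.search(r'\d+', filename).group()

        # Write the corresponding csv file to dirB, e.g. file99.txt -> out99.csv.
        with open(outputDir + 'out' + number + '.csv', 'w', newline='') as f:
            csv.writer(f).writerows(processed)

    if __name__ == '__main__':
        p = Pool(12)
        p.map(process_file, listdir(inputDir))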

Wil Cooley
ruijin
  • Thank you ruijin and wil cooley for the detailed code. I've been trying to incorporate it into my code, but it's failing for an unrelated reason. But when I incorporated your suggested code into a much simpler script, it worked beautifully. – user5199564 Aug 07 '15 at 03:09
  • Do we need to add p.join() or p.close() at the end? – yuhengd Jun 22 '18 at 18:05
  • I usually write it as follows: with multiprocessing.Pool(12) as p: p.map(process_file, listdir(inputDir)). This hasn't given any errors on a py script of mine which uses mp, and the script did its job as I would have expected. Removing the with ... statement and instead writing p = multiprocessing.Pool(12); p.map(...); p.join(); p.close(); complained and the script didn't do its job. Maybe the with ... construction helps someone. – velenos14 Mar 05 '21 at 19:56
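
On the close()/join() question in the comments: join() must be called after close() (or terminate()), which is why reversing the order, as in the last comment, complains. A minimal sketch of both idioms, with a stub process_file standing in for the real work from the answer:

    from multiprocessing import Pool
    from os import listdir

    inputDir = "some/path/dirA/"   # as in the answer above

    def process_file(filename):
        pass   # the real per-file work from the answer goes here

    if __name__ == '__main__':
        # Explicit lifecycle: close() first (no more tasks will be submitted),
        # then join() to wait for the worker processes to exit.
        p = Pool(12)
        p.map(process_file, listdir(inputDir))
        p.close()
        p.join()

        # Or, on Python 3.3+, use the pool as a context manager; map() blocks
        # until all results are ready, and the pool is cleaned up when the
        # with block exits.
        with Pool(12) as p:
            p.map(process_file, listdir(inputDir))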