14

Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, each 15MB in size. I loop through the files sequentially in Python: each file is loaded as a list, some operations are run on it, and an output file is written to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.

Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the processing in parallel, since the files are completely independent of each other (no inter-dependencies). My machine has 12 cores.

The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python via related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.

Pseudo Script

    from os import listdir 
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for filename in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write a corresponding csv file to dirB, e.g. input file file99.txt is written as out99.csv
user5199564
  • You won't see a performance increase using threads due to the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock). You'll have to resort to the [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) library. You'll have to start up a Pool and send the file data to the various processes. – rlbond Aug 06 '15 at 20:01
  • Tangential suggestion: instead of using a regexp to pick out the number from `file1.txt` to create `out1.txt`, how about simply `filename.replace('file','out')` (a sketch follows these comments) – MattH Aug 06 '15 at 20:49
  • Thank you rlbond for suggesting multiprocessing. It was good to understand GIL. – user5199564 Aug 07 '15 at 03:05
  • Thank you MattH. That is definitely a much easier solution. Also, I love tangents :) – user5199564 Aug 07 '15 at 03:06
  • user5199564 - maybe you can share the code you have used ? – Epligam Dec 27 '20 at 11:14
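
A minimal sketch of MattH's suggestion above, assuming the output should keep the .csv extension described in the question:

    # Derive the output name from the input name without a regex,
    # e.g. 'file99.txt' becomes 'out99.csv'.
    filename = 'file99.txt'
    outname = filename.replace('file', 'out').replace('.txt', '.csv')
    print(outname)  # out99.csv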

1 Answer

16

To fully utilize your hardware cores, it's better to use the multiprocessing library.

    from multiprocessing import Pool
    from os import listdir
    import csv

    def process_file(filename):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1 and file42.txt returns 42
        # write a corresponding csv file to dirB, e.g. input file file99.txt is written as out99.csv
        pass

    if __name__ == '__main__':
        mypath = "some/path/"

        inputDir = mypath + 'dirA/'
        outputDir = mypath + 'dirB/'

        p = Pool(12)
        p.map(process_file, listdir(inputDir))

Documentation for multiprocessing: https://docs.python.org/2/library/multiprocessing.html
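
For readers who want the placeholder comments filled in, here is a minimal sketch of what process_file could look like. The per-row operations are a placeholder and the use of Python 3 (newline='' in open) is an assumption; the paths, the file/out naming, and the regex idea come from the question.

    import csv
    import re
    from multiprocessing import Pool
    from os import listdir

    mypath = "some/path/"
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    def process_file(filename):
        # Load the input text file as a list of rows using the csv module.
        with open(inputDir + filename, newline='') as f:
            rows = list(csv.reader(f))

        # Placeholder for the real per-row operations from the question.
        processed = rows

        # Pull the integer out of the filename, e.g. 'file42.txt' -> '42'.
        number = re.search(r'\d+', filename).group()

        # Write the corresponding csv file to dirB, e.g. file99.txt -> out99.csv.
        with open(outputDir + 'out' + number + '.csv', 'w', newline='') as f:
            csv.writer(f).writerows(processed)

    if __name__ == '__main__':
        p = Pool(12)
        p.map(process_file, listdir(inputDir))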

Wil Cooley
ruijin
  • Thank you ruijin and wil cooley for the detailed code. I've been trying to incorporate it into my code, but it's failing for an unrelated reason. But when I incorporated your suggested code into a much simpler script, it worked beautifully. – user5199564 Aug 07 '15 at 03:09
  • Do we need to add p.join() or p.close() at the end? – yuhengd Jun 22 '18 at 18:05
  • I usually write it as follows: with multiprocessing.Pool(12) as p: p.map(process_file, listdir(inputDir)). This hasn't given any errors on a py script of mine which uses mp, and the script did its job as I would have expected. Removing the with ... statement and instead writing p = multiprocessing.Pool(12); p.map(...); p.join(); p.close(); complained and the script didn't do its job. Maybe the with ... construction helps someone. – velenos14 Mar 05 '21 at 19:56
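
On the close()/join() question in the comments: join() must be called after close() (or terminate()), which is why reversing the order, as in the last comment, complains. A minimal sketch of both idioms, with a stub process_file standing in for the real work from the answer:

    from multiprocessing import Pool
    from os import listdir

    inputDir = "some/path/dirA/"   # as in the answer above

    def process_file(filename):
        pass   # the real per-file work from the answer goes here

    if __name__ == '__main__':
        # Explicit lifecycle: close() first (no more tasks will be submitted),
        # then join() to wait for the worker processes to exit.
        p = Pool(12)
        p.map(process_file, listdir(inputDir))
        p.close()
        p.join()

        # Or, on Python 3.3+, use the pool as a context manager; map() blocks
        # until all results are ready, and the pool is cleaned up when the
        # with block exits.
        with Pool(12) as p:
            p.map(process_file, listdir(inputDir))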