
I have twelve sub-directories and a Python script to be run for each sub-directory. If I run it one by one it takes a long time, so I want to run the same code on all twelve sub-directories simultaneously.

My computer has two physical CPUs, each with 12 cores.

Each sub-directory has several files to be processed. I tried the following code, which works, but it processes the sub-directories one by one:

import glob
import numpy as np
from concurrent.futures import ThreadPoolExecutor

wd = "/data0/"
sub_directories = [wd + "jan/", wd + "feb/", wd + "mar/", wd + "apr/", wd + "may/", wd + "jun/",
                   wd + "jul/", wd + "aug/", wd + "sep/", wd + "oct/", wd + "nov/", wd + "dec/"]

def processor(file):
    data = np.fromfile(file)
    # do some calculations
    result = np.array([])  # placeholder for the computed result
    np.save(file + ".npy", result)  # np.save takes the filename first

for sub_directory in sub_directories:
    files = glob.glob(sub_directory + "*.txt")
    with ThreadPoolExecutor(max_workers=5) as tpe:
        tpe.map(processor, files)

How can I run this code on all twelve sub-directories simultaneously?

Sara
  • Are you sure that's really a good idea? If reading the data takes more time than processing it, you will only add congestion by having multiple threads compete for access to the disk. (With `multiprocessing` at least you would be able to use both CPUs, but with threading, you are confined to one.) – tripleee May 03 '23 at 04:20
  • @tripleee Actually there is much more processing and it takes longer than reading the data. – Sara May 03 '23 at 04:22
  • Change `ThreadPoolExecutor` to `ProcessPoolExecutor`, also bump up `max_workers=12`. – Michael Ruth May 03 '23 at 05:01
  • @MichaelRuth Thanks, that works well for one sub-directory at a time. How can I adapt it to work on all sub-directories simultaneously? – Sara May 03 '23 at 05:06
  • @tripleee "but with threading, you are confined to one" what? – n. m. could be an AI May 03 '23 at 05:09
  • @n.m. The Python GIL prevents you from using more than one CPU in a single process. – tripleee May 03 '23 at 05:16
  • @tripleee oh you don't mean *physical* CPU? Alright then. – n. m. could be an AI May 03 '23 at 05:47
  • @daniel, if you have 12 directories to process, `max_workers=12` will result in the `Executor` assigning one directory to each worker. If the workers are processes and you have at least 13 cores, the directories will be processed simultaneously. – Michael Ruth May 03 '23 at 06:06
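A minimal sketch of what Michael Ruth suggests in the comments above: switch to `ProcessPoolExecutor`, raise `max_workers` to 12, and map over the sub-directories instead of the files, so each worker process handles one directory. The `process_directory` name and the per-file body are placeholders standing in for the actual calculations.

import glob
import numpy as np
from concurrent.futures import ProcessPoolExecutor

wd = "/data0/"
sub_directories = [wd + m + "/" for m in
                   ("jan", "feb", "mar", "apr", "may", "jun",
                    "jul", "aug", "sep", "oct", "nov", "dec")]

def process_directory(sub_directory):
    # each worker process handles every file in one sub-directory
    for file in glob.glob(sub_directory + "*.txt"):
        data = np.fromfile(file)
        # do some calculations on `data` here, then save the result
        result = np.array([])  # placeholder
        np.save(file + ".npy", result)

if __name__ == "__main__":
    # one process per directory, so all twelve run at the same time
    with ProcessPoolExecutor(max_workers=12) as ppe:
        list(ppe.map(process_directory, sub_directories))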

1 Answer


Change your processor() function so that it handles one month at a time; that function should run the glob() call itself:

from glob import glob
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from os.path import join
from calendar import month_name

EXECUTOR = ProcessPoolExecutor # set this to ThreadPoolExecutor for multithreading
WD = '/data0'
MONTHS = [m.lower()[:3] for m in month_name[1:]]

def processor(month):
    for file in glob(join(WD, month, '*.txt')):
        pass # process the file here

def main():
    with EXECUTOR() as executor:
        executor.map(processor, MONTHS)

if __name__ == '__main__':
    main()

ThreadPoolExecutor and ProcessPoolExecutor are interchangeable here, so just set EXECUTOR to whichever mechanism you prefer.
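As the comments below suggest, it is worth adding some timing to compare the two. A minimal sketch, reusing processor and MONTHS from the answer above; timed_run is a hypothetical helper:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def timed_run(executor_cls):
    start = time.perf_counter()
    with executor_cls() as executor:
        # list() consumes the results so any exception in a worker is raised here
        list(executor.map(processor, MONTHS))
    return time.perf_counter() - start

if __name__ == '__main__':
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        print(cls.__name__, timed_run(cls))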

DarkKnight
  • Would it be reasonable to apply a ProcessPoolExecutor across the 12 months and a ThreadPoolExecutor on the files inside each month? (A sketch of this hybrid approach follows these comments.) – Sara May 03 '23 at 05:35
  • That will depend on exactly what you're doing with the files. The code I've shown is easily interchangeable between multiprocessing and multithreading. Try both. Add some timing to see what works best for you. I would not run discrete threads for each file because you're almost certainly going to run into GIL issues, thereby reducing overall performance. – DarkKnight May 03 '23 at 05:40
  • Can I remove the last two lines, `if __name__ == '__main__': main()`, and use `with EXECUTOR() as executor: executor.map(processor, MONTHS)` without putting it into a `def`? – Sara May 03 '23 at 05:48
  • @Daniel That depends on your platform. I use macOS so it's necessary. Not so in Windows. What I've shown is a pattern that I always use, and it makes the code more portable. – DarkKnight May 03 '23 at 05:52
  • See also https://stackoverflow.com/a/69778466 which explains the purpose of `if __name__ == '__main__'`. – tripleee May 03 '23 at 05:53
  • Is there something more that can be done using joblib (https://joblib.readthedocs.io/en/latest/) than with `concurrent.futures`? – Sara May 03 '23 at 06:26
  • @Daniel Depends on your use-case. Have you tried the pattern in my answer? Was it still too slow? Have you incorporated timing statistics to see where the bottleneck/s is/are? – DarkKnight May 03 '23 at 07:31
  • @DarkKnight I tried your pattern; the speed improved slightly over the per-file pattern I was using. However, I expected a much bigger improvement... – Sara May 03 '23 at 07:58
  • @Daniel Was there a significant difference between multithreading and multiprocessing? Did you constrain the number of threads/processes? – DarkKnight May 03 '23 at 08:01
  • @DarkKnight multiprocessing was somewhat faster than multithreading – Sara May 03 '23 at 08:04
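For reference, a minimal sketch of the hybrid approach discussed in the comments above, under the same assumptions as the answer: a ProcessPoolExecutor across the twelve months, with each worker process running its own small ThreadPoolExecutor over that month's files. process_file and the worker counts are placeholders, and, per the GIL caveat in the comments, the inner threads only help if the per-file work releases the GIL (e.g. NumPy I/O and large-array operations).

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from glob import glob
from os.path import join
from calendar import month_name

WD = '/data0'
MONTHS = [m.lower()[:3] for m in month_name[1:]]

def process_file(file):
    pass  # per-file calculations go here

def processor(month):
    # each worker process fans its month's files out to a small thread pool
    files = glob(join(WD, month, '*.txt'))
    with ThreadPoolExecutor(max_workers=4) as tpe:
        list(tpe.map(process_file, files))

def main():
    # one process per month, twelve in flight at once
    with ProcessPoolExecutor(max_workers=12) as ppe:
        list(ppe.map(processor, MONTHS))

if __name__ == '__main__':
    main()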