
I have over 10,000 files that I need to open, and in some of them I need to delete part of the data. I tried to do it with a thread pool, but judging from the time it's taking, I don't think it's working.

from multiprocessing.pool import ThreadPool

def readwrite(file):
    with open(file, 'rb') as f:
        pass  # check something
    # if the check is True continue, else return
    with open(new_file, 'wb') as f:  # new_file: path for the rewritten copy
        with open(file, 'rb') as g:
            pass  # here I write only the lines I need from the first file

pool = ThreadPool(40)
for file in files:
    pool.apply_async(readwrite, (file,))
  • Read/write operations are usually the bottleneck of any multithreaded solution, but in your code they seem to be the only operation. Under these conditions, multithreading turns the code into one single bottleneck, if not a narrower one. – Olvin Roght Jan 25 '22 at 22:10
  • When you try it with 20 files, are the modified files correct? In other words, does your solution produce the correct output, even if *slow*? – wwii Jan 25 '22 at 22:12
  • Did you try using `concurrent.futures.ThreadPoolExecutor`? – wwii Jan 25 '22 at 22:15
  • How can I be sure that the problem is in the I/O operations? I just checked a couple of the outputs and they look correct, and I have not tried ThreadPoolExecutor. – idonthavename Jan 25 '22 at 22:22
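
For reference, a minimal sketch of the ThreadPoolExecutor approach suggested above, assuming readwrite and files are the function and path list from the question:

from concurrent.futures import ThreadPoolExecutor

# Sketch only: `readwrite` and `files` are assumed to come from the question
with ThreadPoolExecutor(max_workers=40) as executor:
    # map() schedules every file and blocks until all of them finish
    list(executor.map(readwrite, files))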

2 Answers


Look at this example from the docs:

pool.apply_async(f, (20,)) # runs in *only* one process

As it says, such a call uses only one process/thread from the pool.

You should try pool.map() instead. Here is an example:

from multiprocessing import Pool

def readwrite(filename):
    pass  # your code here

if __name__ == '__main__':
    # filenames is your list of file paths
    with Pool(5) as p:
        results = p.map(readwrite, filenames)
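
Since the work here is mostly file I/O, the same map() pattern also works with the thread-backed pool from the question; ThreadPool exposes the same interface as Pool:

from multiprocessing.pool import ThreadPool

# Drop-in, thread-based variant of the example above;
# `readwrite` and `filenames` are assumed to be defined as above
if __name__ == '__main__':
    with ThreadPool(40) as p:
        results = p.map(readwrite, filenames)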
Alex Kosh

You can't do that: if you have over 10,000 files, that potentially means over 10,000 threads, and that's too much for a regular computer to handle on a single core. If you're going to use threads, it's better to go over the files one by one; you could calculate how long it will take, though, and keep track of how many files you have rewritten. The processing and computing power is what it is. You could try to maximize it with the multiprocessing module if you have several cores on your CPU ("The 'multi' in multiprocessing refers to the multiple cores in a computer's central processing unit (CPU)"); usually that's 2 or 4 cores. You do it like in the example below, from How to use multiprocessing pool.map with multiple arguments:

from multiprocessing import Pool

def multi_run_wrapper(args):
    # unpack the argument tuple and forward it to add()
    return add(*args)

def add(x, y):
    return x + y

if __name__ == "__main__":
    pool = Pool(4)
    results = pool.map(multi_run_wrapper, [(1, 2), (2, 3), (3, 4)])
    print(results)
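
On Python 3 the wrapper function isn't needed; Pool.starmap() unpacks the argument tuples itself:

from multiprocessing import Pool

def add(x, y):
    return x + y

if __name__ == "__main__":
    with Pool(4) as pool:
        # starmap() unpacks each tuple into positional arguments for add()
        results = pool.starmap(add, [(1, 2), (2, 3), (3, 4)])
    print(results)  # [3, 5, 7]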
DanielGzgzz
  • pool = ThreadPool(40) – doesn't that mean the maximum number of threads will be 40? And I'm using apply_async because I don't have a list of all the files; they are in different folders and I'm getting them all with os.walk. I could add them to a list and then use pool.map(func, files), but why? – idonthavename Jan 25 '22 at 22:31
  • ThreadPool means you open at most 40 threads, on the same core. `Thread - multiple programs on the same processor` `Multiprocess - using another processor at the same time` – DanielGzgzz Jan 25 '22 at 23:09
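
Following up on the os.walk point above: the traversal and the pool separate cleanly, so you can build the list of paths first and then map over it. A minimal sketch, where root_dir is a hypothetical top-level directory and readwrite stands in for the question's function:

import os
from multiprocessing.pool import ThreadPool

def readwrite(file):
    pass  # the question's check-and-rewrite logic goes here

if __name__ == '__main__':
    root_dir = 'data'  # hypothetical root directory, for illustration
    # collect every file path from all subfolders before starting the pool
    files = [os.path.join(dirpath, name)
             for dirpath, _, names in os.walk(root_dir)
             for name in names]
    with ThreadPool(40) as pool:
        pool.map(readwrite, files)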