BACKGROUND:
I have a huge .txt file which I have to process. It is a data mining project.
So I've split it into many .txt files, each 100MB in size, saved them all in the same directory, and managed to run them this way:
import os
from multiprocessing.dummy import Pool  # thread-based Pool, used in the attempts below

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
In process, I parse the file into a list of objects, then I apply another function to it. This is SLOWER than running the whole file as is. But for big enough files I won't be able to run it all at once, so I will have to slice. So I want to use threads, since I don't have to wait for each process(filename) call to finish.
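For context, process is roughly this shape (parse_line and apply_function are stand-ins for my actual parsing and processing steps):

def process(filename):
    # parse each line of the file into an object
    with open(os.path.join(pathToFile, filename)) as f:
        objects = [parse_line(line) for line in f]
    # then apply the second function to the parsed objects
    return apply_function(objects)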
How can I apply that? I've checked this but I didn't understand how to apply it to my code... Any help would be appreciated. I looked here to see how to do this. What I've tried:
pool = Pool(6)
futures = []
for x in range(6):
    # args must be a tuple, otherwise the filename string is unpacked character by character
    futures.append(pool.apply_async(process, (filename,)))
Unfortunately I realized it will only do the first 6 text files, or will it not? How can I make it work? As soon as a thread is done, it should be assigned another text file and start running it.
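I'm guessing the behavior I want looks something like this sketch, where every file is queued up front and each of the 6 workers pulls the next file as soon as it finishes one (assuming process and pathToFile as above), but I'm not sure:

from multiprocessing.dummy import Pool  # thread-based Pool
import os

txt_files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
pool = Pool(6)                          # 6 worker threads
results = pool.map(process, txt_files)  # blocks until every file is processed
pool.close()
pool.join()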
EDIT:
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
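Or should it instead be something like the following, submitting each file once and passing the function and its argument separately? I suspect pool.apply_async(process(filename)) calls process immediately in the main thread, and the range(6) loop would queue the same file 6 times:

pool = Pool(6)
async_results = []
for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        # hand the pool the function and its argument; a worker thread
        # will call process(filename) when one becomes free
        async_results.append(pool.apply_async(process, (filename,)))
pool.close()  # no more tasks will be submitted
pool.join()   # wait for every queued file to finish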