
BACKGROUND: I have a huge .txt file which I have to process for a data mining project. I've split it into many .txt files, each 100MB in size, saved them all in the same directory, and managed to run them this way:

import os
from multiprocessing.dummy import Pool

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        process(filename)
    else:
        continue

In process(), I parse the file into a list of objects and then apply another function to them. This is SLOWER than running the whole file AS IS. But for big enough files I won't be able to run everything at once and will have to split. So I want to use threads, as I don't have to wait for each process(filename) call to finish.
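For illustration only, process() is roughly of this shape (a simplified sketch; the parsing function and the second step are placeholders, not my actual code):

def process(filename):
    objects = []
    with open(os.path.join(pathToFile, filename)) as f:
        for line in f:
            objects.append(parse_line(line))   # placeholder parser
    return apply_other_function(objects)       # placeholder second step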

How can I apply it? I've checked this but I didn't understand how to apply it to my code...

Any help would be appreciated. I looked here to see how to do this. What I've tried:

pool = Pool(6)
futures = []
for x in range(6):
    futures.append(pool.apply_async(process, (filename,)))

Unfortunately I realized it will only do the first 6 text files, or will it not? How can I make it work so that, as soon as a thread is done, it is assigned another text file and starts running?

EDIT:

for filename in os.listdir(pathToFile):
    if filename.endswith(".txt"):
        for x in range(6):
            pool.apply_async(process(filename))
    else:
        continue
  • pass all your filenames in the loop. 6 means that 6 files will be processed at the same time. But not sure you'll gain speed because of python GIL and threads. You should look at multiprocessing instead. – Jean-François Fabre Feb 01 '17 at 10:22
  • Are you talking about thread pools or process pools? – roganjosh Feb 01 '17 at 10:22
  • @roganjosh, it is the same program so it has to be threads, doesn't it? – Hertha BSC fan Feb 01 '17 at 10:23
  • @Jean-FrançoisFabre `from multiprocessing.dummy import Pool ` – Hertha BSC fan Feb 01 '17 at 10:23
  • No, you can spawn multiple processes using the [`multiprocessing`](https://docs.python.org/2/library/multiprocessing.html) module. As was said, the GIL in Python means that only one thread can ever execute code at once, so multithreading will not lead to any increase in speed. – roganjosh Feb 01 '17 at 10:24
  • @roganjosh yes, it is multiprocessing. But I don't know how to put it in my code as I am totally new to Python. – Hertha BSC fan Feb 01 '17 at 10:25
  • @Jean-FrançoisFabre How should I apply that? Look at the EDIT in my post please and just indicate if it's right. – Hertha BSC fan Feb 01 '17 at 10:33
  • It's hard to answer from your starting point, the two links aren't really related to what you're trying to do (the distinction between threads and processes is really important here). Firstly, I probably wouldn't use a `Pool` for I/O. What may be easier is to get a list of all files you want to read, chunk that list and give each chunk to a separate [`Process`](https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Process) – roganjosh Feb 01 '17 at 10:33
  • The edit, I think, is getting further off-course. I don't understand `for x in range(6):`. But what it seems you're doing is giving one file at a time to a process pool (which I imagine means they either fight for access, duplicating work or only one process runs) rather than dividing the work up so that each process can do its own thing such that you can combine the parts back at the end to obtain the full result. – roganjosh Feb 01 '17 at 10:39
  • @roganjosh `for x in range(6)` I think is meant to find a free pool worker waiting to be assigned a `.txt` to start running it. I am totally lost I think. – Hertha BSC fan Feb 01 '17 at 10:43
  • Does this help at all? http://stackoverflow.com/questions/23794207/multiprocess-multiple-files-in-a-list – roganjosh Feb 01 '17 at 10:49
  • @roganjosh, THANK YOU! Seems like a good link. If it takes me 1 minute to finish a 10MB text file, then if I split 100MB into 10MB files, this should finish in 1-2 minutes as well? – Hertha BSC fan Feb 01 '17 at 11:00
  • Ahh, this changes the nature of your problem a bit. If your numbers are anything like what you have in reality, then the bottleneck is not I/O bound (reading the file) and you can easily store 100MB in memory. You're not getting a speedup through splitting the files, but by delegating more processes into processing the data read from the file. Don't split your data needlessly. – roganjosh Feb 01 '17 at 11:06
  • @roganjosh, I have a 10GB file which I cannot fit in main memory. So I split it into 100MB files and process them one after another. But for now I am trying it on 100MB which is split into 10MB files. – Hertha BSC fan Feb 01 '17 at 11:08
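As a rough sketch of the chunk-per-`Process` idea mentioned in the comments above (the worker function, the chunking scheme and the process count here are assumptions, not taken from the thread):

import os
from multiprocessing import Process

def worker(filenames):
    # Each process handles its own chunk of files independently.
    for filename in filenames:
        process(filename)

if __name__ == "__main__":
    files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
    n_procs = 6
    # Split the file list into n_procs roughly equal chunks (by striding).
    chunks = [files[i::n_procs] for i in range(n_procs)]
    procs = [Process(target=worker, args=(chunk,)) for chunk in chunks if chunk]
    for p in procs:
        p.start()
    for p in procs:
        p.join()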

1 Answer


First, using multiprocessing.dummy will only give you a speed increase if your problem is I/O bound (i.e. reading the files is the main bottleneck). For CPU-intensive tasks (where processing the file is the bottleneck) it won't help; in that case you should use "real" multiprocessing.
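Both pool types expose the same interface, so switching between them is only a matter of which module Pool is imported from (a brief illustration, not part of the original answer):

# Thread-based pool: workers are threads; only helps when the work is mostly I/O.
from multiprocessing.dummy import Pool

# Process-based pool: same API, but workers are separate processes,
# so CPU-bound work runs in parallel despite the GIL.
# from multiprocessing import Pool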

The problem you describe seems better suited to one of the map functions of Pool:

import os
from multiprocessing import Pool

files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
pool = Pool(6)
results = pool.map(process, files)
pool.close()

This will use 6 worker processes to process the list of files and return a list of the return values of the process() function after all files have been processed. Your current example would submit the same file 6 times.
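One detail not covered in the answer: with the process-based Pool, the setup should sit under an if __name__ == "__main__": guard so it also works on platforms that spawn fresh interpreters for the workers (e.g. Windows). A minimal, self-contained version might look like this (the directory name and the body of process() are placeholders):

import os
from multiprocessing import Pool

def process(filename):
    # placeholder: parse the file and return whatever result is needed
    return filename

if __name__ == "__main__":
    pathToFile = "data"   # assumed directory name
    files = [f for f in os.listdir(pathToFile) if f.endswith(".txt")]
    pool = Pool(6)
    results = pool.map(process, files)
    pool.close()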

mata
  • Nice, simple answer. Don't you have to `close()` and `join()` the pool to access the results? – roganjosh Feb 01 '17 at 10:55
  • I don't have a list of files. I am using `for filename in os.list...` to access all the `.txt` files in a specific folder. – Hertha BSC fan Feb 01 '17 at 11:05
  • @roganjosh no, you don't _have_ to use `join()` when using `map()` because when it returns all workers have already completed their tasks. Calling `close()` allows the workers to terminate, so that's good practice, thx for the hint. – mata Feb 01 '17 at 11:07
  • @HerthaBSCfan `files` is a [_list comprehension_](http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/) that is giving you a list of file names. – roganjosh Feb 01 '17 at 11:07
  • @roganjosh :( my program now is not finishing. Without pool it runs for 20 minutes. With pool its been running for an hour now and still running... – Hertha BSC fan Feb 01 '17 at 12:09
  • @HerthaBSCfan did it ever finish? If not and you are unable to find out why, it might be appropriate to raise as another question with `process` function shown, or at least something that does something similar that makes it reproducible. – roganjosh Feb 01 '17 at 16:33
  • Yea, without knowing what exactly the process function is doing it's hard to say more. – mata Feb 01 '17 at 16:42