
I have a file that I want to process in Python. Each line in this file is a path to an image, and I would like to call a feature extraction algorithm on each image.

I would like to divide the file into smaller chunks and process each chunk in a separate, parallel process. What are the good state-of-the-art libraries or solutions for this kind of multiprocessing in Python?

Rami
  • https://docs.python.org/2/library/multiprocessing.html – Anentropic Oct 29 '14 at 17:47
  • Python has a Global Interpreter Lock (https://wiki.python.org/moin/GlobalInterpreterLock), which means threading is only useful when each thread spends time waiting (such as for a response from a server); to actively do work in parallel you need multiprocessing – Anentropic Oct 29 '14 at 17:49
  • Thanks Anentropic, I will check your links. I guess I need to divide the data (the file) explicitly and pass each chunk as an argument to a function then. – Rami Oct 29 '14 at 17:52
  • @Anentropic: that is not true if the computation functions can release the GIL (e.g., functions from the numpy, lxml, and regex modules can release the GIL and run in parallel without multiple processes). Here's [code example (`ctypes` releases GIL before calling C functions)](https://gist.github.com/zed/f199d5a0c453be2e9681). – jfs Oct 29 '14 at 18:06
  • true, just as a general guideline you need to be aware of the limitations of the GIL though – Anentropic Oct 29 '14 at 18:11
  • why do you think running the code in parallel would make your code faster? [The same questions from the comments apply in your case](http://stackoverflow.com/q/26636394/4279) e.g., how fast is your disk compared to how fast your Python code can process the data? – jfs Oct 29 '14 at 18:12
  • A process like this is probably going to be I/O bound, so the GIL likely won't have much impact. Naturally, the pieces of the file will all need to be on separate spindles to get any benefit, though. – kindall Oct 29 '14 at 18:12
  • @kindall: whether or not it is I/O-bound depends how much processing each chunk of data may require. – jfs Oct 29 '14 at 18:14
  • @J.F. Sebastian, thank you. Actually, each line of the file I am reading contains a path to an image, and I need to run a feature extraction algorithm on each image. I have about 50,000 images (so my file contains 50,000 lines) and I would like to process these images in parallel to save time. So, in fact, I am not really processing the file itself; I am using it to read paths to images and then calling a function (or a binary file, as in the sketch after these comments) to process the image. – Rami Oct 29 '14 at 18:17
  • The feature extraction algorithm itself will not occupy much memory, but it is slow, as it applies several processing steps to each image. So I guess doing that in parallel will be much faster. Maybe I wasn't clear enough in my question. – Rami Oct 29 '14 at 18:20
  • Question is edited for clarification. – Rami Oct 29 '14 at 18:24
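
A minimal sketch of the setup described in the comments above, assuming the feature extractor is an external binary (the name extract_features is a placeholder, not the OP's actual tool): calls like subprocess.check_output() block in C code and release the GIL while the child process runs, so a plain thread pool can keep several extractions going at once.

import subprocess

def process_image(filename):
    # Hypothetical: "extract_features" stands in for the user's external
    # feature-extraction binary. While the child process runs, this thread
    # releases the GIL, so other worker threads can proceed in parallel.
    return subprocess.check_output(['extract_features', filename])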

1 Answer


Your description suggests that a simple thread (or process) pool would work:

#!/usr/bin/env python
from multiprocessing.dummy import Pool # thread pool
from tqdm import tqdm # $ pip install tqdm # simple progress report

def mp_process_image(filename):
    # wrap the user's process_image() so one bad image reports an error
    # instead of killing the whole run
    try:
        return filename, process_image(filename), None
    except Exception as e:
        return filename, None, str(e)

def main():
    pool = Pool() # number of threads equal to number of CPUs
    with open('image_paths.txt') as file:
        # consider every non-blank line in the input file to be an image path
        image_paths = (line.strip() for line in file if line.strip())
        it = pool.imap_unordered(mp_process_image, image_paths, chunksize=100)
        for filename, result, error in tqdm(it):
            if error is not None:
                print(filename, error)

if __name__ == "__main__":
    main()

I assume that process_image() is CPU-bound and that it releases the GIL, i.e., it does the main job in a C extension such as OpenCV. If process_image() doesn't release the GIL, then remove the word .dummy from the Pool import to use processes instead of threads, as sketched below.
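
For reference, a minimal sketch of that process-pool variant (only the import changes; note that with real processes the worker function must live at module top level so it can be pickled, and process_image() is still the user's function, assumed defined elsewhere):

#!/usr/bin/env python
from multiprocessing import Pool # process pool: workers bypass the GIL

def mp_process_image(filename):
    # defined at module top level so multiprocessing can pickle it
    try:
        return filename, process_image(filename), None # user's function
    except Exception as e:
        return filename, None, str(e)

def main():
    pool = Pool() # number of worker processes defaults to the CPU count
    with open('image_paths.txt') as file:
        image_paths = (line.strip() for line in file if line.strip())
        it = pool.imap_unordered(mp_process_image, image_paths, chunksize=100)
        for filename, result, error in it:
            if error is not None:
                print(filename, error)

if __name__ == "__main__":
    main()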

jfs
  • Thanks Sebastian for the elegant solution. My last question: how can I collect the results in order (the result of the first chunk, then the 2nd chunk, etc.)? – Rami Nov 06 '14 at 10:19
  • 1
    @Rami: If you need the results in order then use `imap()` instead of `imap_unordered()`. – jfs Dec 01 '14 at 17:19
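
A minimal illustration of that change, reusing the names from main() in the answer above: imap() yields results in the order of the input iterable, at the cost of buffering results that finish out of order.

# inside main() from the answer above: imap() preserves input order
it = pool.imap(mp_process_image, image_paths, chunksize=100)
for filename, result, error in tqdm(it):
    if error is not None:
        print(filename, error)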