
I have a file that I want to process in Python. Each line in this file is a path to an image, and I would like to call a feature extraction algorithm on each image.

I would like to divide the file into smaller chunks and process each chunk in a separate, parallel process. What are the good state-of-the-art libraries or solutions for this kind of multiprocessing in Python?

Rami
  • https://docs.python.org/2/library/multiprocessing.html – Anentropic Oct 29 '14 at 17:47
  • Python has a Global Interpreter Lock (https://wiki.python.org/moin/GlobalInterpreterLock), which means threading is only useful when each thread spends time waiting (such as for a response from a server); to actively do work in parallel you need multiprocessing – Anentropic Oct 29 '14 at 17:49
  • Thanks Anentropic, I will check your links. I guess I need to divide the data (the file) explicitly and pass each chunk as an argument to a function then. – Rami Oct 29 '14 at 17:52
  • @Anentropic: that is not true if the computation functions can release the GIL (e.g., functions from the numpy, lxml, and regex modules can release the GIL and run in parallel without multiple processes). Here's [code example (`ctypes` releases GIL before calling C functions)](https://gist.github.com/zed/f199d5a0c453be2e9681). – jfs Oct 29 '14 at 18:06
  • true, just as a general guideline you need to be aware of the limitations of the GIL though – Anentropic Oct 29 '14 at 18:11
  • why do you think running the code in parallel would make your code faster? [The same questions from the comments apply in your case](http://stackoverflow.com/q/26636394/4279) e.g., how fast is your disk compared to how fast your Python code can process the data? – jfs Oct 29 '14 at 18:12
  • A process like this is probably going to be I/O bound, so the GIL likely won't have much impact. Naturally, the pieces of the file will all need to be on separate spindles to get any benefit, though. – kindall Oct 29 '14 at 18:12
  • @kindall: whether or not it is I/O-bound depends how much processing each chunk of data may require. – jfs Oct 29 '14 at 18:14
  • @J.F. Sebastian, thank you. Actually, each line of the file I am reading contains a path to an image, and I need to run a feature extraction algorithm on each image. I have about 50,000 images (so my file contains 50,000 lines) and I would like to process these images in parallel to save time. So, in fact, I am not really processing the file itself; I am using it to read paths to images and then calling a function (or a binary file, as in the sketch after these comments) to process the image. – Rami Oct 29 '14 at 18:17
  • The feature extraction algorithm itself will not occupy much memory, but it is slow, as it applies several processing steps to each image. So I guess doing that in parallel will be much faster. Maybe I wasn't clear enough in my question. – Rami Oct 29 '14 at 18:20
  • Question is edited for clarification. – Rami Oct 29 '14 at 18:24
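
A minimal sketch of the setup described in the comments above, assuming the feature extractor is an external binary (the name extract_features is a placeholder, not the OP's actual tool): calls like subprocess.check_output() block in C code and release the GIL while the child process runs, so a plain thread pool can keep several extractions going at once.

import subprocess

def process_image(filename):
    # Hypothetical: "extract_features" stands in for the user's external
    # feature-extraction binary. While the child process runs, this thread
    # releases the GIL, so other worker threads can proceed in parallel.
    return subprocess.check_output(['extract_features', filename])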

1 Answer


Your description suggests that a simple thread (or process) pool would work:

#!/usr/bin/env python
from multiprocessing.dummy import Pool # thread pool
from tqdm import tqdm # $ pip install tqdm # simple progress report

def mp_process_image(filename):
    # wrap the user's process_image() so one bad image reports an error
    # instead of killing the whole run
    try:
        return filename, process_image(filename), None
    except Exception as e:
        return filename, None, str(e)

def main():
    pool = Pool() # number of threads equal to number of CPUs
    with open('image_paths.txt') as file:
        # consider every non-blank line in the input file to be an image path
        image_paths = (line.strip() for line in file if line.strip())
        it = pool.imap_unordered(mp_process_image, image_paths, chunksize=100)
        for filename, result, error in tqdm(it):
            if error is not None:
                print(filename, error)

if __name__ == "__main__":
    main()

I assume that process_image() is CPU-bound and that it releases the GIL, i.e., it does the main job in a C extension such as OpenCV. If process_image() doesn't release the GIL, then remove the word .dummy from the Pool import to use processes instead of threads, as sketched below.
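
For reference, a minimal sketch of that process-pool variant (only the import changes; note that with real processes the worker function must live at module top level so it can be pickled, and process_image() is still the user's function, assumed defined elsewhere):

#!/usr/bin/env python
from multiprocessing import Pool # process pool: workers bypass the GIL

def mp_process_image(filename):
    # defined at module top level so multiprocessing can pickle it
    try:
        return filename, process_image(filename), None # user's function
    except Exception as e:
        return filename, None, str(e)

def main():
    pool = Pool() # number of worker processes defaults to the CPU count
    with open('image_paths.txt') as file:
        image_paths = (line.strip() for line in file if line.strip())
        it = pool.imap_unordered(mp_process_image, image_paths, chunksize=100)
        for filename, result, error in it:
            if error is not None:
                print(filename, error)

if __name__ == "__main__":
    main()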

jfs
  • Thanks Sebastian for the elegant solution. My last question: how can I collect the results in order (the result of the first chunk, then the 2nd chunk, etc.)? – Rami Nov 06 '14 at 10:19
  • 1
    @Rami: If you need the results in order then use `imap()` instead of `imap_unordered()`. – jfs Dec 01 '14 at 17:19
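
A minimal illustration of that change, reusing the names from main() in the answer above: imap() yields results in the order of the input iterable, at the cost of buffering results that finish out of order.

# inside main() from the answer above: imap() preserves input order
it = pool.imap(mp_process_image, image_paths, chunksize=100)
for filename, result, error in tqdm(it):
    if error is not None:
        print(filename, error)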