
I am trying to write a Python class that reads a number of images in parallel using multiprocessing.Pool and threading.Lock. My approach is to create a pool of workers in which each worker reads one image and appends it to a member variable of list type. The class also provides a function to obtain the list once all the images have been read.

from multiprocessing import Pool
import threading

class ReadFilePool(object):
    # filenames contains a list of image absolute paths
    def __init__(self, filenames):
        self.filenames = filenames
        self.images = []
        self.lock = threading.Lock()
        self.pool = Pool(processes=len(self.filenames))
        self.pool.map(self.read_file, self.filenames)

    def read_file(self, filename):
        image = get_image(filename)
        self.lock.acquire()
        self.images.append(image)
        self.lock.release()

    def get_images(self):
        images = None
        self.lock.acquire()
        if len(self.filenames) == len(self.images):
            images = self.images
        self.lock.release()
        return images
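
Note that `Pool.map(self.read_file, ...)` has to pickle the bound method to send it to the worker processes, and with it the whole instance, including `self.lock`. A `threading.Lock` cannot be pickled, which can be reproduced in isolation:

```python
import pickle
import threading

# A threading.Lock cannot cross a process boundary via pickle,
# so any instance holding one fails when Pool.map tries to send it.
try:
    pickle.dumps(threading.Lock())
except TypeError as exc:
    # e.g. "cannot pickle '_thread.lock' object" (wording varies by version)
    print(exc)
```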

Then I loop, checking whether get_images() returns something other than None, and process the images once it does, e.g.

images = []
completed = False
pool = ReadFilePool(filenames)
while not completed:
    images = pool.get_images()
    completed = (images is not None)
# ...process the images

I tried the following approaches, but I still get pickle errors like TypeError: can't pickle _thread.lock objects:

Approach 1: __setstate__ and __getstate__

Approach 2: __call__

Unfortunately I am not too familiar with Python multithreading and Lock, and I have run into a few pickle-related errors. Please kindly suggest the correct way to use these classes.
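
For reference, Approach 1 amounts to excluding the unpicklable members from the pickled state and recreating them on the other side (a sketch of that idea, not the original attempt):

```python
import pickle
import threading

class ReadFilePool(object):
    def __init__(self, filenames):
        self.filenames = filenames
        self.images = []
        self.lock = threading.Lock()

    def __getstate__(self):
        # Drop the unpicklable Lock before pickling for the child process
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore attributes and create a fresh Lock in the child process
        self.__dict__.update(state)
        self.lock = threading.Lock()

# The instance now survives a pickle round-trip
restored = pickle.loads(pickle.dumps(ReadFilePool(['a.png'])))
```

Even when pickling succeeds, though, each worker process receives its own copy of the instance, so `self.images.append(...)` in a child never updates the parent's list.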

chesschi
  • I don't think you want to use `multiprocessing` for this. The overhead of forking all of your processes (assuming linux) is going to likely take more time than just opening the files and reading the first line sequentially. Maybe threading could help if you're IO-bound -- In that case, I'd look into using [`concurrent.futures.ThreadPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#executor-objects) to create a thread pool. – mgilson Jan 04 '18 at 04:02
  • @mgilson Sorry I also need to do the same thing for reading an image as well which takes about 0.06 seconds for one image. Does my approach look okay if it applies to an image? I have updated the post to mention images now. – chesschi Jan 04 '18 at 04:11
  • images, text files, it doesn't really matter. The real problem here is that you're confusing threads with the processes created by multiprocessing. A `threading.lock` does nothing useful in a multiprocessing `Process` (unless that `Process` also managed to spawn more threads I guess...). Which parallelization strategy you pick is very problem dependent. With lots of IO, threads are likely a good choice, but you never really know until you measure. That's why `concurrent.futures` provides a (fairly) unified interface to both APIs -- You can experiment and see how it works out :-). – mgilson Jan 04 '18 at 04:15
  • @mgilson have created another [thread](https://stackoverflow.com/questions/48144461/python-what-is-the-most-efficient-way-to-randomly-read-a-large-number-of-images?noredirect=1#comment83267239_48144461) which uses `concurrent.futures` but found that random multi-thread read (approach 4) takes longer than random read (approach 2). Please can you give me some comments? Many thanks! – chesschi Jan 08 '18 at 06:17
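
A minimal version of the `concurrent.futures.ThreadPoolExecutor` approach suggested in the comments (a sketch; the `get_image` here is a stand-in that just reads raw bytes, where a real loader would decode the image):

```python
from concurrent.futures import ThreadPoolExecutor

def get_image(filename):
    # Stand-in loader: real code would decode the image (e.g. with PIL/OpenCV)
    with open(filename, 'rb') as f:
        return f.read()

def read_images(filenames, max_workers=8):
    # Threads share the parent's memory, so nothing needs to be pickled
    # and no Lock or polling loop is required; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(get_image, filenames))
```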

0 Answers