
I'd like to concurrently load multiple grayscale images from files on disk and put them inside a large numpy array in order to speed up loading time. Basic code looks like this:

import numpy as np
import matplotlib.pyplot as plt

# prepare filenames
image_files = ...
mask_files = ...
n_samples = len(image_files)  # == len(mask_files)

# preallocate space
all_images = np.empty(shape=(n_samples, IMG_HEIGHT, IMG_WIDTH), dtype=np.float32)
all_masks = np.empty(shape=(n_samples, IMG_HEIGHT, IMG_WIDTH), dtype=np.float32)

# read images and masks
for sample, (img_path, mask_path) in enumerate(zip(image_files, mask_files)):
    all_images[sample, :, :] = plt.imread(img_path)
    all_masks[sample, :, :] = plt.imread(mask_path)

I'd like to execute this loop in parallel; however, I know that Python's true multithreading capabilities are limited by the GIL.

Do you have any ideas?

pzelasko

1 Answer


You can try using one thread for the images and one for the masks:

import numpy as np
import matplotlib.pyplot as plt
from threading import Thread

# threading functions: each one iterates over only the files it needs
def readImg(image_files):
    for sample, img_path in enumerate(image_files):
        all_images[sample, :, :] = plt.imread(img_path)

def readMask(mask_files):
    for sample, mask_path in enumerate(mask_files):
        all_masks[sample, :, :] = plt.imread(mask_path)


# prepare filenames
image_files = ...
mask_files = ...
n_samples = len(image_files)  # == len(mask_files)

# preallocate space
all_images = np.empty(shape=(n_samples, IMG_HEIGHT, IMG_WIDTH), dtype=np.float32)
all_masks = np.empty(shape=(n_samples, IMG_HEIGHT, IMG_WIDTH), dtype=np.float32)

# threading stuff
image_thread = Thread(target=readImg, args=[image_files])
mask_thread = Thread(target=readMask, args=[mask_files])

image_thread.daemon = True
mask_thread.daemon = True

image_thread.start()
mask_thread.start()

# join both threads, otherwise the daemon threads may be killed
# when the main thread exits before loading has finished
image_thread.join()
mask_thread.join()

WARNING: Do not copy this code verbatim. I have not tested it; it is just to get the gist of the approach.

This will not use multiple cores, and the files will not be read in the same linear order as in your code above. If you want that, you have to use a Queue implementation; a sketch follows. Although I assume this is not what you want, since you said you wanted concurrency and were aware of the interpreter lock on Python threads.
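For reference, here is a minimal, untested sketch of such a queue-based version. It assumes the preallocated arrays and file lists from the question; each task carries its own row index, so the results land in the right rows regardless of scheduling, and the worker count of 4 is an arbitrary choice.

import queue
from threading import Thread

def worker(q):
    # each task carries its destination array, row index and path,
    # so the order in which threads pick up tasks does not matter
    while True:
        try:
            dest, idx, path = q.get_nowait()
        except queue.Empty:
            return
        dest[idx, :, :] = plt.imread(path)

q = queue.Queue()
for i, (img_path, mask_path) in enumerate(zip(image_files, mask_files)):
    q.put((all_images, i, img_path))
    q.put((all_masks, i, mask_path))

threads = [Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()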

Edit: As per your comment, see this post regarding using multiple cores: Multiprocessing vs Threading Python. To make the change in the above example, simply use the line

from multiprocessing import Process as Thread

They share a similar API.
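For illustration, a sketch of that swap using the worker functions above (untested, and with a caveat the original answer does not mention: each process has its own address space, so the children fill their own copies of the arrays rather than the parent's, which is why shared memory comes up in the comments below).

from multiprocessing import Process

# NOTE: unlike threads, each Process gets its own copy of the module
# globals, so the parent's all_images / all_masks will NOT be filled
# by the children without shared memory
image_proc = Process(target=readImg, args=[image_files])
mask_proc = Process(target=readMask, args=[mask_files])

image_proc.start()
mask_proc.start()
image_proc.join()
mask_proc.join()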

Pax Vobiscum
  • As you noticed, your solution will execute on a single core only, hence I wouldn't expect any improvement over the single-threaded code. However, if there were some library able to execute similar code in native threads (bypassing the GIL), that would be pretty close to what I wanted. – pzelasko Jul 18 '16 at 09:45
  • After reading your link, I've found [a question about sharing numpy arrays between processes](http://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-for-multiprocessing), so I think that combining both answers may help me resolve my issue. I'll look into it later. Thanks! – pzelasko Jul 18 '16 at 10:41
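Sketching what that combination might look like (an untested sketch following the shared-memory approach from the linked question, not code from either post; it assumes a fork-based platform such as Linux and the n_samples, IMG_HEIGHT, IMG_WIDTH values from the question):

import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Array, Process

# lock-free shared float32 buffers; each worker writes disjoint rows
shared_images = Array('f', n_samples * IMG_HEIGHT * IMG_WIDTH, lock=False)
shared_masks = Array('f', n_samples * IMG_HEIGHT * IMG_WIDTH, lock=False)

def fill(shared_buf, paths):
    # view the shared buffer as a numpy array without copying
    arr = np.frombuffer(shared_buf, dtype=np.float32).reshape(
        n_samples, IMG_HEIGHT, IMG_WIDTH)
    for i, path in enumerate(paths):
        arr[i, :, :] = plt.imread(path)

procs = [Process(target=fill, args=(shared_images, image_files)),
         Process(target=fill, args=(shared_masks, mask_files))]
for p in procs:
    p.start()
for p in procs:
    p.join()

# the parent can now view the same shared memory as regular arrays
all_images = np.frombuffer(shared_images, dtype=np.float32).reshape(
    n_samples, IMG_HEIGHT, IMG_WIDTH)
all_masks = np.frombuffer(shared_masks, dtype=np.float32).reshape(
    n_samples, IMG_HEIGHT, IMG_WIDTH)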