
I would like to have several processes, each loading different images one at a time and performing inference (for example, with VGG16).

I am using Keras with the TensorFlow backend and a single GPU (GTX 1070). The code is as follows:


import os
from os.path import isfile, join
import time
from multiprocessing import Process, Queue

import numpy as np
from PIL import Image
import tensorflow as tf
from keras.applications.vgg16 import VGG16
from keras.backend.tensorflow_backend import set_session



test_path = 'test path to images ...'
output = Queue()

def worker(file_names, output):
    # Limit each process to a quarter of the GPU memory so that several
    # TensorFlow sessions can coexist on the single GPU.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.25
    config.gpu_options.visible_device_list = "0"
    set_session(tf.Session(config=config))
    inference_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')
    model_image_size = (224,224)
    times = []

    for file_name in file_names:
        image = Image.open(os.path.join(test_path, file_name))
        im_width = image.size[0]
        im_height = image.size[1]
        m = (im_width - im_height) // 2
        image = image.crop((m, 0, im_width - m, im_height))  # centre-crop to a square
        image = image.resize(model_image_size, Image.BICUBIC)
        image = np.array(image, dtype='float32')
        image /= 255.
        image = np.expand_dims(image, 0)  # Add batch dimension.
        start = time.time()
        res = inference_model.predict(image)
        end = time.time()
        elapsed_time = end - start
        print("elapsed time", elapsed_time)
        times.append(elapsed_time)
    average_time = np.mean(times[2:])  # skip the first calls, which include warm-up overhead
    print("average time ", average_time)

if __name__ == '__main__':
    file_names = [f for f in os.listdir(test_path) if isfile(join(test_path, f))]
    file_names.sort()
    num_workers = 3
    # Split the file list round-robin across the worker processes.
    processes = [Process(target=worker, args=(file_names[x::num_workers], output)) for x in range(num_workers)]

    for p in processes:
        p.start()
    for p in processes:
        p.join()





I have noticed that the per-image inference times are slower with multiple processes than with a single process. For example, with a single process the inference time per image is 0.012 sec, and I would expect roughly the same result when running 3 processes; instead, the average inference time per image is almost 0.02 sec. What could be the reason for that (maybe CUDA context switching?), and is there a way to solve it?

marina
  • Profile the code. See what's taking the time. Run larger, more realistic jobs. The overhead of parallelism is usually non-zero, so you need to do a substantial amount of work to benefit (see the profiling sketch after these comments). – John Zwinck May 07 '19 at 11:06
  • You mean profiling the code to get the internal profile of the Keras predict function? That is the only function I am measuring. – marina May 07 '19 at 11:10
  • Yes. But first you need to run a larger, more realistic job. – John Zwinck May 07 '19 at 11:12
  • By realistic, do you mean more images? I have currently tested on ~1000 images, and the time measurement is an average. If I run, for example, 5 processes, the times get even slower. – marina May 07 '19 at 11:36
  • If the total time taken is 0.01 seconds, that's not a useful test, because nobody cares about programs that run that fast. People care about speeding up programs that take many minutes or hours. If your real application takes 0.01 seconds, you should stop optimizing now. – John Zwinck May 07 '19 at 11:37
  • These are the relevant time scales, as we eventually do real-time processing on multiple streams of images (let's say I would like to process at least 2 images per second, and I would like to add more classifiers later on). The problem is that even though the performance is good enough for one process, it becomes too slow when running, for example, 10 processes. – marina May 07 '19 at 11:46
  • I've just seen NVIDIA's "Multi-Process Service", which seems to be suggested for this use case. It's been a few years since I've used CUDA, and everything was serialised back then. I'd still be tempted to just do the image loading/scaling in parallel and serialise access to the GPU; there's probably enough work there to keep one GPU busy (see the producer/consumer sketch after these comments). – Sam Mason May 09 '19 at 15:03
  • https://stackoverflow.com/a/34711344/1358308 is an interesting read! – Sam Mason May 09 '19 at 15:10
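Following up on the profiling suggestion in the first comment: below is a minimal sketch, not part of the original question, of how the predict() call could be profiled with cProfile. The zero-filled dummy input and the iteration count are illustrative assumptions; the first call is excluded because it includes graph construction and cuDNN initialisation.

import cProfile
import pstats

import numpy as np
from keras.applications.vgg16 import VGG16

model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')
dummy = np.zeros((1, 224, 224, 3), dtype='float32')  # illustrative dummy input

model.predict(dummy)  # warm-up call, includes graph/cuDNN initialisation, not profiled

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    model.predict(dummy)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)

Note that cProfile only shows the Python-side breakdown; time spent inside TensorFlow's C++/CUDA kernels appears under the single session-run call made by predict.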
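And here is a minimal sketch of the approach suggested in the last comments: do the CPU-bound loading, cropping and resizing in several processes, but keep all GPU work in a single process so that only one CUDA context exists. The preprocess() helper, the bounded queue size and the sentinel handling are my own assumptions, not part of the original code.

import os
from os.path import isfile, join
from multiprocessing import Process, Queue

import numpy as np
from PIL import Image

test_path = 'test path to images ...'  # placeholder path, as in the question
model_image_size = (224, 224)

def preprocess(file_names, queue):
    # CPU-only worker: load, centre-crop, resize and normalise images.
    for file_name in file_names:
        image = Image.open(os.path.join(test_path, file_name))
        im_width, im_height = image.size
        m = (im_width - im_height) // 2
        image = image.crop((m, 0, im_width - m, im_height))
        image = image.resize(model_image_size, Image.BICUBIC)
        queue.put(np.expand_dims(np.array(image, dtype='float32') / 255., 0))
    queue.put(None)  # sentinel: this worker is done

if __name__ == '__main__':
    file_names = sorted(f for f in os.listdir(test_path) if isfile(join(test_path, f)))
    num_workers = 3
    queue = Queue(maxsize=32)  # bounded so the producers cannot outrun the GPU
    producers = [Process(target=preprocess, args=(file_names[x::num_workers], queue))
                 for x in range(num_workers)]
    for p in producers:
        p.start()

    # Import Keras/TensorFlow only after forking, so the producers never touch the GPU.
    from keras.applications.vgg16 import VGG16
    model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3), pooling='avg')

    finished = 0
    while finished < num_workers:
        item = queue.get()
        if item is None:
            finished += 1
            continue
        model.predict(item)  # only this process ever creates a CUDA context

    for p in producers:
        p.join()

This avoids having several CUDA contexts compete for the same device; NVIDIA's Multi-Process Service, mentioned in the comment above, is the other option when several processes genuinely have to share the GPU.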

0 Answers