How to do distributed prediction / inferencing with Tensorflow

Question

I want to run distributed prediction on my GPU cluster using TF 2.0. I trained a CNN made with Keras using MirroredStrategy and saved it. I can load the model and use .predict() on it, but I was wondering if this automatically does distributed prediction using available GPUs. If not, how can I run distributed prediction to speed up inference and use all available GPU memory?

At the moment, when running many large predictions, I exceed the memory of one of my GPUs (the job needs 17 GB, the GPU has 12 GB), and inference fails with an out-of-memory error:

Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.12GiB

but I have multiple GPUs and would like to use their memory as well. Thanks.

score 1 · Answer 1 · answered Feb 02 '21 at 18:58

I was able to piece together single-worker, multi-GPU prediction as follows (consider it a sketch - it uses plumbing code that's not generally applicable, but should give you a template to go off of):

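import numpy as np
import tensorflow as tf

# util, dataset, wrap_model, round_ious and write_ious are project-specific helpers (not shown).
# https://github.com/tensorflow/tensorflow/issues/37686
# https://www.tensorflow.org/tutorials/distribute/custom_training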
def compute_and_write_ious_multi_gpu(path: str, filename_csv: str, include_sampled: bool):
    strategy = tf.distribute.MirroredStrategy()
    util.log('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    (ds, s, n) = dataset(path, shuffle=False, repeat=False, mask_as_input=True)
    dist_ds = strategy.experimental_distribute_dataset(ds)

    def predict_step(inputs):
        images, labels = inputs
        return model(images, training=False)

    @tf.function
    def distributed_predict_step(dataset_inputs):
        # Runs predict_step on every replica; the result still has to be unwrapped (see unwrap below)
        per_replica_preds = strategy.run(predict_step, args=(dataset_inputs,))
        return per_replica_preds

    # https://stackoverflow.com/questions/57549448/how-to-convert-perreplica-to-tensor
    def unwrap(per_replica):  # -> list of numpy arrays, one per replica
        # experimental_local_results returns a tuple of per-replica tensors and also
        # handles the single-replica case, where strategy.run returns a plain tensor
        return [x.numpy() for x in strategy.experimental_local_results(per_replica)]

    with strategy.scope():
        model = wrap_model()

    util.log(f'Starting distributed prediction for {filename_csv}')
    # Run every distributed batch and flatten the per-replica arrays into one list
    # https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
    ious = [arr for x in dist_ds for arr in unwrap(distributed_predict_step(x))]
    util.log(f'Distributed prediction done for {filename_csv}')
    ious = np.concatenate(ious).ravel().tolist()
    ious = round_ious(ious)
    ious = list(zip(ious, ds.all_image_paths))
    ious.sort()
    write_ious(ious, filename_csv, include_sampled)
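
For anyone who wants to try the pattern without my helper functions, here is a stripped-down sketch of the same idea. The toy one-layer model, the random-free dummy data and the per-replica batch size of 2 are all assumptions made for illustration; only the strategy calls mirror the function above. On a machine without GPUs, MirroredStrategy should fall back to a single device, so the sketch still runs.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Build (or load) the model inside the strategy scope so its variables are mirrored.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])  # hypothetical toy model standing in for the real CNN

# Dummy inputs; the batch size passed to the dataset is the global batch,
# which gets split across the replicas.
features = np.arange(40, dtype=np.float32).reshape(10, 4)
global_batch = 2 * strategy.num_replicas_in_sync
ds = tf.data.Dataset.from_tensor_slices(features).batch(global_batch)
dist_ds = strategy.experimental_distribute_dataset(ds)


def predict_step(images):
    return model(images, training=False)


@tf.function
def distributed_predict_step(batch):
    return strategy.run(predict_step, args=(batch,))


chunks = []
for batch in dist_ds:
    per_replica = distributed_predict_step(batch)
    # One tensor per local replica, in replica order.
    for tensor in strategy.experimental_local_results(per_replica):
        chunks.append(tensor.numpy())

predictions = np.concatenate(chunks)
print(predictions.shape)
```

Note that this is data parallelism: each GPU holds a full copy of the model and receives a slice of every global batch, so the memory needed per GPU depends on the per-replica batch size rather than on the total number of GPUs.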

The multi-GPU code does distribute the load across the GPUs, but unfortunately makes very poor use of them - in my particular case the corresponding single-GPU code runs in ~12 hours, and this runs in 7.7 hours, so not even a 2x speedup despite having 8x the number of GPUs.
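
To put numbers on that, here is the arithmetic as a tiny script. It uses only the figures quoted above (roughly 12 hours on one GPU, 7.7 hours on 8 GPUs); "parallel efficiency" is simply the speedup divided by the number of GPUs.

```python
single_gpu_hours = 12.0  # "~12 hours" for the single-GPU run
multi_gpu_hours = 7.7    # the 8-GPU run
num_gpus = 8

speedup = single_gpu_hours / multi_gpu_hours
efficiency = speedup / num_gpus

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# prints "speedup: 1.56x, parallel efficiency: 19%"
```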

I think it's mostly a data feeding issue, but I don't know how to fix it. Hopefully someone else can provide some better insights?

With distributed training you need to multiply the size of the prefetched data by the number of GPUs that you're using. Would that help? — Paul, Jul 08 '22 at 14:35
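
One possible reading of this suggestion, as a sketch rather than a confirmed fix: size the batch as a global batch (per-replica batch times the number of replicas) and let tf.data prefetch ahead of the GPUs. The per-replica batch size of 32 and the dummy range dataset are hypothetical, and `tf.data.AUTOTUNE` assumes TF 2.4 or newer (older 2.x versions expose it as `tf.data.experimental.AUTOTUNE`).

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

PER_REPLICA_BATCH = 32  # hypothetical value; tune to what fits on one GPU
global_batch = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

ds = (
    tf.data.Dataset.range(1000)  # stand-in for the real image dataset
    # Do per-element preprocessing in parallel on the CPU.
    .map(lambda i: tf.cast(i, tf.float32), num_parallel_calls=tf.data.AUTOTUNE)
    # Batch by the global size; the strategy splits each batch across replicas.
    .batch(global_batch)
    # Prepare upcoming batches while the GPUs work on the current one.
    .prefetch(tf.data.AUTOTUNE)
)
dist_ds = strategy.experimental_distribute_dataset(ds)
```

Whether this closes the gap depends on where the time actually goes; if decoding or disk reads are the bottleneck, the `map` stage is probably the first place to look.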
