Is there a way to run this task in a parallel mode so that it is faster?

Question

Here is my code. I'm applying the function to a large dataset (37k rows) and I want to run it on multiple threads, or find any other way to make it faster. I've tested the Spark and Dask libraries, but I ran into errors that I couldn't solve. If you have any ideas, that would be great.

import time

import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch

# get_caption, device and dg are not shown in this snippet


def caption_from_image_file(x):
    return [str(get_caption(i, device)) for i in x.load()]


df = dg.getData("train")

df_test = df

# start timer
start_time = time.time()

df_test['captions'] = df_test.images.apply(caption_from_image_file)

# end timer (in minutes)
print("--- %s minutes ---" % ((time.time() - start_time) / 60))

df_test.to_csv('test.csv', index=False)

# free up cuda memory
torch.cuda.empty_cache()

df_test.captions

It's hard to help without knowing what `x.load()` is, what `get_caption` is, etc. — AKX, Feb 04 '23 at 20:01
Welcome to StackOverflow. Please read https://stackoverflow.com/help/minimal-reproducible-example and https://stackoverflow.com/help/how-to-ask — Jérôme Richard, Feb 04 '23 at 20:05
The get_caption function takes an image and a PyTorch device as input and generates a caption for the image using a model. The input image is preprocessed and then passed to the model to generate the caption. x.load() returns a list of images as NumPy arrays. — AymaneElmahi, Feb 04 '23 at 20:18

itIsNaz · Answer 1 · 2023-02-04T21:24:58.297

Welcome to the Stack Overflow community. Please take @Jérôme's comment into consideration. I see that you have defined the following function:

def caption_from_image_file(x):
    return [str(get_caption(i,device)) for i in x.load()]

With this approach, the for loop is what prevents any parallel processing: it has to go through the arrays in your generated list one by one, calling get_caption on each in turn.

Note that this is the part of your code that you have to work on. Unfortunately, parallelisation is not yet implemented in pandas, so apply itself also processes the rows sequentially. I advise you to take a look at this thread, which has been open since 2013: https://github.com/pandas-dev/pandas/issues/5751

I also advise you to look at the Python threading documentation to help you rework your function: https://docs.python.org/3/library/threading.html

This link can help you too: multithreading for data from dataframe pandas
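
To make the threading suggestion concrete, here is a minimal sketch using ThreadPoolExecutor from the standard library's concurrent.futures module instead of raw threads. The names parallel_map, fake_caption and caption_row are hypothetical and only for illustration: fake_caption stands in for your get_caption(i, device), which I can't see, and the strings stand in for your image arrays. Whether threads actually speed things up depends on what get_caption does. They can help when the work releases the GIL (I/O, or heavy native computation), but if everything runs on a single GPU, the gain may be small.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_caption(image):
    # Hypothetical stand-in for get_caption(image, device) from the question.
    return f"caption for {image}"


def caption_row(images):
    # Same shape as caption_from_image_file: one list of captions per row.
    return [str(fake_caption(i)) for i in images]


def parallel_map(items, func, max_workers=4):
    # Run func on every item using a pool of worker threads.
    # executor.map yields results in input order, so the output list
    # lines up with the input rows.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


# Each inner list plays the role of what x.load() returns for one row.
rows = [["img1", "img2"], ["img3"]]
captions = parallel_map(rows, caption_row)
print(captions)
```

With your dataframe, the equivalent call would be along the lines of df_test['captions'] = parallel_map(df_test.images, caption_from_image_file). Since the results come back in the same order as the rows, assigning the list to the column keeps every caption on its own row. Start with a small max_workers and measure, because several threads sharing one model on one device can also run out of memory.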
