So here is my code. I'm applying the function below to a large dataset (37k rows) and I want to run it on multiple threads, or find any other way to make it faster. I've tried the Spark and Dask libraries, but I ran into errors that I couldn't solve. If you have any ideas, that would be great.
import torch  # needed for torch.cuda.empty_cache() further down
import matplotlib.pyplot as plt
def caption_from_image_file(x):
    # x.load() yields the images for one row; caption each of them
    return [str(get_caption(i, device)) for i in x.load()]
import cv2
import numpy as np
df = dg.getData("train")
df_test = df
# start timer
import time
start_time = time.time()
df_test['captions'] = df_test.images.apply(caption_from_image_file)
# end timer (in minutes)
print("--- %s minutes ---" % ((time.time() - start_time)/60))
df_test.to_csv('test.csv',index=False)
# free up CUDA memory
torch.cuda.empty_cache()
df_test.captions
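
For reference, a minimal sketch of what a threaded version of the apply step could look like, using only the standard library. It assumes that get_caption is safe to call from several threads at once, which I have not verified. The names fake_caption, caption_row and parallel_map are hypothetical and exist only so the snippet runs on its own. Threads mostly help when the time goes into loading images from disk; if the model's forward pass on the GPU dominates, batching several images per call is likely the bigger win.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_caption(image):
    # Hypothetical stand-in for get_caption(image, device); swap in the real call.
    return f"caption for {image}"


def caption_row(images):
    # One row holds several images; caption each of them in order.
    return [str(fake_caption(i)) for i in images]


def parallel_map(items, func, max_workers=4):
    # executor.map returns results in input order, so they line up with the rows.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


# Toy data standing in for df_test.images (assumption: each row is a list of images).
rows = [["a.jpg", "b.jpg"], ["c.jpg"]]
captions = parallel_map(rows, caption_row)
```

With the real DataFrame this would presumably become df_test['captions'] = parallel_map(df_test.images, caption_from_image_file), since assigning a list of the same length as the DataFrame to a column works in pandas.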