I have a function that I want to apply to a pandas DataFrame in parallel:
```python
import multiprocessing
from multiprocessing import Pool
from collections import Counter
import numpy as np

def func1():
    # do some operations in serial
    return word_vector_matrix  # a numpy ndarray

def get_vector(text):
    vector = np.zeros(100,)
    for i in range(5):
        vector += word_vector_matrix[text.index(i)]
    return vector

def apply_get_vector(data):
    return data['text_column'].apply(get_vector)

if __name__ == '__main__':
    word_vector_matrix = func1()

    def parallelize_dataframe(data, func):
        num_cores = multiprocessing.cpu_count()
        num_partitions = num_cores  # num chunks = num cores
        chunks = np.array_split(data, num_partitions)
        pool = Pool(num_cores)
        word_vectors = np.concatenate(pool.map(func, chunks))
        pool.close()
        pool.join()
        return word_vectors

    vectors = parallelize_dataframe(df, apply_get_vector)
```
The problem is that I get an error:

```
NameError: name 'word_vector_matrix' is not defined
```
I thought this was strange, because I do define `word_vector_matrix` in `main`, but I never explicitly pass it as an argument to the `get_vector` function. Am I doing something wrong by not declaring the variable in the right scope, and if so, how do I correct it?
I did find this question, which explains how to use `.apply()` with multiple arguments, and I suspect something similar is needed here, but my case is slightly different because I am also using multiprocessing. If anyone could explain how to do this, I'd greatly appreciate it. Thanks.