
I have a function that I want to apply to a pandas DataFrame in parallel:

import multiprocessing
from multiprocessing import Pool
from collections import Counter
import numpy as np

def func1():
    # do some operations in serial
    return word_vector_matrix # a numpy ndarray

def get_vector(text):
    vector = np.zeros(100)
    for i in range(5):
        vector += word_vector_matrix[text.index(i)]
    return vector

def apply_get_vector(data):
    return data['text_column'].apply(get_vector)


if __name__ == '__main__':

    word_vector_matrix = func1()

    def parallelize_dataframe(data, func):
        num_cores = multiprocessing.cpu_count()
        num_partitions = num_cores # num chunks = num cores
        chunks = np.array_split(data, num_partitions)
        pool = Pool(num_cores)
        word_vectors = np.concatenate(pool.map(func, chunks))
        pool.close()
        pool.join()
        return word_vectors

    vectors = parallelize_dataframe(df, apply_get_vector)

The problem is that I get an error

>>> NameError: name 'word_vector_matrix' is not defined

I thought this was strange, because I do define word_vector_matrix in main; I just don't explicitly pass it as an argument to the get_vector function. Am I doing something wrong by not declaring the variable in the right scope, and if so, how do I correct it?
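One pattern I've come across for this kind of problem is passing the shared data to each worker through the Pool's initializer, so every child process gets its own module-level copy. Here's a minimal standalone sketch of that idea (the names `_matrix`, `row_sum`, etc. are mine, not from my actual code), in case it's the right direction:

```python
import numpy as np
from multiprocessing import Pool

# Stand-in for the array that func1() would produce.
_matrix = None

def _init_worker(matrix):
    # Runs once in each worker process; stores the matrix in a
    # module-level global that worker functions can read.
    global _matrix
    _matrix = matrix

def row_sum(i):
    # Toy worker function that reads the shared global.
    return float(_matrix[i].sum())

def parallel_row_sums(matrix, indices):
    # initargs is delivered to _init_worker in every child process.
    with Pool(2, initializer=_init_worker, initargs=(matrix,)) as pool:
        return pool.map(row_sum, indices)
```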

I did find this question, which explains how to use .apply() with multiple arguments, and I suspect something similar is needed here; but my case is slightly different because I am using multiprocessing. If anyone could explain how to do this I'd greatly appreciate it. Thanks
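For reference, the multiple-arguments trick from that question uses functools.partial to pin the extra argument, leaving a one-argument function. A toy sketch of the same trick applied to pool.map (the names `scale` and `parallel_scale` are just illustrative, not my real code):

```python
from functools import partial
from multiprocessing import Pool

def scale(factor, x):
    # The "extra" argument comes first so partial can bind it.
    return factor * x

def parallel_scale(xs, factor):
    # partial(scale, factor) is picklable, so pool.map can send it
    # to the workers along with each element of xs.
    with Pool(2) as pool:
        return pool.map(partial(scale, factor), xs)
```

I'm just not sure how to combine this with the chunked-DataFrame setup above.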
