Say I have a dataset, like
import pandas as pd
import seaborn as sns

iris = pd.DataFrame(sns.load_dataset('iris'))
I can use spaCy with .apply to parse a string column into tokens (my real dataset has more than one word/token per entry, of course):
import spacy # (I have version 1.8.2)
nlp = spacy.load('en')
iris['species_parsed'] = iris['species'].apply(nlp)
result:
sepal_length ... species species_parsed
0 1.4 ... setosa (setosa)
1 1.4 ... setosa (setosa)
2 1.3 ... setosa (setosa)
I can also use this convenient multiprocessing function (thanks to this blogpost) to do most arbitrary apply functions on a dataframe in parallel:
from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

def parallelize_dataframe(df, func, num_partitions):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
for example:
def my_func(df):
    df['length_of_word'] = df['species'].apply(len)
    return df
num_cores = cpu_count()
iris = parallelize_dataframe(iris, my_func, num_cores)
result:
sepal_length species length_of_word
0 5.1 setosa 6
1 4.9 setosa 6
2 4.7 setosa 6
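For reference, here is a minimal sketch (independent of the iris data) of what that helper is doing: np.array_split chops the dataframe into roughly equal row blocks, each chunk is transformed independently, and pd.concat stitches the pieces back together in order. The toy frame and add_len function below are my own illustration, not part of the original helper.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'species': ['setosa'] * 5 + ['virginica'] * 5})

# Split 10 rows into 3 roughly equal chunks (sizes 4, 3, 3).
chunks = np.array_split(df, 3)
print([len(c) for c in chunks])

# Apply a per-chunk function and reassemble; row order is preserved.
def add_len(chunk):
    chunk = chunk.copy()
    chunk['length_of_word'] = chunk['species'].str.len()
    return chunk

out = pd.concat(add_len(c) for c in chunks)
print(out['length_of_word'].tolist())
```

Any function that takes a chunk and returns a dataframe slots into this pattern, which is why the pool.map version works for plain Python operations.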
...But for some reason, I can't apply the spaCy parser to a dataframe using multiprocessing this way:
def add_parsed(df):
    df['species_parsed'] = df['species'].apply(nlp)
    return df
iris = parallelize_dataframe(iris, add_parsed, num_cores)
result:
sepal_length species length_of_word species_parsed
0 5.1 setosa 6 ()
1 4.9 setosa 6 ()
2 4.7 setosa 6 ()
Is there some other way to do this? I'm loving spaCy for NLP, but I have a lot of text data, so I'd like to parallelize some of my processing functions, and this is where I ran into the issue.