I was trying to get fastText sentence embeddings for 80 million English tweets, parallelizing the work with dask as described in this answer: How do you parallelize apply() on Pandas Dataframes making use of all cores on one machine?
Here is my full code:
import dask.dataframe as dd
from dask.multiprocessing import get
import fasttext
import fasttext.util
import numpy as np
import pandas as pd
print('starting language: ' + 'en')
lang_output = pd.DataFrame()
lang_input = full_input.loc[full_input.name == 'en'] # 80 Million English tweets
ddata = dd.from_pandas(lang_input, npartitions = 96)
print('number of lines to compute: ' + str(len(lang_input)))
fasttext.util.download_model('en', if_exists='ignore') # English
ft = fasttext.load_model('cc.'+'en'+'.300.bin')
fasttext.util.reduce_model(ft, 20)
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
print('finished en')
This is the get_fasttext_sentence_embedding function:
def get_fasttext_sentence_embedding(row, ft):
    if pd.isna(row):
        return np.zeros(20)
    return ft.get_sentence_vector(row)
But I get a pickling error on this line:
lang_output['sentence_embedding'] = ddata.map_partitions(lambda lang_input: lang_input.apply((lambda x: get_fasttext_sentence_embedding(x.tweet_text, ft)), axis = 1)).compute(scheduler='processes')
This is the error I get:
TypeError: can't pickle fasttext_pybind.fasttext objects
Is there a way to parallelize fastText's get_sentence_vector with dask (or anything else)? I need to parallelize because getting sentence embeddings for 80 million tweets takes too much time, and each row of my data frame is completely independent of the others.
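I assume the error happens because the lambda passed to map_partitions captures ft, and the underlying fasttext_pybind object cannot be pickled when dask ships the task to the worker processes. The only workaround I have thought of so far is to not pass the model to the workers at all and instead load it lazily inside each process, so nothing un-picklable has to cross the process boundary. Here is a rough, untested sketch of what I mean (the helper names get_model and embed_partition are just placeholders I made up):

import dask.dataframe as dd
import fasttext
import fasttext.util
import numpy as np
import pandas as pd

ft_model = None  # cached once per worker process

def get_model():
    # lazily load (and reduce) the model inside the worker, so it is never pickled
    global ft_model
    if ft_model is None:
        ft_model = fasttext.load_model('cc.en.300.bin')
        fasttext.util.reduce_model(ft_model, 20)
    return ft_model

def embed_partition(df):
    # embed one pandas partition; missing tweets get a zero vector
    ft = get_model()
    return df.tweet_text.apply(
        lambda t: np.zeros(20) if pd.isna(t) else ft.get_sentence_vector(t))

ddata = dd.from_pandas(lang_input, npartitions=96)
lang_output['sentence_embedding'] = ddata.map_partitions(
    embed_partition, meta=('sentence_embedding', object)).compute(scheduler='processes')

Would something like this work, or does each worker re-loading the 300-dimensional model make it impractical? Any better approach would be appreciated.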