I'm used to using the map and starmap Pool methods to distribute a FUNCTION over any kind of iterable. Here is how I typically extract stem words from the raw content column of a pandas DataFrame:
import multiprocessing as mp

pool = mp.Pool(cpu_nb)  # one worker per CPU core
totalvocab_stemmed = pool.map(tokenize_and_stem, site_df["raw_content"])
pool.close()
Here is a good article on function parallelization in Python.
So far so good. But is there a nice and easy way to parallelize the execution of sklearn METHODS? Here is an example of what I would like to distribute:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.6, max_features=200000,
                                   min_df=0.2, stop_words=stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(self.site_df["raw_content"])
tfidf_matrix is not built element by element, so splitting site_df["raw_content"] into as many chunks as I have CPU cores, running a good old-fashioned pool over them, and stacking everything back together afterwards is not an option. I saw some interesting options:
- the IPython.parallel Client (source)
- using the parallel_backend function of sklearn.externals.joblib as a context manager (source); a rough sketch of what I mean is below
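For the second option, this is roughly the context-manager usage I have in mind (a minimal sketch; the 'multiprocessing' backend name is just a placeholder, and cpu_nb is the same worker count as above):

from sklearn.externals.joblib import parallel_backend

# wrap the fit in a joblib backend context; backend choice here is illustrative only
with parallel_backend('multiprocessing', n_jobs=cpu_nb):
    tfidf_matrix = tfidf_vectorizer.fit_transform(self.site_df["raw_content"])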
I might be dumb, but I wasn't very successful with either attempt. How would you do this?