
I'm used to using the map and starmap Pool methods to distribute a FUNCTION over any kind of iterable. Here is how I typically extract stemmed words from the raw-content column of a pandas DataFrame:

import multiprocessing as mp

pool = mp.Pool(cpu_nb)
totalvocab_stemmed = pool.map(tokenize_and_stem, site_df["raw_content"])
pool.close()

(See also: a good article on function parallelization in Python.)

So far so good. But is there a nice and easy way to parallelize the execution of sklearn METHODS? Here is an example of what I would like to distribute:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.6, max_features=200000,
                                   min_df=0.2, stop_words=stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

tfidf_matrix = tfidf_vectorizer.fit_transform(site_df["raw_content"])

tfidf_matrix is not built element by element, so splitting site_df["raw_content"] into as many chunks as I have CPU cores, running a good old-fashioned pool, and stacking everything back together afterwards is not an option. I saw some interesting options:

  • the IPython.parallel Client
  • using the parallel_backend function of sklearn.externals.joblib as a context manager (see the sketch just after this list)
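For reference, this is the context-manager pattern I mean in the second option. A sketch only: as far as I know, parallel_backend only speeds up estimators that use joblib internally, which TfidfVectorizer.fit_transform mostly does not, so the gain here may be small.

from sklearn.externals.joblib import parallel_backend

# Route any joblib-based parallelism inside sklearn through
# multiprocessing with cpu_nb workers for the duration of the block.
with parallel_backend("multiprocessing", n_jobs=cpu_nb):
    tfidf_matrix = tfidf_vectorizer.fit_transform(site_df["raw_content"])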

I might be dumb, but I wasn't very successful with either attempt. How would you do this?

See https://stackoverflow.com/questions/28396957/sklearn-tfidf-vectorizer-to-run-as-parallel-jobs. You can parallelize the transform step afterwards, but the fitting needs to be a single process, I think. – RichieK Feb 17 '19 at 16:48
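A minimal sketch of what that comment suggests (not a confirmed solution): fit once in a single process, then parallelize only the transform step and stack the per-chunk sparse results. This assumes tokenize_and_stem is a module-level function so the fitted vectorizer can be pickled, and Python 3.5+ so bound methods pickle too.

import multiprocessing as mp

import numpy as np
import scipy.sparse as sp

# Fit the vocabulary and idf weights in a single process.
fitted = tfidf_vectorizer.fit(site_df["raw_content"])

# Split the corpus into one chunk per core and transform the chunks
# in parallel; each worker returns a sparse matrix for its chunk.
chunks = np.array_split(site_df["raw_content"], cpu_nb)
with mp.Pool(cpu_nb) as pool:
    parts = pool.map(fitted.transform, chunks)

# Stack the per-chunk matrices; rows come back in the original order,
# so this matches what fit_transform would have produced.
tfidf_matrix = sp.vstack(parts)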
