
I have a large pandas DataFrame with 10 million news articles. This is how I have applied TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(df['articles'])

It takes a lot of time to process all the documents. I want to iterate over each article in the DataFrame one at a time. Alternatively, is it possible to pass the documents in chunks so that the vectorizer keeps updating the existing vocabulary without overwriting the old vocabulary dictionary?

I have gone through this SO post, but I don't quite see how to apply it to pandas. I have also heard about Python generators, but I'm not sure whether they are useful here.

  • That's what the TfidfVectorizer is doing. It's iterating over the documents one at a time and updating the vocabulary. What else would you like to do? Please explain in more detail. – Vivek Kumar Jul 18 '18 at 06:26
  • @VivekKumar Thanks for the comment. All I want is to reduce the iteration time for processing the documents with TfidfVectorizer. What I'm doing now takes a long time to compute the resulting matrix because it processes all of df['article'] at once; I want it done one by one. Is there a more professional way to perform TF-IDF on large datasets, either through `chunks` or by passing one document at a time from pandas using iterable generators? Hope you got it – James Jul 18 '18 at 07:00
  • As I said above, it does not process everything at one time. Inside the [`fit()` method of TfidfVectorizer](https://github.com/scikit-learn/scikit-learn/blob/ed5e127b/sklearn/feature_extraction/text.py#L790), it iterates over the series you pass and processes it one by one to fill the vocabulary and count matrix. It then processes the count matrix to prepare the tfidf matrix. – Vivek Kumar Jul 18 '18 at 07:12
  • @VivekKumar So what would be another way to calculate TF-IDF for `df['articles']` that is more convenient for 10 million records? – James Jul 18 '18 at 07:15
  • Yes, you can speed this up. See here: https://stackoverflow.com/a/26212970/5025009 – seralouk Jul 18 '18 at 09:49

1 Answer


You can iterate in chunks as below. The solution has been adapted from here.

import pandas as pd

def ChunkIterator():
    # Read the CSV in chunks and yield one article at a time
    for chunk in pd.read_csv(csvfilename, chunksize=1000):
        for doc in chunk['articles'].values:
            yield doc

corpus = ChunkIterator()
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(corpus)
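
If the articles are already in memory as a DataFrame (as in the question), a similar generator can be written directly over the column. This is a minimal sketch, assuming the DataFrame is called `df` with an `'articles'` column; note that, as pointed out in the comments, TfidfVectorizer already iterates over whatever iterable it receives, so a generator mainly helps when you want to avoid materialising the data some other way (e.g. re-reading it from disk).

    from sklearn.feature_extraction.text import TfidfVectorizer

    def dataframe_iterator(df, chunksize=1000):
        # Yield one article at a time, slicing the column chunk by chunk
        for start in range(0, len(df), chunksize):
            for doc in df['articles'].iloc[start:start + chunksize]:
                yield doc

    tfidf = TfidfVectorizer()
    feature_matrix = tfidf.fit_transform(dataframe_iterator(df))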
  • I am using the same code given above (by you), but why is it still consuming so much memory during the fit_transform process? I ran out of memory again. Any idea? – Mohsin Ashraf Feb 18 '20 at 08:54
  • 1
    not sure probably you can refer to comments at https://stackoverflow.com/questions/53754234/creating-a-tfidfvectorizer-over-a-text-column-of-huge-pandas-dataframe. somebody did get a memory error there also – oldmonk Feb 18 '20 at 09:14
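
If memory rather than iteration time turns out to be the limit (as in the comments above), one alternative worth knowing about, not from this answer but a common out-of-core option, is HashingVectorizer combined with TfidfTransformer: the hashing step uses a fixed-size feature space and never builds a vocabulary dictionary. A hedged sketch, reusing the `ChunkIterator` generator from above:

    from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

    # HashingVectorizer keeps memory bounded: no vocabulary_ dict is built,
    # terms are mapped into a fixed number of hash buckets instead.
    hasher = HashingVectorizer(n_features=2**20, alternate_sign=False)
    counts = hasher.transform(ChunkIterator())

    # TfidfTransformer then applies the IDF weighting to the hashed counts
    feature_matrix = TfidfTransformer().fit_transform(counts)

The trade-off is that you lose the mapping from feature index back to the original term, so this only suits cases where the matrix is fed straight into a downstream model.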