I have a large pandas dataframe with 10 million records of news articles. This is how I have applied TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(df['articles'])
It took a lot of time to process all the documents. What I want is to iterate over the articles one at a time, or alternatively to pass the documents in chunks so that the vectorizer keeps updating the existing vocabulary without overwriting the old vocabulary dictionary. Is that possible?
I have gone through this SO post, but I don't exactly understand how to apply it to pandas. I have also heard about Python generators, but I'm not sure whether they are useful here.
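For example, is a generator along these lines the right direction? This is a minimal sketch with a toy dataframe (the real one has 10 million rows); it streams articles one at a time instead of materializing a separate list, but as far as I can tell `fit_transform` still processes everything in a single call rather than updating the vocabulary chunk by chunk:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in for the real 10M-row dataframe
df = pd.DataFrame({'articles': ['first news article',
                                'second news article',
                                'third story']})

def iter_articles(frame):
    # yield one article at a time instead of building a list in memory
    for text in frame['articles']:
        yield text

tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(iter_articles(df))
print(feature_matrix.shape)  # one row per article, one column per vocabulary term
```

Does passing a generator like this actually help with the processing time, or does it only reduce memory usage?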