I am currently processing a fairly big dataset, and I intend to use TfidfVectorizer to analyze it.
There were previous posts about MemoryError when using TfidfVectorizer; however, in my case the MemoryError occurs before I even feed the data into the TfidfVectorizer. Here is my code:
```python
# read data; data['description'] is the text content
data = pd.read_csv(...)
```
```python
# process data
from sklearn.feature_extraction.text import TfidfVectorizer

description_vectorizer = TfidfVectorizer(
    max_features=500,
    min_df=0.2,
    ngram_range=(2, 3),
    preprocessor=preprocessor,
    stop_words='english',
)
description_vectorizer.fit(data.description.values.astype('U'))
```
Many posts here talk about MemoryError while fitting TfidfVectorizer, but I found that the MemoryError occurs earlier, when I convert the data to Unicode, i.e. IN THIS STEP: `data.description.values.astype('U')`.
So, strategies for tuning the parameters of TfidfVectorizer are NOT useful in my case.
Has anyone encountered this before and knows how to fix it? Many thanks.
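For reference, here is a minimal sketch of the workaround I am experimenting with (the toy DataFrame below is just a stand-in for my real CSV). NumPy's `'U'` dtype is fixed-width, so `astype('U')` pads every row to the length of the longest string in the column, which can blow up memory on a large dataset. Keeping the column as ordinary Python `str` objects (filling missing values first) avoids that allocation, and TfidfVectorizer accepts the Series directly:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical small frame standing in for the real CSV read via pd.read_csv.
data = pd.DataFrame({"description": ["red apple pie", None, "green apple tart"]})

# Instead of data.description.values.astype('U'), which allocates a fixed-width
# NumPy unicode array (every row padded to the longest string), fill NaN and
# cast element-wise so each string keeps its own length.
docs = data["description"].fillna("").astype(str)

# Simplified vectorizer settings for the toy data; the real parameters
# (max_features, min_df, preprocessor, ...) can be passed the same way.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)
print(X.shape)  # sparse matrix: one row per document
```

The key point is that `fillna("")` handles the NaN values that `astype('U')` was implicitly papering over, so no whole-array fixed-width conversion is needed.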