
I am currently processing a quite big dataset, and I intend to use TfidfVectorizer to analyze it.

There are previous posts about MemoryError when using TfidfVectorizer; however, in my case the MemoryError occurs before I feed the data into the TfidfVectorizer. Here is my code.

  1. Read the data

    import pandas as pd

    data = pd.read_csv(...)

    # data['description'] is the text content

  2. Process the data

    from sklearn.feature_extraction.text import TfidfVectorizer

    # preprocessor is a custom function defined elsewhere
    description_vectorizer = TfidfVectorizer(max_features=500,
                                             min_df=0.2,
                                             ngram_range=(2, 3),
                                             preprocessor=preprocessor,
                                             stop_words='english')

    description_vectorizer.fit(data.description.values.astype('U'))
    

Many posts here talk about MemoryError when fitting a TfidfVectorizer, but I found that the MemoryError occurs when I convert the data to Unicode, i.e. in this step: data.description.values.astype('U').

So strategies for tuning the parameters of TfidfVectorizer are NOT useful in my case.

Has anyone encountered this before and knows how to fix it? Many thanks.


3 Answers


I know this thread is quite old, but after encountering the same problem just now and not finding any answers, I hope this will help someone who finds themselves in the same position.

The solution is actually quite simple; it's just a small error in your code. Instead of applying .astype() to the NumPy array, like so:

data.description.values.astype('U')

just swap the order and apply .astype() to the pandas Series first:

data.description.astype('U').values
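If I understand the memory behaviour correctly, the reason the order matters is that .values first produces a NumPy object array, and calling .astype('U') on that allocates a fixed-width unicode array sized by the longest string for every row, whereas astype on the Series keeps ordinary Python string objects. A minimal self-contained sketch of the difference (the toy DataFrame is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset (invented for illustration).
data = pd.DataFrame({"description": ["short", "a much longer description"]})

# Problematic order: .values gives a NumPy object array, and astype('U')
# on it allocates a fixed-width unicode array sized by the LONGEST string
# for EVERY row -- on a big dataset this is where the MemoryError hits.
bad = data.description.values.astype('U')
print(bad.dtype)   # a fixed-width dtype, e.g. <U25

# Working order: astype('U') on the pandas Series keeps ordinary Python
# string objects (object dtype), so .values needs no giant contiguous block.
good = data.description.astype('U').values
print(good.dtype)  # object
```

Both produce the same strings, so TfidfVectorizer is happy either way; only the intermediate memory footprint differs.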

Hope this helps!


In case anyone wants to know, I found one way to do this: Python: use .loc to select data produces different results

Maybe this is a dumb question; please let me know if you think I should remove it. Thanks.


I have experienced precisely this error as well. The astype function fails so quickly, with so little additional memory actually allocated, that I can only assume a preemptive calculation of the memory requirement is failing, it is failing to find a contiguous memory block, or it is a bug.

I couldn't find much at all on how to solve this problem, so I avoided it by removing the astype conversion completely and converting the underlying data set to Unicode before it is loaded by pandas.
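For anyone taking this route, here is a minimal sketch of what converting the data set to Unicode up front can look like; the file names and the latin-1 source encoding are assumptions for illustration:

```python
# Hypothetical one-off re-encode: read the raw CSV in its original
# encoding (assumed latin-1 here) and write it back out as UTF-8, so the
# text is already proper Unicode by the time pandas reads it and no
# .astype('U') is needed afterwards.
src_path, dst_path = "data_raw.csv", "data_utf8.csv"

# Create a tiny stand-in source file just for this demonstration.
with open(src_path, "w", encoding="latin-1") as f:
    f.write("description\ncafé au lait\n")

# Re-encode line by line, so the whole file never sits in memory at once.
with open(src_path, encoding="latin-1") as src, \
        open(dst_path, "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)

# pandas then reads clean Unicode strings directly:
# data = pd.read_csv(dst_path)
```

The line-by-line loop matters here: slurping the whole file would defeat the purpose of avoiding a large allocation.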
