
I am currently processing a quite big dataset, and I intend to use TfidfVectorizer to analyze it.

There are previous posts about MemoryError when using TfidfVectorizer; however, in my case the MemoryError occurs before I feed the data into the TfidfVectorizer. Here is my code.

  1. Read the data

    import pandas as pd

    data = pd.read_csv(...)

    # data['description'] is the text content

  2. Process the data

    from sklearn.feature_extraction.text import TfidfVectorizer

    # preprocessor is a custom function defined elsewhere
    description_vectorizer = TfidfVectorizer(max_features=500,
                                             min_df=0.2,
                                             ngram_range=(2, 3),
                                             preprocessor=preprocessor,
                                             stop_words='english')

    description_vectorizer.fit(data.description.values.astype('U'))
    

Many posts here talk about MemoryError when fitting a TfidfVectorizer, but I found that the MemoryError occurs when I convert the data to Unicode, i.e. in this step: data.description.values.astype('U').

So strategies for tuning the parameters of TfidfVectorizer are NOT useful in my case.

Has anyone encountered this before and knows how to fix it? Many thanks.


3 Answers


I know this thread is quite old, but after encountering the same problem just now and not finding any answers, I hope this will help someone who finds themselves in the same position.

The solution is actually quite simple; it's just a small error in your code. Instead of applying .astype() to the NumPy array, like so:

data.description.values.astype('U')

just swap the order and apply .astype() to the pandas Series first:

data.description.astype('U').values
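If I understand the memory behaviour correctly, the reason the order matters is that .values first produces a NumPy object array, and calling .astype('U') on that allocates a fixed-width unicode array sized by the longest string for every row, whereas astype on the Series keeps ordinary Python string objects. A minimal self-contained sketch of the difference (the toy DataFrame is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the real dataset (invented for illustration).
data = pd.DataFrame({"description": ["short", "a much longer description"]})

# Problematic order: .values gives a NumPy object array, and astype('U')
# on it allocates a fixed-width unicode array sized by the LONGEST string
# for EVERY row -- on a big dataset this is where the MemoryError hits.
bad = data.description.values.astype('U')
print(bad.dtype)   # a fixed-width dtype, e.g. <U25

# Working order: astype('U') on the pandas Series keeps ordinary Python
# string objects (object dtype), so .values needs no giant contiguous block.
good = data.description.astype('U').values
print(good.dtype)  # object
```

Both produce the same strings, so TfidfVectorizer is happy either way; only the intermediate memory footprint differs.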

Hope this helps!


In case anyone wants to know, I found one way to do this: Python: use .loc to select data produces different results

Maybe this is a dumb question; please let me know if you think I should remove it. Thanks.


I have experienced precisely this error as well. The astype function fails so quickly, with so little additional memory actually allocated, that I can only assume a preemptive calculation of the memory requirement is failing, it is failing to find a contiguous memory block, or it is a bug.

I couldn't find much at all on how to solve this problem, so I avoided it by removing the astype conversion completely and converting the underlying data set to Unicode before it is loaded by pandas.
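For anyone taking this route, here is a minimal sketch of what converting the data set to Unicode up front can look like; the file names and the latin-1 source encoding are assumptions for illustration:

```python
# Hypothetical one-off re-encode: read the raw CSV in its original
# encoding (assumed latin-1 here) and write it back out as UTF-8, so the
# text is already proper Unicode by the time pandas reads it and no
# .astype('U') is needed afterwards.
src_path, dst_path = "data_raw.csv", "data_utf8.csv"

# Create a tiny stand-in source file just for this demonstration.
with open(src_path, "w", encoding="latin-1") as f:
    f.write("description\ncafé au lait\n")

# Re-encode line by line, so the whole file never sits in memory at once.
with open(src_path, encoding="latin-1") as src, \
        open(dst_path, "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)

# pandas then reads clean Unicode strings directly:
# data = pd.read_csv(dst_path)
```

The line-by-line loop matters here: slurping the whole file would defeat the purpose of avoiding a large allocation.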
