I have a vocabulary of about 50,000 terms and a corpus of about 20,000 documents in a Pandas DataFrame like this:
import pandas as pd
vocab = {"movie", "good", "very"}
corpus = pd.DataFrame({
    "ID": [100, 200, 300],
    "Text": ["It's a good movie", "Wow it's very good", "Bad movie"]
})
The following code produces a SciPy CSR matrix in only about 5 seconds:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(use_idf=False, ngram_range=(1, 2),
                      vocabulary=vocab)
vec.transform(corpus["Text"])
However, converting the CSR matrix to a Pandas SparseDataFrame is so slow that I have to abort it:
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]))
dtm["ID"] = corpus["ID"]
Attempted Solutions
I tried appending .tocoo() to vec.transform(corpus["Text"]), but it makes no difference in speed. Appending .toarray() is no good either, since it raises

ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size
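Concretely, these are the two variants I tried (a minimal sketch of my attempts; with the real data the matrix is roughly 20,000 x 50,000):

# COO conversion before building the SparseDataFrame; still far too slow
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]).tocoo())

# Dense conversion, which fails with the ValueError above on the real-sized matrix
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]).toarray())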
I also tried SparseSeries as suggested at stackoverflow.com/q/17818783 but it resulted in a MemoryError:
tfidf = vec.transform(corpus["Text"])
dtm = pd.SparseDataFrame([pd.SparseSeries(tfidf[k].toarray().ravel())
                          for k in range(tfidf.shape[0])])
The MemoryError cannot be resolved by changing the list comprehension to a generator expression either, because the latter raises UnboundLocalError: local variable 'mgr' referenced before assignment.
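For reference, that generator-expression attempt is the same construction with the square brackets dropped:

# Same SparseSeries construction with a generator expression instead of a list;
# pd.SparseDataFrame cannot consume it and fails with the 'mgr' UnboundLocalError
dtm = pd.SparseDataFrame(pd.SparseSeries(tfidf[k].toarray().ravel())
                         for k in range(tfidf.shape[0]))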
I need a SparseDataFrame because I want to join / merge the ID column with another DataFrame. Is there a faster way?
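For context, the join I ultimately want looks roughly like this, where other_df is a hypothetical stand-in for the second DataFrame, which also carries an ID column:

# other_df is a placeholder for the second DataFrame I need to merge with
other_df = pd.DataFrame({"ID": [100, 300], "Label": [1, 0]})

# This is the kind of merge I want to run on the document-term matrix
merged = dtm.merge(other_df, on="ID", how="inner")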