I have a vocabulary of about 50,000 terms and a corpus of about 20,000 documents in a Pandas DataFrame like this:
import pandas as pd
vocab = {"movie", "good", "very"}
corpus = pd.DataFrame({
    "ID": [100, 200, 300],
    "Text": ["It's a good movie", "Wow it's very good", "Bad movie"]
})
The following code produces a SciPy CSR matrix in only about 5 seconds:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(use_idf=False, ngram_range=(1, 2),
                      vocabulary=vocab)
vec.transform(corpus["Text"])
However, converting the CSR matrix to a Pandas SparseDataFrame is so slow that I have to abort it:
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]))
dtm["ID"] = corpus["ID"]
Attempted Solutions
I tried appending .tocoo() to vec.transform(corpus["Text"]), but it makes no difference in speed. Appending .toarray() is no good either, since it raises

ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size
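Concretely, these are the two variants I tried (a minimal sketch of my attempts; with the real data the matrix is roughly 20,000 x 50,000):

# COO conversion before building the SparseDataFrame; still far too slow
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]).tocoo())

# Dense conversion, which fails with the ValueError above on the real-sized matrix
dtm = pd.SparseDataFrame(vec.transform(corpus["Text"]).toarray())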
I also tried SparseSeries as suggested at stackoverflow.com/q/17818783 but it resulted in a MemoryError:
tfidf = vec.transform(corpus["Text"])
dtm = pd.SparseDataFrame([pd.SparseSeries(tfidf[k].toarray().ravel())
                          for k in range(tfidf.shape[0])])
The MemoryError cannot be resolved by changing the list comprehension to a generator expression either, because the latter raises UnboundLocalError: local variable 'mgr' referenced before assignment.
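For reference, that generator-expression attempt is the same construction with the square brackets dropped:

# Same SparseSeries construction with a generator expression instead of a list;
# pd.SparseDataFrame cannot consume it and fails with the 'mgr' UnboundLocalError
dtm = pd.SparseDataFrame(pd.SparseSeries(tfidf[k].toarray().ravel())
                         for k in range(tfidf.shape[0]))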
I need a SparseDataFrame because I want to join / merge the ID column with another DataFrame. Is there a faster way?
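For context, the join I ultimately want looks roughly like this, where other_df is a hypothetical stand-in for the second DataFrame, which also carries an ID column:

# other_df is a placeholder for the second DataFrame I need to merge with
other_df = pd.DataFrame({"ID": [100, 300], "Label": [1, 0]})

# This is the kind of merge I want to run on the document-term matrix
merged = dtm.merge(other_df, on="ID", how="inner")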