Creating a sparse matrix without initializing the dense matrix first

Asked Mar 11 '21 at 21:44

Active Mar 11 '21 at 22:18


For an NLP task, I am creating a document-term matrix with dimensions 4280 x 90141, of which more than 98% of the entries are zeros. The dense representation of this matrix requires a lot of memory, so I would like to create it as a sparse matrix.

The answers at this link suggest using SciPy. As far as I understand, though, that requires initializing the dense matrix before converting it to a sparse one. Is there a package or existing code that creates a sparse document-term representation without initializing a dense matrix first?

I am thinking about something like:

from collections import Counter

dense_doc_term = []

# corpus is assumed to be an iterable of tokenized documents
for doc in corpus:
    dense_doc_term.append(Counter(doc))

Would that be a good approach?

edited Mar 11 '21 at 22:18

CJR


asked Mar 11 '21 at 21:44

Emil


`This requires a lot of memory and thus I would like to create a **dense** representation for this matrix.` This doesn't sound right. Do you mean sparse? – Quang Hoang Mar 11 '21 at 21:46
Sorry for the lack of clarity. No, I mean that I want to create a representation that only includes non-zero values – Emil Mar 11 '21 at 21:48
Yes, that's the sparse matrix, which only indexes the non-zero terms, not dense. – Quang Hoang Mar 11 '21 at 21:50
Thanks, is it better now? – Emil Mar 11 '21 at 21:57
Spend some time reading the sparse matrix docs. – hpaulj Mar 11 '21 at 22:20
It sounds like you want to just use this: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html – CJR Mar 11 '21 at 22:21
There are plenty of more recent [scipy] [sparse-matrix] tagged posts. – hpaulj Mar 11 '21 at 22:34
@CJR, doesn't that require me to call the fit_transform() method whenever I want to perform calculations, such as getting the total count of one word across all documents? – Emil Mar 11 '21 at 22:54

Basically, a sparse matrix is created from arrays of row coordinates, column coordinates, and corresponding values. There are variations for special layouts, but for a start focus on the `coo` format and its inputs. The docs also describe the storage. It helps to experiment, making small matrices and examining the attributes. – hpaulj Mar 12 '21 at 01:15
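
The coordinate layout described in the last comment can be sketched as follows, combining it with the per-document Counter idea from the question. This is only a minimal illustration, not a definitive implementation: the function names `coo_triplets` and `to_csr` and the toy corpus are hypothetical, and the corpus is assumed to be already tokenized. No dense matrix is allocated at any point; only the non-zero counts are collected.

```python
from collections import Counter


def coo_triplets(corpus):
    """Return (rows, cols, vals, vocab) for a tokenized corpus."""
    vocab = {}  # term -> column index, assigned when a term is first seen
    rows, cols, vals = [], [], []
    for row, doc in enumerate(corpus):
        # Counter holds only the terms that occur in this document
        for term, count in Counter(doc).items():
            col = vocab.setdefault(term, len(vocab))
            rows.append(row)
            cols.append(col)
            vals.append(count)
    return rows, cols, vals, vocab


def to_csr(rows, cols, vals, shape):
    """Build a SciPy sparse matrix from the triplets (requires SciPy)."""
    from scipy.sparse import coo_matrix  # imported lazily; only needed here
    return coo_matrix((vals, (rows, cols)), shape=shape).tocsr()


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [["a", "b", "a"], ["b", "c"], ["c", "c", "c"]]
rows, cols, vals, vocab = coo_triplets(corpus)
```

With SciPy installed, `X = to_csr(rows, cols, vals, shape=(len(corpus), len(vocab)))` gives a sparse document-term matrix, and `X.sum(axis=0)` then yields the total count of each term across all documents without rebuilding anything.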


0 Answers