For an NLP task, I am creating a document-term matrix with dimensions 4280 x 90141, of which >98% of the entries are zeros. The dense representation of this matrix requires a lot of memory, so I would like to create it as a sparse matrix.

In this link they suggest using SciPy. But as far as I understand, it requires initializing the dense matrix before converting it to a sparse one. Is there a package or existing code that creates a sparse document-term representation without initializing a dense matrix first?

I am thinking about something like:

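from collections import Counter

# corpus is assumed to be an iterable of tokenized documents (lists of terms)
dense_doc_term = []  # one Counter per document, holding only the non-zero counts

for doc in corpus:
    dense_doc_term.append(Counter(doc))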

Would that be a good approach?

CJR
Emil
  • `This requires a lot of memory and thus I would like to create a **dense** representation for this matrix.` This doesn't sound right. Do you mean sparse? – Quang Hoang Mar 11 '21 at 21:46
  • Sorry for the confusion. No, I mean that I want to create a representation that only includes non-zero values – Emil Mar 11 '21 at 21:48
  • Yes, that's the sparse matrix, which only indexes the non-zero terms, not dense. – Quang Hoang Mar 11 '21 at 21:50
  • Thanks, is it better now? – Emil Mar 11 '21 at 21:57
  • Spend some time reading the sparse matrix docs. – hpaulj Mar 11 '21 at 22:20
  • It sounds like you want to just use this: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html – CJR Mar 11 '21 at 22:21
  • There are plenty of more recent [scipy] [sparse-matrix] tagged posts. – hpaulj Mar 11 '21 at 22:34
  • @CJR, doesn't that require me to use the fit_transform() method whenever I want to perform calculations (such as getting the total count of one word in all documents)? – Emil Mar 11 '21 at 22:54
  • Basically a sparse matrix is created from arrays of row coordinates, column coordinates, and corresponding values. There are variations for special layouts, but for a start focus on the `coo` format and its inputs. The docs also describe the storage. It helps to experiment, making small matrices and examining the attributes. – hpaulj Mar 12 '21 at 01:15
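
To make the `coo` suggestion in the comments concrete, here is a minimal sketch of how the `Counter` idea from the question could feed a sparse matrix without ever allocating the dense one. It assumes each document is already a list of tokens. The helper names `build_coo_triplets` and `to_csr` and the toy corpus are made up for illustration; only `scipy.sparse.coo_matrix((data, (row, col)), shape=...)` and `.tocsr()` are actual SciPy calls.

```python
from collections import Counter


def build_coo_triplets(corpus):
    """Collect (row, col, value) triplets for the non-zero counts only.

    `corpus` is assumed to be an iterable of tokenized documents.
    Returns three parallel lists plus the term -> column index mapping.
    """
    vocab = {}  # term -> column index, assigned the first time a term is seen
    rows, cols, vals = [], [], []
    for row, doc in enumerate(corpus):
        for term, count in Counter(doc).items():
            col = vocab.setdefault(term, len(vocab))
            rows.append(row)
            cols.append(col)
            vals.append(count)
    return rows, cols, vals, vocab


def to_csr(rows, cols, vals, shape):
    """Hypothetical helper: hand the triplets to SciPy (requires scipy)."""
    from scipy.sparse import coo_matrix

    return coo_matrix((vals, (rows, cols)), shape=shape).tocsr()


# Toy corpus, purely illustrative
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "the"]]
rows, cols, vals, vocab = build_coo_triplets(corpus)

# With SciPy installed, the matrix itself would be built like this:
# matrix = to_csr(rows, cols, vals, shape=(len(corpus), len(vocab)))
```

Only the three triplet lists grow with the number of non-zero entries, so memory stays proportional to the counts that actually occur. Once the matrix exists, a column sum such as `matrix.sum(axis=0)` should give the total count of each word across all documents without rebuilding anything.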

0 Answers