Error when using cast_dtm with large corpus

Question

I am using the cast_dtm command to convert a one-term-per-document-per-row data frame to a document-term matrix, to be used as input to LDA. The code is:

posts_tokenized.dt %>% cast_dtm(id, word, term_frequency) -> posts.dtm

It worked fine with a corpus of 33,000 documents, but it gives the following error with a corpus of 147,242 documents:

Error in validObject(r) : invalid class "dgTMatrix" object: length(Dimnames[1]) differs from Dim[1] which is 147242

Any help is appreciated!

EDIT: The tokenized dataframe looks like this:

> head(df_tokenized)
# A tibble: 6 x 3
                            id           word term_frequency
                        <fctr>          <chr>          <int>
1 6013004059_10154817753659060 demonetisation              1
2 6013004059_10154828153334060 demonetisation              1
3 6013004059_10154835596219060 demonetisation              1
4 6013004059_10154837355359060 demonetisation              1
5 6013004059_10154872354154060 demonetisation              1
6 6013004059_10154556655804060         hanjin              1

None of the columns contain empty or NA values.

What's happening here is that `cast_dtm` is trying to make a sparse matrix, but the [dimensions of the row names (the documents) are not matching up](https://stackoverflow.com/questions/32353191/error-when-making-a-sparse-matrix). Something in the documents of your bigger corpus is making a weird entry. An empty document maybe? With no tokens? Can you show us what the tokenized, tidy data frame looks like? Is it factor data? – Julia Silge, Jul 12 '17 at 21:23
@JuliaSilge the error goes away when I convert the id column from factor to numeric. However, on the smaller dataset, which has exactly the same structure as the larger one I included in the question, I did not have to make this conversion. What could be the difference? – rakshita nagalla, Jul 13 '17 at 14:51
My instinct here is that it isn't the *size* of your dataset; it is one of the documents that is in the larger dataset. I tried to make a reproducible example with factor document IDs and something weird like `NA` values, but could not reproduce this error. If you are able to figure out the specific document that causes this to break, please do let me know. It seems like an edge case that we might want to fix. – Julia Silge, Jul 13 '17 at 18:49
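
One way to start hunting for the offending document is sketched below in base R. The data frame is a made-up stand-in for the real one, and the three checks (unused factor levels, `NA` ids, empty-string ids) are only guesses at what might differ between the two corpora, not a confirmed cause of the error.

```r
# Toy data frame, deliberately built with an unused factor level and an NA id.
# All values here are invented for illustration.
df_tokenized <- data.frame(
  id = factor(c("doc_a", "doc_b", NA), levels = c("doc_a", "doc_b", "doc_c")),
  word = c("demonetisation", "hanjin", "hanjin"),
  term_frequency = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

ids <- df_tokenized$id

# Levels declared on the factor but never used by any row.
unused_levels <- setdiff(levels(ids), as.character(ids))

# Rows whose id is missing, and rows whose id is an empty string.
na_rows <- which(is.na(ids))
empty_rows <- which(!is.na(ids) & as.character(ids) == "")

# Number of declared levels versus number of ids that actually occur.
n_levels <- nlevels(ids)
n_used <- length(unique(ids[!is.na(ids)]))

print(unused_levels)
print(na_rows)
print(empty_rows)
print(c(levels = n_levels, used = n_used))
```

If the number of levels and the number of ids in use disagree on the larger corpus but not on the smaller one, that mismatch would be a natural first suspect.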

score 0 · Answer 1 · answered Nov 12 '19 at 22:10

I had a similar problem and fixed it by converting the factor column to character.
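
A minimal sketch of that conversion, assuming a toy data frame with the same three columns as in the question (the rows are invented). The `cast_dtm` call is guarded so that the snippet still runs when tidytext or tm is not installed.

```r
# Toy stand-in for the tokenized data frame; the rows are made up.
posts_tokenized <- data.frame(
  id = factor(c("doc_a", "doc_a", "doc_b")),
  word = c("demonetisation", "hanjin", "demonetisation"),
  term_frequency = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Convert the factor id column to character before casting.
posts_tokenized$id <- as.character(posts_tokenized$id)

# Cast only if tidytext and tm (which cast_dtm relies on) are available.
if (requireNamespace("tidytext", quietly = TRUE) &&
    requireNamespace("tm", quietly = TRUE)) {
  posts_dtm <- tidytext::cast_dtm(posts_tokenized, id, word, term_frequency)
  print(dim(posts_dtm))
}
```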

Brn

Can you also add some code to show how to accomplish this? That would make the answer more useful – razdi Nov 12 '19 at 22:46

score 0 · Answer 2 · answered Jan 31 '20 at 10:51

I had a similar problem and fixed it by converting the column to character:

dtm$ID<-as.character(dtm$ID)

In my case I had three columns: ID, Word and Count.

I changed ID and Word to character. Count was already an integer.

Shuhom Choudhury
