Summarizing R corpus with doc ID

Question

I've created a DocumentTermMatrix similar to the one in this post:

I've kept the doc_id so I can join the data back to a larger data set.

My issue is that I can't figure out how to summarize the words and word count and keep the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).

Without needing the doc_id, this is straightforward, and I use this code to get my end result.

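library(tm)

# df is assumed to have doc_id and text columns, which DataframeSource expects
df_source <- DataframeSource(df)
df_corpus <- VCorpus(df_source)
tdm <- TermDocumentMatrix(df_corpus)
tdm_m <- as.matrix(tdm)

# total frequency of each word across all documents
word_freqs <- sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted <- data.frame(word = names(word_freqs), freq = word_freqs)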

I've tried several different approaches to this and just cannot get it to work. This is where I am now (image). I've used this code:

tdm_m <- cbind("doc.id" = rownames(tdm_m), tdm_m)

to move the doc_id into a column in the matrix, but cannot get the numeric columns to sum and keep the doc_id associated.

Any help is greatly appreciated, thanks!

Expected result:

Please add a small expected output to the question. – phiver, Sep 07 '18 at 17:38
Updated original question with expected result. – Seth Brundle, Sep 12 '18 at 20:07

score 0 · Accepted Answer · answered Sep 13 '18 at 09:14

Looking at your expected output, you don't need the sort(rowSums(tdm_m), decreasing = TRUE) line. It sums each word across all documents, so you get Apple = 3 instead of 2 in one document and 1 in another.
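To illustrate the difference, here is a minimal base R sketch on a made-up matrix. The terms, document ids and counts are invented for the example and are not taken from your data.

```r
# Toy term-document matrix (illustrative values): terms in rows, documents in columns.
tdm_m <- matrix(c(2, 1,
                  0, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(Terms = c("apple", "banana"),
                                Docs = c("1", "2")))

# rowSums() adds up each term across documents, so the per-document split is lost.
word_freqs <- rowSums(tdm_m)

# A long format keeps one row per (term, document) pair instead.
long <- as.data.frame(as.table(tdm_m), responseName = "freq",
                      stringsAsFactors = FALSE)
long <- long[long$freq > 0, ]
```

Here word_freqs only holds one total per term, while long still has a Docs column next to each count.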

To get the output you want, DocumentTermMatrix is slightly easier to use than TermDocumentMatrix, because the documents are already in the rows and there is no need to switch columns around. Here are two ways to get the result: one with melt from the reshape2 package and one with the tidy function from the tidytext package.

# example 1
dtm <- DocumentTermMatrix(df_corpus)
dtm_df <- reshape2::melt(as.matrix(dtm))
# remove 0 values and order the data.frame
dtm_df <- dtm_df[dtm_df$value > 0, ]
dtm_df <- dtm_df[order(dtm_df$value, decreasing = TRUE), ]

Or use tidytext::tidy to get the data into a tidy format. There is no need to remove the 0 values, because tidytext doesn't convert the data into a dense matrix before casting it to a data.frame.

# example 2
dtm_tidy <- tidytext::tidy(dtm)
# order the data.frame or start using dplyr syntax if needed
dtm_tidy <- dtm_tidy[order(dtm_tidy$count, decreasing = TRUE), ]

In my tests tidytext is a lot faster and uses less memory as there is no need to first create a dense matrix.
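To end up with the three columns you asked for (doc_id, word, freq) and join them back, something along these lines should work. This is a sketch on toy data: the data frames df and larger, their contents and the source column are made up for illustration. It relies on tidytext::tidy returning the columns document, term and count for a DocumentTermMatrix.

```r
library(tm)

# Hypothetical input: DataframeSource expects doc_id and text columns.
df <- data.frame(doc_id = c("a1", "a2"),
                 text = c("apple apple banana", "apple cherry"),
                 stringsAsFactors = FALSE)

df_corpus <- VCorpus(DataframeSource(df))
dtm <- DocumentTermMatrix(df_corpus)

# One row per (document, term) pair with a non-zero count.
dtm_tidy <- tidytext::tidy(dtm)

# Rename to the three columns wanted for the join.
word_counts <- data.frame(doc_id = dtm_tidy$document,
                          word = dtm_tidy$term,
                          freq = dtm_tidy$count,
                          stringsAsFactors = FALSE)

# Hypothetical larger data set that shares the doc_id key.
larger <- data.frame(doc_id = c("a1", "a2"),
                     source = c("x", "y"),
                     stringsAsFactors = FALSE)

joined <- merge(larger, word_counts, by = "doc_id")
```

merge is used here to stay in base R; dplyr::left_join on doc_id would do the same job if you are already using dplyr.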
