I've created a DocumentTermMatrix similar to the one in this post:
Keep document ID with R corpus
I've kept the doc_id so that I can join the data back to a larger data set.
My issue is that I can't figure out how to summarize the word counts per document while keeping the doc_id. I'd like to be able to join this data to an existing data set using only 3 columns (doc_id, word, freq).
Without the doc_id, this is straightforward, and I use this code to get my end result:
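library(tm)
# df is expected to have doc_id and text columns (required by DataframeSource in recent tm versions)
df_source <- DataframeSource(df)
df_corpus <- VCorpus(df_source)
tdm <- TermDocumentMatrix(df_corpus)
tdm_m <- as.matrix(tdm)
# Total count of each word across all documents (this is where doc_id is lost)
word_freqs <- sort(rowSums(tdm_m), decreasing = TRUE)
tdm_sorted <- data.frame(word = names(word_freqs), freq = word_freqs)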
I've tried several different approaches to this and just cannot get it to work. This is where I am now (image). I've used this code:
tdm_m=cbind("doc.id" =rownames(tdm_m),tdm_m)
to move the doc_id into a column in the matrix, but I cannot get the numeric columns to sum while keeping the doc_id associated with each word.
Any help is greatly appreciated, thanks!
Expected result:
doc.id | word | frequency
1 | Apple | 2
2 | Apple | 1
3 | Banana | 4
3 | Orange | 1
4 | Pear | 3
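To make the target shape concrete, here is a minimal base-R sketch on a hand-built toy matrix. The matrix `m` and its values are made up to mirror the table above, with documents as rows (as in a DocumentTermMatrix); it is not my actual tm pipeline, only an illustration of the long format I'm after.

```r
# Toy document-by-term count matrix (hypothetical data mirroring the expected result)
m <- matrix(
  c(2, 1, 0, 0,   # Apple counts for docs 1..4
    0, 0, 4, 0,   # Banana
    0, 0, 1, 0,   # Orange
    0, 0, 0, 3),  # Pear
  nrow = 4,
  dimnames = list(Docs  = c("1", "2", "3", "4"),
                  Terms = c("Apple", "Banana", "Orange", "Pear"))
)

# Reshape wide counts into one row per (document, word) pair
long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
names(long) <- c("doc.id", "word", "frequency")

# Keep only words that actually occur in a document, ordered by document
long <- long[long$frequency > 0, ]
long <- long[order(long$doc.id, long$word), ]
rownames(long) <- NULL
print(long)
```

The open question is how to get from the real `tdm_m` to something like `long` without losing the doc_id along the way.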