dtm <- DocumentTermMatrix(reuters, control=list(wordLengths=c(1,Inf)))
I want to turn dtm into a term-term matrix. The attempt below is incorrect:
dtm <- dtm %*% t(dtm)
How might it be done?
If I understand the structure of a document-term matrix correctly, it is t(dtm) %*% dtm. See this answer.
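To see why the order of the product matters, here is a minimal base R sketch on a toy matrix. The counts and the term and document names are invented for illustration and do not come from the reuters corpus. With documents in rows and terms in columns, m %*% t(m) comes out documents by documents, while t(m) %*% m comes out terms by terms.

```r
# Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
# All values and names here are made up for illustration.
m <- matrix(c(1, 0, 2, 0,
              0, 1, 1, 0,
              1, 1, 0, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(Docs  = c("d1", "d2", "d3"),
                            Terms = c("oil", "price", "bank", "share")))

doc_doc   <- m %*% t(m)   # documents x documents (what the question computed)
term_term <- t(m) %*% m   # terms x terms (what the question wants)

dim(doc_doc)
dim(term_term)

# Each entry sums count(term a) * count(term b) over all documents.
term_term["oil", "bank"]
```

Checking dim() on the result is a quick way to confirm which of the two products you actually computed.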
I believe an approach as follows would work (note that you are creating a Boolean or maybe an adjacency matrix):
t(as.matrix(dtm)) %*% as.matrix(dtm)
For a big dtm you will run into R's limits using as.matrix. The Matrix package can help. Note that I switch i and j to do the transpose in the first matrix.
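library(tm)
library(Matrix)
data("acq")
dtm <- DocumentTermMatrix(acq, control=list(wordLengths=c(1,Inf)))
tdm <- t(dtm)
# dtm keeps documents in i and terms in j; swapping them gives terms x documents
Xt <- sparseMatrix(j=dtm$i, i=dtm$j, x=dtm$v)
# tdm keeps terms in i and documents in j; swapping them gives documents x terms
X <- sparseMatrix(j=tdm$i, i=tdm$j, x=tdm$v)
# terms x terms product, kept sparse
Xt %*% X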
# For easier viewing
(Xt %*% X) [1:20, 1:20]
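As a possible shortcut, the Matrix package's crossprod() computes t(X) %*% X directly, so only one sparse matrix needs to be built. The sketch below uses hypothetical triplets in the same i/j/v layout (document index, term index, count) instead of the acq corpus, so the values and names are assumptions for illustration only.

```r
library(Matrix)

# Hypothetical triplets laid out like dtm$i, dtm$j and dtm$v:
i <- c(1, 1, 2, 2, 3, 3, 3)   # document index of each non-zero entry
j <- c(1, 3, 2, 3, 1, 2, 4)   # term index of each non-zero entry
v <- c(1, 2, 1, 1, 1, 1, 3)   # count stored in that entry

# Documents x terms, built without switching i and j.
X <- sparseMatrix(i = i, j = j, x = v, dims = c(3, 4),
                  dimnames = list(c("d1", "d2", "d3"),
                                  c("oil", "price", "bank", "share")))

# crossprod(X) is t(X) %*% X, so the result is terms x terms and stays sparse.
tt <- crossprod(X)
tt
```

Passing dims explicitly keeps the shape right even if the last document or term happens to have no non-zero entries.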
TDM <- TermDocumentMatrix(x) # form a term-document matrix; x is your corpus
termDocMatrix <- as.matrix(TDM) # convert the TDM into a dense matrix
termDocMatrix[termDocMatrix>=1] <- 1 # turn the counts into a Boolean (0/1) matrix
# term adjacency matrix: entry [a, b] counts the documents containing both terms
termMatrix <- termDocMatrix %*% t(termDocMatrix)
termMatrix[1:10,1:10] # inspect terms numbered 1 to 10
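To show what the Boolean step changes, here is a small base R sketch on an invented term-document matrix (the terms, documents and counts are made up). After the counts are reduced to 0/1, each off-diagonal entry of the product is the number of documents in which both terms occur, and each diagonal entry is the number of documents containing that term.

```r
# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
# All values and names are invented for illustration.
tdm_toy <- matrix(c(1, 0, 1,
                    0, 1, 1,
                    2, 1, 0,
                    0, 0, 3),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(Terms = c("oil", "price", "bank", "share"),
                                  Docs  = c("d1", "d2", "d3")))

# Reduce counts to presence/absence.
bool <- tdm_toy
bool[bool >= 1] <- 1

# Term adjacency (co-occurrence) matrix: terms x terms.
adj <- bool %*% t(bool)

adj["oil", "bank"]    # documents containing both "oil" and "bank"
adj["bank", "share"]  # documents containing both "bank" and "share"
diag(adj)             # document frequency of each term
```

Skipping the Boolean step gives products of raw counts instead, which weights pairs by how often the terms repeat within a document.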