I have a term-document matrix (tdm) in R (created from a corpus of around 16,000 texts) and I'm trying to create a distance matrix, but it never finishes and I'm not sure how long it's supposed to take (it's already been running for over 20 minutes). I also tried creating a distance matrix from the document-term matrix format, but that doesn't finish either. Is there anything I can do to speed up the process? In the tdm, the rows are terms and the columns are documents (which is why I transpose before calling dist), and the cells hold counts of each word per document. This is what my code looks like:
library(tm)
library(slam)
library(dplyr)
library(XLConnect)
wb <- loadWorkbook("Descriptions.xlsx")
df <- readWorksheet(wb, sheet=1)
docs <- Corpus(VectorSource(df$Long_Descriptions))
docs <- tm_map(docs, removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower), lazy = TRUE) %>%
  tm_map(removeWords, stopwords("english"), lazy = TRUE) %>%
  tm_map(stemDocument, language = "english", lazy = TRUE)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
library(proxy)  # needed for method = "cosine"; stats::dist has no cosine method
z <- as.matrix(dist(t(tdm), method = "cosine"))
(I know my code should be reproducible, but I'm not sure how I can share my data. The Excel file has one column entitled Long_Descriptions; example row values, separated here by commas, are as follows: I like cats, I am a dog person, I have three bunnies, I am a cat person but I want a pet rabbit.)
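In case it clarifies what I'm after, here is a small self-contained toy example (not my real data) of the doc-by-doc cosine distance matrix I want `z` to contain, computed with slam's sparse cross-product instead of `dist()` so the term-document matrix is never converted to a dense matrix; all the variable names here are just illustrative:

```r
library(slam)

# toy term-document matrix: 3 terms (rows) x 3 documents (columns)
tdm <- simple_triplet_matrix(
  i = c(1, 2, 1, 3, 2, 3),
  j = c(1, 1, 2, 2, 3, 3),
  v = c(2, 1, 1, 3, 4, 1),
  nrow = 3, ncol = 3
)

# doc-doc dot products, t(tdm) %*% tdm, computed on the sparse structure
dots  <- crossprod_simple_triplet_matrix(tdm)
norms <- sqrt(col_sums(tdm^2))        # Euclidean length of each document vector
cos_sim  <- dots / outer(norms, norms)
cos_dist <- 1 - cos_sim               # docs x docs cosine distance matrix
```

The result is symmetric with zeros on the diagonal, which is what I'd expect from a distance matrix.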