
I have a term-document matrix (tdm) in R, created from a corpus of around 16,000 texts, and I'm trying to build a distance matrix from it. The computation never finishes (it has already been running for over 20 minutes), and I'm not sure how long it's supposed to take. I also tried computing the distance matrix from the document-term matrix format, but that doesn't finish either. Is there anything I can do to speed up the process? In my tdm, the rows are the text documents and the columns are the possible words, so each cell holds the count of a given word in a given document. This is what my code looks like:

library(tm)
library(slam)
library(dplyr)
library(XLConnect)
library(proxy)  # base dist() has no "cosine" method; proxy provides it

wb <- loadWorkbook("Descriptions.xlsx")
df <- readWorksheet(wb, sheet = 1)
docs <- Corpus(VectorSource(df$Long_Descriptions))
docs <- tm_map(docs, removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower), lazy = TRUE) %>%
  tm_map(removeWords, stopwords("english"), lazy = TRUE) %>%
  tm_map(stemDocument, language = "english", lazy = TRUE)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
# t(tdm) puts documents in rows so dist() compares documents pairwise
z <- as.matrix(dist(t(tdm), method = "cosine"))

(I know my code should be reproducible, but I'm not sure how I can share my data. The Excel document has one column entitled Long_Descriptions; example row values, separated by commas, are as follows: I like cats, I am a dog person, I have three bunnies, I am a cat person but I want a pet rabbit.)
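For a self-contained reproduction without the workbook, a hypothetical stand-in data frame built from the example rows above could replace the XLConnect loading step:

```r
# hypothetical stand-in for the Excel sheet, using the example rows above
df <- data.frame(
  Long_Descriptions = c("I like cats",
                        "I am a dog person",
                        "I have three bunnies",
                        "I am a cat person but I want a pet rabbit"),
  stringsAsFactors = FALSE
)
```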

Deb Martin

1 Answer


Cosine similarity is a simple dot product of two L2-normalized matrices, and cosine distance is just 1 minus that similarity. In your case it's even simpler: the product of the L2-normalized dtm with its own transpose. Here is a reproducible example using the Matrix and text2vec packages:

library(text2vec)
library(Matrix)

# cosine similarity: L2-normalize each row, then take all pairwise
# dot products in one sparse matrix multiplication
cosine <- function(m) {
  m_normalized <- m / sqrt(rowSums(m ^ 2))
  tcrossprod(m_normalized)
}

data("movie_review")
data = rep(movie_review$review, 3)
# iterators are consumed on use, so create a fresh one for each pass
it = itoken(data, tolower, word_tokenizer)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 5)
vectorizer = vocab_vectorizer(v)
it = itoken(data, tolower, word_tokenizer)
dtm = create_dtm(it, vectorizer)
dim(dtm)
dim(dtm)
# 15000 24548

system.time( dtm_cos <- cosine(dtm) )
# user  system elapsed 
# 41.914   6.963  50.761 
dim(dtm_cos)
# 15000 15000
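Note that `cosine()` above returns a similarity matrix; the distance matrix the question asks for is 1 minus it. A minimal toy sketch of the conversion, using a hypothetical 3-document sparse matrix (rows = documents, columns = terms):

```r
library(Matrix)

cosine <- function(m) {
  m_normalized <- m / sqrt(rowSums(m ^ 2))
  tcrossprod(m_normalized)
}

# toy example: documents 1 and 3 point in the same direction,
# document 2 is orthogonal to both
m <- Matrix(c(1, 0, 2,
              0, 1, 0,
              1, 0, 2), nrow = 3, byrow = TRUE, sparse = TRUE)
sim <- cosine(m)
d <- 1 - as.matrix(sim)  # cosine distance
# d[1, 3] is 0 (identical direction); d[1, 2] is 1 (orthogonal)
```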

EDIT: For tm package see this question: R: Calculate cosine distance from a term-document matrix with tm and proxy
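For completeness, a hedged sketch of applying the same trick directly to tm's output: a tm DocumentTermMatrix is a slam simple_triplet_matrix, whose `$i`/`$j`/`$v` slots can feed `Matrix::sparseMatrix()`; the row-normalized `tcrossprod()` then gives the similarity and 1 minus it the distance. The helper name `dtm_to_sparse` is my own, and the corpus below reuses the question's example sentences:

```r
library(tm)
library(Matrix)

# convert tm's simple_triplet_matrix into a Matrix sparse matrix
dtm_to_sparse <- function(dtm) {
  sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
               dims = dim(dtm), dimnames = dimnames(dtm))
}

docs <- Corpus(VectorSource(c("I like cats",
                              "I am a dog person",
                              "I have three bunnies",
                              "I am a cat person but I want a pet rabbit")))
dtm <- DocumentTermMatrix(docs)
m <- dtm_to_sparse(dtm)
sim <- tcrossprod(m / sqrt(rowSums(m ^ 2)))  # cosine similarity
dist_mat <- 1 - as.matrix(sim)               # cosine distance
```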

Dmitriy Selivanov