I am trying to create a dendrogram in R from an Excel sheet for use in text mining. The sheet has one large column, and each cell contains a string of text. I want the smallest branch of the dendrogram to represent an individual cell, but when I run my script I instead get a dendrogram of every word in the entire Excel file. How do I fix this?

library(tm)
library(stringi)
library(proxy)
Data <- read.csv(file.choose(),header=TRUE)
docs <- Corpus(VectorSource(Data))

docs[[1]]

docs1 <- tm_map(docs, PlainTextDocument)
docs2 <- tm_map(docs1, stripWhitespace)
docs3 <- tm_map(docs2, removeWords, stopwords("english"))
docs4 <- tm_map(docs3, removePunctuation)
docs5 <- tm_map(docs4, content_transformer(tolower))

docs5[[1]]

TermMatrix <- TermDocumentMatrix(docs5)
docsdissim <- dist(as.matrix(TermMatrix), method = "euclidean")
docsdissim2 <- as.matrix(docsdissim)
docsdissim2

h <- hclust(docsdissim, method = "ward.D2")
Brodinsky
  • You need to transpose the TDM before calculating the distance, because `dist()` compares rows and a TDM has terms as its rows. Either `t(TermMatrix)` or `TermMatrix <- DocumentTermMatrix(docs5)` should do the trick – emilliman5 Oct 20 '16 at 13:13
  • Also, if your corpus is large, I would look into calculating your distance matrix without coercing your DTM into a dense matrix; you can run out of memory really quickly. Also, the "cosine" distance is usually used to measure document similarity (my two cents as a fellow text miner) – emilliman5 Oct 20 '16 at 13:16
  • @emilliman5 Thanks for the help. I'm certain that did the trick. If you don't mind, could you elaborate on calculating without coercing? It looks like I have in fact run out of memory and I'm not sure how to accomplish what you've suggested as when I try the function without the matrix it throws up `Error in crossprod(x, y)/sqrt(crossprod(x) * crossprod(y)) non-conformable arrays` – Brodinsky Oct 20 '16 at 13:52
  • DTMs/TDMs are sparse matrices; that is, they store only the non-zero values to save memory. Coercion turns the sparse matrix into a dense matrix and fills in all of the zeros, producing n x m values. Check out this [SO post](http://stackoverflow.com/questions/5560218/computing-sparse-pairwise-distance-matrix-in-r) for insight on implementing your own distance calculation with sparse matrices. – emilliman5 Oct 20 '16 at 14:12
  • It looks like you tried calculating `dist(TermMatrix)`; that won't work because TermMatrix is not "an actual matrix." Try it as you had before: `docsdissim <- dist(as.matrix(TermMatrix), method = "euclidean")`. If you run out of memory, then you will need to look into a method that works natively with sparse matrices. – emilliman5 Oct 20 '16 at 14:16
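
To make the suggestions in the comments above concrete, here is a minimal sketch of a cosine distance between documents computed on a sparse matrix, so the term counts are never expanded into a dense n x m matrix. It is an illustration under assumptions, not the commenter's own method: it uses the Matrix package rather than tm, the toy matrix `dtm_sparse` and the helper `cosine_dist_sparse()` are made-up names, and "average" linkage is chosen here only because Ward linkage is usually paired with Euclidean distances. Rows are documents, which is why each leaf of the resulting dendrogram is one document rather than one word.

```r
library(Matrix)

# Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
# Assumption: a tm DocumentTermMatrix `dtm` could be converted along the lines of
#   sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
#                dims = dim(dtm), dimnames = dimnames(dtm))
dtm_sparse <- sparseMatrix(
  i = c(1, 1, 2, 2, 3),
  j = c(1, 2, 1, 2, 3),
  x = c(1, 2, 2, 4, 5),
  dims = c(3, 4),
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("apple", "pear", "plum", "fig"))
)

# Hypothetical helper: pairwise cosine distance between the rows of a
# sparse matrix. Assumes no document is empty (no all-zero rows).
cosine_dist_sparse <- function(m) {
  # Euclidean length of each document vector
  norms <- sqrt(rowSums(m^2))
  # Scale every row to unit length; the result stays sparse
  unit <- Diagonal(x = 1 / norms) %*% m
  # Documents x documents similarity; only this small result is made dense
  sim <- as.matrix(tcrossprod(unit))
  dimnames(sim) <- list(rownames(m), rownames(m))
  # Clamp tiny negative values caused by floating-point rounding
  as.dist(pmax(1 - sim, 0))
}

doc_dist <- cosine_dist_sparse(dtm_sparse)

# One leaf per document
h <- hclust(doc_dist, method = "average")
plot(h)
```

Note that the distance matrix itself is still documents x documents, so this only helps when the memory problem comes from the number of terms rather than the number of documents.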
