Cannot obtain word associations, cluster dendrograms and k-means clustering in R

Question

I'm working with data in ".txt" format and trying to perform text mining with the 'tm' library in R. My problem is that I always get a document-term matrix with 0% sparsity, no matter how large or small the data is. I cannot obtain any word associations or a readable cluster dendrogram, and I get an error message when trying to produce a k-means cluster plot. This is the code I used:

cname = file.path("F:","texts") #folder containing text data files
dir(cname)
library(tm)   
docs <- Corpus(DirSource(cname))   
## Preprocessing
docs <- tm_map(docs, removePunctuation)   # Remove punctuation
docs <- tm_map(docs, removeNumbers)       # Remove numbers
docs <- tm_map(docs, content_transformer(tolower))   # Convert to lowercase; the wrapper keeps tm document objects intact
docs <- tm_map(docs, removeWords, stopwords("english"))   # Remove stopwords
library(SnowballC)
docs <- tm_map(docs, stemDocument)      # Stem words (strip common endings)
docs <- tm_map(docs, stripWhitespace)   # Collapse extra whitespace
# tm_map(docs, PlainTextDocument) is not needed once tolower is wrapped in content_transformer()
### Staging the Data      
dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs)   # build from the corpus, not from dtm
tdm     
freq <- colSums(as.matrix(dtm))   
#  removing sparse terms:   
dtms <- removeSparseTerms(dtm, 0.1)
# Word Frequency   
freq <- colSums(as.matrix(dtms))   
### Term Correlations
findAssocs(dtm, c("young","politics"), corlimit=0.8) 
### Hierarchical Clustering
dtms <- removeSparseTerms(dtm, 0.15)
library(cluster)
d <- dist(t(dtms), method="euclidean")   # distances between terms
fit <- hclust(d=d, method="ward.D")      # "ward" was renamed "ward.D" in R 3.1.0
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5)   # "k=" defines the number of clusters used   
rect.hclust(fit, k=5, border="red") 
### K-means clustering   
library(fpc)   
library(cluster)  
dtms <- removeSparseTerms(dtm, 0.15) 
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 2)   
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

This is the output I get when I inspect the document-term matrix:

<<DocumentTermMatrix (documents: 1, terms: 1850)>>
Non-/sparse entries: 1850/0
Sparsity           : 0%
Maximal term length: 23
Weighting          : term frequency (tf)
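For readers interpreting this summary: the sparsity figure is the share of zero cells in the matrix. The following is only a minimal base-R sketch with made-up counts (the matrices and the helper `sparsity_pct` are illustrative names, not part of 'tm' or of the code above). It shows why a one-row matrix built from its own vocabulary reports 0%:

```r
# Toy term counts: rows are documents, columns are terms (made-up numbers).
terms <- c("young", "polit", "vote")
one_doc <- matrix(c(3, 1, 2), nrow = 1,
                  dimnames = list("doc1", terms))
three_docs <- matrix(c(3, 1, 2,
                       0, 4, 0,
                       1, 0, 0), nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("doc", 1:3), terms))

# Hypothetical helper: sparsity as the percentage of zero cells.
sparsity_pct <- function(m) round(100 * sum(m == 0) / length(m))

sparsity_pct(one_doc)      # every listed term occurs in the only document
sparsity_pct(three_docs)   # zeros appear once documents differ in vocabulary
```

With a single document, a term can only enter the vocabulary by occurring in that document, so no cell can be zero.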

Error when trying to get the k-means cluster plot:

"Error in plot.window(...) : need finite 'xlim' values In addition: Warning message: In sqrt(detA * pmax(0, yl2 - y^2)) : NaNs produced"

The correlation output for any word is always empty, and I cannot make sense of the cluster dendrogram plot:

$young
numeric(0)

$politics
numeric(0)

I am also attaching the cluster dendrogram plot.
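As background for the empty association results above: a term-to-term correlation is computed across documents, so it needs counts from more than one document. Here is a small base-R sketch with made-up counts (all variable names are illustrative assumptions) comparing one observation per term with three:

```r
# Made-up counts of two terms; each vector element is one document.
young_one  <- 3
polit_one  <- 1                 # a single document
young_many <- c(3, 0, 1)
polit_many <- c(1, 4, 0)        # three documents

# With one observation there is no variation to correlate.
r_one  <- suppressWarnings(cor(young_one, polit_one))
# With several observations the correlation is defined.
r_many <- cor(young_many, polit_many)

r_one
r_many
```

Only the second call yields a finite coefficient, which is the kind of value that a threshold such as `corlimit=0.8` can be compared against.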

It would be easier to have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) ? — Vincent Bonhomme, Mar 27 '16 at 15:29
Is this code acceptable, or should I make it even shorter? This is for a project I'm working on, so any help is highly appreciated. Thanks. — SSG_NJ, Mar 28 '16 at 16:08
Okay I figured it out. It was because all these functions need multiple documents in the corpus to work properly. I was using a single file in the corpus. Thank God! Cheers! — SSG_NJ, Mar 29 '16 at 18:00

score 0 · Accepted Answer · answered Mar 30 '16 at 14:10

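Correlation and the other functions work only if the corpus contains multiple documents. My corpus had only one file, so the document-term matrix had a single row: every term occurred in that one document (hence 0% sparsity), and there was no variation across documents for the correlation and clustering functions to work with. Thanks anyway!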
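To see the difference in practice, here is a minimal sketch that assumes the 'tm' package is installed and uses four made-up sentences in place of separate .txt files (the `texts` vector and its contents are hypothetical). It builds a multi-document corpus and checks the document count before asking for associations:

```r
library(tm)

# Hypothetical toy texts standing in for several .txt files.
texts <- c("young voters follow politics closely",
           "politics and young people",
           "the weather was cold today",
           "cold weather and snow")
docs <- VCorpus(VectorSource(texts))

# Sanity check: associations need more than one document.
stopifnot(length(docs) > 1)

dtm <- DocumentTermMatrix(docs)
dtm   # the summary should now list several documents and a non-zero sparsity

# No stemming is applied in this sketch, so terms are queried as written.
assoc <- findAssocs(dtm, "young", corlimit = 0.8)
assoc
```

With `DirSource`, the same check (`length(docs)`) shows how many files were actually read from the folder, which is a quick way to spot a one-document corpus before any clustering is attempted.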

SSG_NJ
