I'm working with data in ".txt" format and trying to do text mining with the 'tm' library in R. My problem is that I always get a document-term matrix with 0% sparsity, regardless of whether the data set is large or small. I am unable to obtain any word associations or readable cluster dendrograms, and I get an error message when trying to obtain a k-means cluster plot to analyse my data. This is the code I used:

cname = file.path("F:","texts") #folder containing text data files
dir(cname)
library(tm)   
docs <- Corpus(DirSource(cname))   
## Preprocessing      
docs <- tm_map(docs, removePunctuation)   # Remove punctuation
docs <- tm_map(docs, removeNumbers)       # Remove numbers
# Wrap base-R tolower in content_transformer() so each document stays a tm
# text document; this makes a later PlainTextDocument conversion unnecessary
docs <- tm_map(docs, content_transformer(tolower))   # Convert to lowercase
docs <- tm_map(docs, removeWords, stopwords("english"))   # Remove stopwords
library(SnowballC)
docs <- tm_map(docs, stemDocument)   # Stem words (strip common word endings)
docs <- tm_map(docs, stripWhitespace)   # Strip extra whitespace
### Staging the Data      
dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs)   # build from the corpus, not from dtm
tdm     
freq <- colSums(as.matrix(dtm))   
#  removing sparse terms:   
dtms <- removeSparseTerms(dtm, 0.1)
# Word Frequency   
freq <- colSums(as.matrix(dtms))   
### Term Correlations
findAssocs(dtm, c("young","politics"), corlimit=0.8) 
### Hierarchical Clustering
dtms <- removeSparseTerms(dtm, 0.15)
library(cluster)
d <- dist(t(dtms), method="euclidean")   # the method name is spelled "euclidean"
fit <- hclust(d=d, method="ward.D")   # "ward" was renamed "ward.D" in R >= 3.1.0
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5)   # "k=" defines the number of clusters used   
rect.hclust(fit, k=5, border="red") 
### K-means clustering   
library(fpc)   
library(cluster)  
dtms <- removeSparseTerms(dtm, 0.15) 
d <- dist(t(dtms), method="euclidean")   # the method name is spelled "euclidean"
kfit <- kmeans(d, 2)   # 2 clusters
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

This is the output I get when I inspect the document-term matrix:

<<DocumentTermMatrix (documents: 1, terms: 1850)>>
Non-/sparse entries: 1850/0
Sparsity           : 0%
Maximal term length: 23
Weighting          : term frequency (tf)

Error when trying to get the K means cluster plot:

"Error in plot.window(...) : need finite 'xlim' values In addition: Warning message: In sqrt(detA * pmax(0, yl2 - y^2)) : NaNs produced"

The correlation output for any word is always empty, and I cannot make sense of the cluster dendrogram plot:

$young
numeric(0)

$politics
numeric(0)

I am also attaching the cluster dendrogram plot.

SSG_NJ
  • It would be easier to have a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). – Vincent Bonhomme Mar 27 '16 at 15:29
  • okay I'll update the code – SSG_NJ Mar 28 '16 at 12:17
  • Is this code acceptable, or should I make it even shorter? This is for a project I'm working on, so any help is highly appreciated. Thanks. – SSG_NJ Mar 28 '16 at 16:08
  • Okay I figured it out. It was because all these functions need multiple documents in the corpus to work properly. I was using a single file in the corpus. Thank God! Cheers! – SSG_NJ Mar 29 '16 at 18:00
  • Nice! You can answer your own question and/or close it. – Vincent Bonhomme Mar 29 '16 at 18:55

1 Answer

Correlation and the other functions work only if the corpus contains multiple documents. My corpus had only one file, so every term appeared in that single document (hence 0% sparsity) and there was nothing to correlate or cluster across documents. Thanks anyway!
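
To illustrate, here is a minimal sketch comparing a one-document corpus with a multi-document one. It assumes the 'tm' package is installed. The `texts` vector and the `sparsity()` helper are hypothetical, made up for this illustration, and are not part of the original data or of 'tm'.

```r
library(tm)

# Hypothetical toy texts, invented for illustration only
texts <- c(
  "young people follow politics",
  "young voters discuss politics online",
  "the weather was cold today",
  "cold weather keeps people indoors"
)

# Case 1: everything pasted into ONE document (like a folder with one file)
one_doc <- VCorpus(VectorSource(paste(texts, collapse = " ")))
dtm_one <- DocumentTermMatrix(one_doc)

# Case 2: each text is its own document
many_docs <- VCorpus(VectorSource(texts))
dtm_many  <- DocumentTermMatrix(many_docs)

# Hypothetical helper: share of zero cells in a document-term matrix
sparsity <- function(m) mean(as.matrix(m) == 0)

dim(dtm_one)        # 1 row: a single document
dim(dtm_many)       # 4 rows: one per document
sparsity(dtm_one)   # every term occurs in the only document, so no zero cells
sparsity(dtm_many)  # most terms are missing from most documents

# With one document each term has a single count, so there is nothing to
# correlate; with several documents, associations can be computed
findAssocs(dtm_one,  "young", corlimit = 0.8)
findAssocs(dtm_many, "young", corlimit = 0.8)
```

The same reasoning applies to the clustering steps: distances between terms are computed across documents, so a single row of counts gives `dist()`, `hclust()` and `kmeans()` very little to separate terms on.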

SSG_NJ