I'm working with data in ".txt" format and trying to perform text mining using the 'tm' library in R. My problem is that I always get a document term matrix with 0% sparsity, regardless of whether the data set is large or small. I am unable to obtain any word associations or readable cluster dendrograms, and I get an error message when trying to obtain a K-means cluster plot to analyse my data. This is the code I used:
cname = file.path("F:","texts") #folder containing text data files
dir(cname)
library(tm)
docs <- Corpus(DirSource(cname))
## Preprocessing
docs <- tm_map(docs, removePunctuation) # Remove punctuation
docs <- tm_map(docs, removeNumbers) # Remove numbers
docs <- tm_map(docs, content_transformer(tolower)) # Convert to lowercase; content_transformer() keeps each item a tm text document
docs <- tm_map(docs, removeWords, stopwords("english")) # Remove stopwords
library(SnowballC)
docs <- tm_map(docs, stemDocument) # Remove common word endings (stemming)
docs <- tm_map(docs, stripWhitespace) # Strip extra whitespace
docs <- tm_map(docs, PlainTextDocument)
### Staging the Data
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs) # build from the corpus, not from dtm
tdm
freq <- colSums(as.matrix(dtm))
# removing sparse terms:
dtms <- removeSparseTerms(dtm, 0.1)
# Word Frequency
freq <- colSums(as.matrix(dtms))
### Term Correlations
findAssocs(dtm, c("young","politics"), corlimit=0.8)
### Hierarchical Clustering
dtms <- removeSparseTerms(dtm, 0.15)
library(cluster)
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="ward.D") # "ward" was renamed "ward.D" in R >= 3.1.0
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5) # "k=" defines the number of clusters used
rect.hclust(fit, k=5, border="red")
### K-means clustering
library(fpc)
library(cluster)
dtms <- removeSparseTerms(dtm, 0.15)
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
This is the output I get when I inspect the document term matrix:
<<DocumentTermMatrix (documents: 1, terms: 1850)>>
Non-/sparse entries: 1850/0
Sparsity : 0%
Maximal term length: 23
Weighting : term frequency (tf)
This is the error I get when trying to produce the K-means cluster plot:
"Error in plot.window(...) : need finite 'xlim' values In addition: Warning message: In sqrt(detA * pmax(0, yl2 - y^2)) : NaNs produced"
I am unable to make sense of the cluster dendrogram plot, and the correlation output for any word is always empty:
$young
numeric(0)
$politics
numeric(0)
I am also attaching the cluster dendrogram plot.
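One thing that may be worth checking, given the "documents: 1" line in the output above, is how many documents the corpus actually holds. With a single document, every term necessarily occurs in that document, so the matrix cannot be sparse. The following is only a minimal sketch on two made-up strings: the texts and the names `texts`, `toy` and `toy_dtm` are placeholders for illustration, not the real data or code.

```r
library(tm)

# Two made-up documents (hypothetical placeholders, not the real files)
texts <- c(doc1 = "Young people follow politics closely.",
           doc2 = "Politics rarely interests the old.")

# Build an in-memory corpus with one document per string
toy <- VCorpus(VectorSource(texts))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removePunctuation)

toy_dtm <- DocumentTermMatrix(toy)

length(toy)   # number of documents in the corpus
dim(toy_dtm)  # rows = documents, columns = terms
```

Running `length(docs)` and `dim(dtm)` in the same way on the real corpus should show whether each file in the folder became its own document.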