
I am performing k-means clustering on Twitter data. I clean the tweets and create a corpus, then build the document-term matrix (DTM) and apply tf-idf weighting.

But my DTM has a few empty documents, which I want to remove because k-means can't run on empty documents.

Here is my code:

library(tm)        # Corpus, tm_map, DocumentTermMatrix, weightTfIdf, stopwords
library(magrittr)  # provides the %>% pipe used below

removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)  # strip URLs

replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)  # drop periods followed by a space
  x <- gsub("[:]+[ ]", " ", x)  # drop colons followed by a space
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x <- gsub("[@]", " ", x)
  x <- gsub("[???]", " ", x)
  x <- gsub("[€]", " ", x)
  x
}

myStopwords <- c(stopwords('english'), "rt")


#preprocessing: build the corpus and clean it
tweet_corpus <- Corpus(VectorSource(tweet_raw$text))
tweet_corpus_clean <- tweet_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, myStopwords) %>%
  tm_map(content_transformer(replacePunctuation)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(removeURL))

dtm <- DocumentTermMatrix(tweet_corpus_clean)

#tf-idf weighting
mat4 <- weightTfIdf(dtm)  # running this warns that 2 documents are empty
mat4 <- as.matrix(mat4)
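
To see which documents ended up empty, you can sum the remaining term counts per row; a minimal sketch, assuming the slam package (which tm depends on, so it should already be installed):

library(slam)
which(row_sums(dtm) == 0)  # row indices of the documents with no terms left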
— gaurav v

2 Answers

Obviously you can't do that with another tm_map, since tm_map transforms documents but never removes them.

But the tm package also has tm_filter, which you can use to filter out the empty documents.
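
A minimal sketch of that approach, assuming the tweet_corpus_clean object from the question: keep only the documents whose cleaned text still contains at least one non-whitespace character, then rebuild the DTM.

library(tm)

tweet_corpus_clean <- tm_filter(tweet_corpus_clean,
                                function(doc) any(nchar(trimws(content(doc))) > 0))

dtm <- DocumentTermMatrix(tweet_corpus_clean)  # no empty rows remain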

— Has QUIT--Anony-Mousse

If the problem is that some documents contain no words at all, you can drop those rows from the DTM directly:

rowSumDoc <- apply(dtm, 1, sum)  # total term count in each document
dtm2 <- dtm[rowSumDoc > 0, ]     # keep only documents with at least one term

Above, we first sum the term counts of each document, then subset dtm to the documents whose sum is positive, i.e. the ones that are not empty.
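
Note that apply coerces the DTM to a dense matrix, which can be slow for a large, sparse DTM. A sparse alternative, sketched with the slam package (a dependency of tm, so it should already be installed):

library(slam)
dtm2 <- dtm[row_sums(dtm) > 0, ]  # same result, without densifying the matrix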

— Santosh M.