I am running k-means clustering on Twitter data: I clean the tweets, build a corpus, create a document-term matrix (DTM), and then apply tf-idf weighting.
However, the DTM contains a few empty documents, and I want to remove them because k-means cannot run on empty documents.
Here is my code:
library(tm)
library(magrittr)  # provides the %>% pipe used below

# strip URLs out of a tweet
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
# lower-case the text and replace punctuation marks with spaces
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)
  x <- gsub("[:]+[ ]", " ", x)
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x <- gsub("[@]", " ", x)
  x <- gsub("[???]", " ", x)
  x <- gsub("[€]", " ", x)
  x
}
myStopwords <- c(stopwords('english'), "rt")  # also drop the retweet marker

# preprocessing
tweet_corpus <- Corpus(VectorSource(tweet_raw$text))
tweet_corpus_clean <- tweet_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, myStopwords) %>%
  tm_map(content_transformer(replacePunctuation)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(removeURL))

dtm <- DocumentTermMatrix(tweet_corpus_clean)
# tf-idf weighting
mat4 <- weightTfIdf(dtm)  # when I run this, two of the documents turn out to be empty
mat4 <- as.matrix(mat4)
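
What I have in mind is to drop the rows of the DTM whose term counts are all zero before the tf-idf step, roughly like the sketch below (this is only my attempt, not working code from my project; slam::row_sums comes with tm, and centers = 3 is just a placeholder so the sketch runs):

# sketch: drop documents that contain no terms, then weight and cluster
rowTotals <- slam::row_sums(dtm)      # term count of each document
dtm_nonempty <- dtm[rowTotals > 0, ]  # keep only non-empty documents

mat4 <- as.matrix(weightTfIdf(dtm_nonempty))
km <- kmeans(mat4, centers = 3)       # centers = 3 is only a placeholder

Using slam::row_sums avoids converting the sparse DTM to a dense matrix just to compute the totals.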