I am running k-means clustering on Twitter data: I clean the tweets, build a corpus, create a document-term matrix (DTM), and then apply tf-idf weighting.
However, the DTM contains a few empty documents, and I want to remove them because k-means cannot run on empty documents.
Here is my code:
library(tm)
library(magrittr)  # provides the %>% pipe used below

# strip URLs out of a tweet
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
# lower-case the text and replace punctuation marks with spaces
replacePunctuation <- function(x)
{
  x <- tolower(x)
  x <- gsub("[.]+[ ]", " ", x)
  x <- gsub("[:]+[ ]", " ", x)
  x <- gsub("[?]", " ", x)
  x <- gsub("[!]", " ", x)
  x <- gsub("[;]", " ", x)
  x <- gsub("[,]", " ", x)
  x <- gsub("[@]", " ", x)
  x <- gsub("[???]", " ", x)
  x <- gsub("[€]", " ", x)
  x
}
myStopwords <- c(stopwords('english'), "rt")  # also drop the retweet marker

# preprocessing
tweet_corpus <- Corpus(VectorSource(tweet_raw$text))
tweet_corpus_clean <- tweet_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, myStopwords) %>%
  tm_map(content_transformer(replacePunctuation)) %>%
  tm_map(stripWhitespace) %>%
  tm_map(content_transformer(removeURL))

dtm <- DocumentTermMatrix(tweet_corpus_clean)
# tf-idf weighting
mat4 <- weightTfIdf(dtm)  # when I run this, two of the documents turn out to be empty
mat4 <- as.matrix(mat4)
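
What I have in mind is to drop the rows of the DTM whose term counts are all zero before the tf-idf step, roughly like the sketch below (this is only my attempt, not working code from my project; slam::row_sums comes with tm, and centers = 3 is just a placeholder so the sketch runs):

# sketch: drop documents that contain no terms, then weight and cluster
rowTotals <- slam::row_sums(dtm)      # term count of each document
dtm_nonempty <- dtm[rowTotals > 0, ]  # keep only non-empty documents

mat4 <- as.matrix(weightTfIdf(dtm_nonempty))
km <- kmeans(mat4, centers = 3)       # centers = 3 is only a placeholder

Using slam::row_sums avoids converting the sparse DTM to a dense matrix just to compute the totals.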