I'm new to Stack Overflow and I've been doing my best to follow the guidelines. If there's something I've missed, however, please let me know.
Lately I've been playing around with text mining in R, something I'm a complete novice at. I've been using the packages you can find in the code nested below. However, a problem occurs when the wordcloud displays the Swedish letters å, ä and ö: as you can see in the attached picture, the dots get positioned strangely above the letters.
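To make this easier to reproduce without Twitter access, here is a minimal standalone sketch (the Swedish words and frequencies are made up, purely to isolate the å/ä/ö rendering) that draws the letters with the same wordcloud function as my full code below:
#Minimal sketch: draw a few Swedish words directly, no Twitter needed
#(made-up sample words and frequencies, only to isolate the å/ä/ö rendering)
library("wordcloud")
swe.words <- c("är", "från", "höst", "själv", "blå", "kött")
swe.freqs <- c(30, 25, 20, 15, 10, 5)
wordcloud(words = swe.words, freq = swe.freqs, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(6, "Set2"))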
I've done my best to solve this on my own, but nothing I've tried seems to work.
What I've tried to do:
- Using Encoding(tweets) <- "UTF-8" in an attempt to set tweets to UTF-8
- Using iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "") (see the sketch right after this list)
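Neither attempt changed the result. For completeness, this is the kind of check I can run on the scraped text to see which encodings R actually reports (a sketch; tweets.text is built in the full code below):
#Sketch: inspect which encodings R reports for the scraped text
#(tweets.text comes from the full code below)
unique(Encoding(tweets.text))
head(tweets.text)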
Furthermore, the last part of the code, after the corpus is defined, was copied from the author of the tm package. He posted it as the solution after other people reported problems using the wordcloud function with a corpus as input. Without it I get an error message when trying to create the wordcloud.
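In case the conversion step itself matters here, this is a quick sketch for comparing the raw bytes of a Swedish letter before and after that UTF-8-MAC conversion ("år" is just a made-up example word, and this needs the same iconv support as the code below):
#Sketch: compare raw bytes before and after the UTF-8-MAC conversion
#("år" is just an example word)
x <- "år"
charToRaw(x)
charToRaw(iconv(x, to = "UTF-8-MAC"))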
#Get and load necessary packages:
install.packages("twitteR")
install.packages("ROAuth")
install.packages("wordcloud")
install.packages("tm")
library("tm")
library("wordcloud")
library("twitteR")
library("ROAuth")
#Authentication:
api_key <- "XXX"
api_secret <- "XXX"
access_token <- "XXX"
access_token_secret <- "XXX"
cred <- setup_twitter_oauth(api_key,api_secret,access_token,
access_token_secret)
#Extract tweets:
search.string <- "#svpol"
no.of.tweets <- 3200
tweets <- searchTwitter(search.string, n=no.of.tweets, since = "2017-01-01")
tweets.text <- sapply(tweets, function(x){x$getText()})
#Remove tweets that start with "RT" (retweets):
tweets.text <- gsub("^\\bRT", "", tweets.text) #\b must be written \\b in an R string
#Collapse tabs and runs of spaces into a single space:
tweets.text <- gsub("[ \t]{2,}", " ", tweets.text)
#Remove usernames:
tweets.text <- gsub("@\\w+", "", tweets.text)
tweets.text <- (tweets.text[!is.na(tweets.text)])
tweets.text <- gsub("\n", " ", tweets.text)
#Remove links:
tweets.text <- gsub("http[^[:space:]]*", "", tweets.text)
#Remove stopwords:
stopwords_swe <- c("är", "från", "än")
#Just a short example above; the real vector is much larger
tweets.text <- removeWords(tweets.text,stopwords_swe)
#Create corpus:
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
#See notes in the longer text about the corpus vector
tweets.text.corpus <- tm_map(tweets.text.corpus,
                             content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')),
                             mc.cores = 1)
tweets.text.corpus <- tm_map(tweets.text.corpus, content_transformer(tolower), mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, removePunctuation, mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, function(x)removeWords(x,stopwords(kind = "en")), mc.cores=1)
wordcloud(tweets.text.corpus, min.freq = 10,
          max.words = 300, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Set2"))
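And in case locale details are relevant for an encoding issue like this, here's what I can run to add session information (a sketch):
#Locale and session details, often relevant for encoding issues:
Sys.getlocale()
sessionInfo()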
I'd be super happy to receive any help with this!