
I'm new to Stack Overflow and I've been doing my best to follow the guidelines. If there's something I've missed, however, please let me know.

Lately I've been playing around with text mining in R, something I'm still a novice at. I've been using the packages you can find in the code nested below. The problem occurs when the wordcloud displays the Swedish letters å, ä and ö: as you can see in the attached picture, the dots get positioned a bit weirdly.

[Wordcloud image showing the misplaced dots over å, ä and ö]

I've been trying my best to solve this by myself, but whatever I try, I can't seem to get it to work.

What I've tried to do:

  1. Use `Encoding(tweets) <- "UTF-8"` in an attempt to set the tweets to UTF-8
  2. Use `iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "")`
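
For reference, here is roughly what those two attempts look like applied to the text vector (a minimal sketch on my part; `tweets.text` is the character vector built in the code further down, and `charToRaw()` is just a way to inspect the stored bytes):

    #Sketch of the attempts above, applied to the tweet text vector.
    #Declare the strings as UTF-8 (changes the declared encoding, not the bytes):
    Encoding(tweets.text) <- "UTF-8"
    #Re-encode, dropping any bytes that aren't valid UTF-8:
    tweets.text <- iconv(tweets.text, from = "UTF-8", to = "UTF-8", sub = "")
    #Inspect the raw bytes of a suspect string to see what is actually stored:
    charToRaw(tweets.text[1])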

Furthermore, the last part of the code, after defining the corpus vector, was copied from the author of the tm package. He listed it as the solution after other people reported problems with the wordcloud function when a corpus is used as input. Without it I get an error message when trying to create the wordcloud.

    #Get and load necessary packages:
    install.packages("twitteR")
    install.packages("ROAuth")
    install.packages("wordcloud")
    install.packages("tm")
    library("tm")
    library("wordcloud")
    library("twitteR")
    library("ROAuth") 

    #Authentication:
    api_key <- "XXX"
    api_secret <- "XXX"
    access_token <- "XXX"
    access_token_secret <- "XXX"
    cred <- setup_twitter_oauth(api_key,api_secret,access_token,
                access_token_secret)

    #Extract tweets:
    search.string <- "#svpol"
    no.of.tweets <- 3200
    tweets <- searchTwitter(search.string, n=no.of.tweets, since = "2017-01-01")
    tweets.text <- sapply(tweets, function(x){x$getText()})

    #Remove tweets that start with "RT" (retweets).
    #Note: a regex word boundary is written "\\b"; "\b" is a backspace character:
    tweets.text <- gsub("^\\bRT", "", tweets.text)
    #Collapse runs of spaces and tabs into a single space
    #(inside [ ] the "|" is literal, and deleting the run would glue words together):
    tweets.text <- gsub("[ \t]{2,}", " ", tweets.text)
    #Remove usernames:
    tweets.text <- gsub("@\\w+", "", tweets.text)
    #Drop NA entries and replace newlines with spaces:
    tweets.text <- tweets.text[!is.na(tweets.text)]
    tweets.text <- gsub("\n", " ", tweets.text)
    #Remove links:
    tweets.text <- gsub("http[^[:space:]]*", "", tweets.text)
    #Remove stopwords:
    stopwords_swe <- c("är", "från", "än")
    #Just a short example above, the real one is very large
    tweets.text <- removeWords(tweets.text,stopwords_swe)

    #Create corpus:
    tweets.text.corpus <- Corpus(VectorSource(tweets.text))
    #See notes in the longer text about the corpus vector
    tweets.text.corpus <- tm_map(tweets.text.corpus,
                          content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')), mc.cores=1)
    tweets.text.corpus <- tm_map(tweets.text.corpus, content_transformer(tolower), mc.cores=1)
    tweets.text.corpus <- tm_map(tweets.text.corpus, removePunctuation, mc.cores=1)
    tweets.text.corpus <- tm_map(tweets.text.corpus, function(x)removeWords(x,stopwords(kind = "en")), mc.cores=1)

    #wordcloud() draws the plot as a side effect; no need to assign its result:
    wordcloud(tweets.text.corpus, min.freq = 10,
              max.words = 300, random.order = FALSE, rot.per = 0.35,
              colors = brewer.pal(8, "Set2"))

I'd be super happy to receive help with this!

  • Maybe [this post](http://stackoverflow.com/questions/16347731/how-to-change-the-locale-of-r-in-rstudio) can help. – Paulo MiraMor Jan 26 '17 at 00:07
  • My problem is not the language in the R console itself, it's about the characters in the output. As far as I understand, `Sys.setlocale()` only changes the language of the console, so I can't see why it would solve the issue...? (It's already in English, if that somehow matters.) Thank you anyway. – charlesos Jan 26 '17 at 19:27

1 Answer


Managed to solve it by first encoding the vector to UTF-8-MAC (since I'm on OSX), then using the gsub() function to manually change the hex codes for å, ä and ö (the letters I had problems with) to the actual letters, for example gsub("0xc3 0x85", "å", x) and gsub("0xc3 0xa5", "å", x) (the matching is case-sensitive).
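
A minimal sketch of that idea, with two assumptions worth flagging: the escape text that `iconv(..., sub = "byte")` inserts looks like `<c3><a5>` rather than the `0xc3 0xa5` notation above, and the helper name `fix_swedish` is mine, so inspect your own data and adjust the patterns accordingly:

    #Hypothetical sketch of the fix: re-encode, then map the escaped byte
    #sequences for å/ä/ö back to the real letters. Verify the exact escape
    #strings in your own data before relying on these patterns.
    fix_swedish <- function(x) {
      x <- iconv(x, to = "UTF-8-MAC", sub = "byte")
      x <- gsub("<c3><a5>", "å", x, fixed = TRUE)  #å
      x <- gsub("<c3><85>", "å", x, fixed = TRUE)  #Å (lowercased, as above)
      x <- gsub("<c3><a4>", "ä", x, fixed = TRUE)  #ä
      x <- gsub("<c3><84>", "ä", x, fixed = TRUE)  #Ä
      x <- gsub("<c3><b6>", "ö", x, fixed = TRUE)  #ö
      x <- gsub("<c3><96>", "ö", x, fixed = TRUE)  #Ö
      x
    }
    tweets.text <- fix_swedish(tweets.text)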

Lastly, I changed the target encoding in the tm_map() call from UTF-8-MAC to latin1. That did the trick for me; hopefully someone else will find this useful in the future.
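
In other words, the re-encoding line from the question ends up looking something like this (a sketch of the change described, not the answer's verbatim code):

    tweets.text.corpus <- tm_map(tweets.text.corpus,
                          content_transformer(function(x) iconv(x, to='latin1', sub='byte')), mc.cores=1)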
