UTF-8 Character Encoding with TermDocumentMatrix

Question

I'm trying to learn R. I've been trying to solve this problem for hours. I've searched and tried lots of things to fix this but no luck so far. So here we go; I'm downloading some random tweets from twitter (via twitteR). I can see all special characters when i check my dataframe (like; üğıİşçÇöÖ). I'm removing some stuff (like whitespace etc.) After all removing and manipulating my corpus everything looks fine. Character encoding problem starts when i try to create TermDocumentMatrix. After that "tdm" and "df" has some weird symbols and maybe lost some characters?? Here is the code;

tweetsg.df <- twListToDF(tweets)
#looks good. no encoding problems.
wordCorpus <- Corpus(VectorSource(tweetsg.df$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus, control = list(tokenize="scan", 
wordLengths = c(3, Inf),language="Turkish"))
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

At this point both tdm and df has weird symbols and missing characters.

What i've tried so far;

Tried to use different tokenizers. Also a custom one.
Changed Sys.setLocale to my own language.
used enc2utf8
Changed my system (windows 10) display language to my own language

Still no luck though! Any kind of help or pointers accepted :) PS: Non-english speaker AND R newbie here. Also if we can solve this i think i have a problem with emojis too. I would like to remove or even better USE them :)

Do you have some sample tweets? Even a few lines may be useful. — Jeremy Voisey, Apr 17 '17 at 17:56
Of course! But since encoding problems starts with TermDocumentMatrix, i can give you single words with errors; 100yÄ = Should be 100 yıl or 100yıl deÄŸiÅŸtirdik = Should be "değiştirdik" hayÄ = Should be "hayır" i guess. I suspect that this is one of the words which lost its last character "r" in this case too. — Mesut Aslan, Apr 17 '17 at 18:00
http://stackoverflow.com/questions/21251736/r-tm-package-utf-8-text — Jeremy Voisey, Apr 17 '17 at 18:41
Yeah i've tried my own tokenizer (which is suggested and accepted answer) it didn't changed anything for me sadly. — Mesut Aslan, Apr 17 '17 at 18:48
https://stackoverflow.com/a/51847835 https://stackoverflow.com/questions/51269083/error-in-nchartermsx-type-chars-invalid-multibyte-string-element-204 try this: Sys.setlocale('LC_ALL','C') — oguz, Dec 12 '20 at 12:01
https://stackoverflow.com/questions/51269083/error-in-nchartermsx-type-chars-invalid-multibyte-string-element-204 try this: Sys.setlocale('LC_ALL','C') # bu çözüyor — oguz, Dec 12 '20 at 12:02

score 1 · Accepted Answer · edited May 23 '17 at 12:32

I've managed to duplicate your issue, and make changes to get Turkish output. Try changing the line

wordCorpus <- Corpus(VectorSource(tweetsg.df$text))

to

wordCorpus <- Corpus(DataframeSource(data.frame(tweetsg.df$text)))

and adding a line similar to this.

Encoding(tweetsg.df$text)  <- "UTF-8"

The code I got to work was

library(tm)
sampleTurkish <- "değiştirdik değiştirdik değiştirdik"
Encoding(sampleTurkish)  <- "UTF-8"
#looks good. no encoding problems.
wordCorpus <- Corpus(DataframeSource(data.frame(sampleTurkish)))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
#wordCorpus looks fine at this point.
tdm <- TermDocumentMatrix(wordCorpus)
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 1)
df <- data.frame(term = names(term.freq), freq = term.freq)

print(findFreqTerms(tdm, lowfreq=2))

This only worked with a source command from the console. i.e. clicking on run or source button in RStudio didn't work. I also made sure I chose "Save with Encoding" "UTF-8" (although this is probably only necessary because I have turkish text)

> source("Turkish.R")
[1] "değiştirdik"

It was the second answer R tm package: utf-8 text that was useful in the end.

Thank you! I've tried the second answer too, but couldn't get it to work with corpus or dataframes. This solved my problem! Thanks again! — Mesut Aslan, Apr 17 '17 at 19:23

score 0 · Answer 2 · answered Mar 12 '18 at 18:01

I have a string vector with UTF-8 encoding from a postgreSQL database that throws the same errror, but none of the suggested solutions worked (see below for details). So my solution was to simply convert from UTF-8 to latin1 with the iconv function. Then I could create the Corpus with the normal VectorSource function.

# text: loaded from PostgreSQL in UTF-8
# convert to latin1
text <- iconv(text, "UTF-8", "latin1")

wordCorpus <- Corpus(VectorSource(text))

Maybe that can be helpful for somebody else.

Solutions that did not work for me: First I followed Jeremy's answer and changed from VectorSource to DataframeSource and the encoding to UTF-8, but then I got a new error:

Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

I found this thread (Error faced while using TM package's VCorpus in R), but the provided answers to create a data.frame by hand for the new version of the tm package did not work neither.

UTF-8 Character Encoding with TermDocumentMatrix

What i've tried so far;

2 Answers2

Linked