
I'm using the "RMySQL" library in R to load data from a local MySQL DB into R:

con <- dbConnect(MySQL(), user="root", password="****", dbname="twitterdata", host="localhost")
dataframe <- dbGetQuery(con, "SELECT id, plaintext, category FROM table")

When I inspect the dataframe, I see a lot of garbled characters, such as the slanted apostrophe (´), which shows up as ’.

After some research, I discovered that, according to this site, some special characters (including the slanted apostrophe) are not part of the ISO-8859-1 standard but are part of the Windows-1252 standard.
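For illustration, here is a minimal sketch of that difference in base R. It assumes the character in question is U+2019 (the right single quotation mark) and that the local iconv build knows the name "CP1252"; neither is confirmed by the data in this question.

```r
# Assumption: the "slanted apostrophe" is U+2019 (right single quotation mark).
x <- enc2utf8("\u2019")

# In UTF-8 the character takes three bytes.
utf8_bytes <- charToRaw(x)
print(utf8_bytes)

# Windows-1252 has a single byte for it; ISO-8859-1 has no printable
# character at that byte position.
cp1252_bytes <- charToRaw(iconv(x, from = "UTF-8", to = "CP1252"))
print(cp1252_bytes)

# Reading the three UTF-8 bytes as if they were Windows-1252 text
# produces three separate characters instead of one.
misread <- iconv(rawToChar(utf8_bytes), from = "CP1252", to = "UTF-8")
print(misread)
```

If the database sends UTF-8 bytes and R interprets them in a Windows-1252 locale, a three-character sequence like the last one is a typical symptom.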

When I run

Sys.getlocale("LC_CTYPE")

in R, it says:

"German_Austria.1252"

Doesn't that already say that I'm on the correct encoding?! In my DB (Default Charset: UTF-8), the apostrophe is stored correctly.

I also tried adding the parameter DBMSencoding="utf-8" to the dbConnect statement, but it had no effect.

When I run

Encoding(x)

in R (where x is the character vector - a sentence), the answer is

"unknown"

Does anybody know how to solve this issue?

Thanks a lot!

user944351
  • Does [this](http://stackoverflow.com/questions/30595862/r-encoding-utf-8-u0080-u009f/30596689#30596689) help? Afaik, a `'` should be part of iso-8859-x as well as utf-8. You probably need to encode it correctly within R. – lukeA Jun 03 '15 at 12:08
  • Yes, the ' is part of iso-8859-x, but not the ´ and the `. The interesting thing is that when I write the data to a file, it's shown correctly again. – user944351 Jun 03 '15 at 12:25
  • So, did you try `iconv(dataframe[, 1], from = "UTF-8", to = "latin1")` as suggested? It's hard to debug without having access to the actual data... – lukeA Jun 03 '15 at 13:48
  • Oh man! That works! I don't know how I could have overlooked that! Thanks a lot!!! – user944351 Jun 03 '15 at 14:23
  • Okay, doesn't work either. When I do that, some entries simply become "NA"... – user944351 Jun 03 '15 at 14:42
  • `NA` is the default for the `sub` parameter of `iconv`. See `?iconv`: _"If not NA it is used to replace any non-convertible bytes in the input. (This would normally be a single character, but can be more.) If "byte", the indication is "<xx>" with the hex code of the byte."_. However, as I said, without having the data, it's an endless guessing game. You should add an excerpt to your post: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. – lukeA Jun 03 '15 at 14:50
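
The `NA` behaviour discussed in the comments can be reproduced with a made-up string. This is only a sketch: the sample text is hypothetical, "CP1252" is assumed to be a name the local iconv build recognises, and the result for non-convertible input can differ between platforms and iconv implementations.

```r
# Hypothetical sample: "It's" written with U+2019 instead of a plain apostrophe.
x <- enc2utf8("It\u2019s")

# U+2019 has no ISO-8859-1 equivalent, so with the default sub = NA
# the whole element typically comes back as NA.
to_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
print(to_latin1)

# With sub = "byte" the non-convertible bytes are replaced by their hex
# codes, which shows where the conversion failed.
to_latin1_sub <- iconv(x, from = "UTF-8", to = "latin1", sub = "byte")
print(to_latin1_sub)

# Windows-1252 does contain the character, so this conversion can succeed.
to_cp1252 <- iconv(x, from = "UTF-8", to = "CP1252")
print(charToRaw(to_cp1252))
```

Setting `sub` does not repair the text; it only keeps the entry from turning into `NA` so the offending bytes can be inspected.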

1 Answer


Try this:

con <- dbConnect(MySQL(), user="root", password="****", dbname="twitterdata", host="localhost", encoding = "latin1")
Márcio Mocellin
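
A different approach sometimes used for this symptom, and not confirmed anywhere in this thread, is to leave the bytes alone and only tell R that they are UTF-8, since `Encoding(x)` reported "unknown". The sketch below assumes the query really returns UTF-8 bytes; `mark_utf8_columns` is a hypothetical helper, not part of RMySQL, and the data frame is a stand-in for a real query result.

```r
# Hypothetical helper (not part of RMySQL): declare every character column
# of a data frame to be UTF-8 without changing the underlying bytes.
mark_utf8_columns <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    Encoding(col) <- "UTF-8"
    col
  })
  df
}

# Stand-in for a query result: the UTF-8 bytes of "It's" with U+2019,
# created without an encoding mark, as they might arrive from the driver.
raw_text <- rawToChar(as.raw(c(0x49, 0x74, 0xe2, 0x80, 0x99, 0x73)))
dataframe <- data.frame(id = 1L, plaintext = raw_text, stringsAsFactors = FALSE)
print(Encoding(dataframe$plaintext))  # "unknown"

dataframe <- mark_utf8_columns(dataframe)
print(Encoding(dataframe$plaintext))  # "UTF-8"
```

With a live connection the helper would wrap the existing call, for example `mark_utf8_columns(dbGetQuery(con, "SELECT id, plaintext, category FROM table"))`. If the server is not actually sending UTF-8, marking the strings this way would mislabel them, so it is worth checking the raw bytes with `charToRaw()` first.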