I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.
con <- dbConnect(MySQL(),
host = 'localhost',
user = 'root',
dbname='test',
rstudioapi::askForPassword('Database password')
)
and pull the names
table from my MySQL db with
df <- dbReadTable(con, 'names')
My table looks something like
df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E",
"BÜTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
-2L))
df
id name
1 373863 AARESTRUP, JAN CARO̸E
2 17731249 BÜTTNER, CHRISTIAN
Checking the encoding brings
Encoding(df$name)
[1] "unknown" "unknown"
Using Encoding(df$name) <- "UTF-8"
fixes the second name but turns parts of name one into a non UTF-8 character.
df
id name
1 373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249 BÜTTNER, CHRISTIAN
Now it gets strange as df$name[1]
returns
[1] "AARESTRUP, JAN CARO̸E"
which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df
.
First, is there a way to ensure the dbReadTable()
reads in automatically UTF-8
. I found link which does not seem to change anything in terms of my table encoding.
Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)