0

I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.

con <- dbConnect(MySQL(),
                 host = 'localhost',
                 user = 'root',
                 dbname='test',
                 rstudioapi::askForPassword('Database password')
)

and pull the names table from my MySQL db with

df <- dbReadTable(con, 'names')

My table looks something like

df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E", 
"BÜTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA, 
-2L))


df
        id                   name
1   373863 AARESTRUP, JAN CARO̸E
2 17731249    BÜTTNER, CHRISTIAN

Checking the encoding brings

Encoding(df$name)
[1] "unknown" "unknown"

Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.

df
        id                         name
1   373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249           BÜTTNER, CHRISTIAN

Now it gets strange as df$name[1] returns [1] "AARESTRUP, JAN CARO̸E" which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df.

First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.

Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)

CER
  • 854
  • 10
  • 22

1 Answers1

1

<U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use

apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')

on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.

I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.

df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.

The show method for character type objects wants to render the characters for you (most of the time).

AColeman
  • 485
  • 3
  • 8