0

Seems like I opened another chapter of the encoding from hell book. I seek help with a problem I encounter when pulling and writing data from\to a MySQL data base with R. After a good amount of time I was able to write my data back but still don't understand what exactly is going on.

library(RMySQL)
library(dbplyr)

con <- dbConnect(MySQL(), 
         host = "localhost",
         user = "root",
         dbname="test",
         password = rstudioapi::askForPassword("Database password"))

address <- as_tibble(tbl(con, "address")) 

The pulled address dataframe looks like

address <- structure(list(address_id = c(1809463, 2213341, 2614879, 4536353
), street = c("5, RUE DU GRAND CORMORAN APPT. C15", "14, PLACE EGLISE", 
"1058 TENNESSEE", "38 ALLEE GERARD DE NERVAL"), city = c("31240 L A°NION", 

"85140 L AÂIE", "ELK GROVE VILLAGE A¨LLINOIS 60007", "F-69360 SAINT-
SYPHORIEN D AÂZON"
)), .Names = c("address_id", "street", "city"), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

You can see right away that there is some encoding issues in address$city so I run

address$city <-  iconv(address$city, from = "UTF-8", "windows-1252")

which seems to fix it as everything looks fine now but as soon as I want to write the file back to the MySQL I run into problems with the encoding again getting following error

dbWriteTable(con, value =address, name = "address_cleaned", overwrite=TRUE ,rownames = FALSE ) 

Error in .local(conn, statement, ...) : could not run statement: Invalid utf8 character string: '31240 L A'

What I do now fixes the problem but I don't really understand what is going on.

Encoding(address$city) <- 'UTF-8'
address$city <-  iconv(address$city, from =  "windows-1252","UTF-8")
address$city <-  iconv(address$city, from =  "latin1","UTF-8")

While this code works it seems more like a work around than a real solution. I'm sure it has to do with the encoding of the MySQL data as well as Windows as my OS but I wonder if there is a more elegant solution to this.

Additional info

dbGetQuery(con, "SHOW VARIABLES LIKE 'character_set_%';")
             Variable_name                                                         Value
1     character_set_client                                                          utf8
2 character_set_connection                                                          utf8
3   character_set_database                                                          utf8
4 character_set_filesystem                                                        binary
5    character_set_results                                                          utf8
6     character_set_server                                                          utf8
7     character_set_system                                                          utf8
8       character_sets_dir C:\\Program Files\\MySQL\\MySQL Server 5.7\\share\\charsets\\

and

Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Edit 1. hex

1809463 31240 L A°NION  3331323430204C2041C2B04E494F4E
2213341 85140 L AIE 3835313430204C2041C2904945
2614879 ELK GROVE VILLAGE A¨LLINOIS 60007   454C4B2047524F56452056494C4C4147452041C2A84C4C494E4F4953203630303037
4536353 F-69360 SAINT-SYPHORIEN D AZON  462D3639333630205341494E542D535950484F5249454E20442041C2905A4F4E
zx8754
  • 52,746
  • 12
  • 114
  • 209
CER
  • 854
  • 10
  • 22

1 Answers1

0

Do not use any conversion functions, it will probably make things worse.

¨ is Mojibake for ¨ and ° for °. Since I see A before each of those, I guess you are trying to enter an accented A by first typing the A, then the accent. However, your data entry tool is failing to combine those. What editor are you using?

(Yes, you have 'opened another chapter of the encoding from hell book' -- I have seen a lot of character set problems, but not this one until now.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Thanks for looking into it. What do you mean with what editor? Using MySQL Workbench and R. My guess is that the encoding in the DB is already messed up in a way – CER Mar 14 '18 at 23:05
  • What are the keystrok(es) for `Ä`? – Rick James Mar 15 '18 at 01:02
  • I have not done any of the data input so I can't say. I basically got the db as it is and try to clean it with R to use it for further analysis. While doing that I stumbled upon that encoding issue – CER Mar 15 '18 at 01:23
  • Then do `SELECT HEX(col) ...` so we can see what is actually there. – Rick James Mar 15 '18 at 01:32
  • 1
    Strange stuff. What should the text say? I see cities named `L'Union`, but I can't see how the apostrophe could morph into what you see. Ditto for `Illinois`. In any case, I am rather sure that the data was garbled before or when it was _inserted_. No changes to settings can get it out correctly. – Rick James Mar 15 '18 at 12:40
  • I'm afraid you might be right. I will look into it a bit more and return but appreciate your input so far – CER Mar 15 '18 at 15:50