0

I am pulling data from database into R. I use the following commands to do this:

drv = dbDriver("MySQL")
con = dbConnect(drv,<credentials>)
dbSendQuery(con,"SET character_set_results = utf8;")
<code to pull data>

The data is stored in UTF-8 encoding in the database.I pull a dataframe with a column containing words. Once i pull the data, i convert the encoding to ASCII//TRANSLIT using iconv(x,"UTF-8","ASCII//TRANSLIT") Everything is working fine except that for few words i see an extra character appearing after i change the encoding. For example when u look in database you see abc and when u import, you get abc. But once you change the encoding to ASCII//TRANSLIT this word changes to abc?. I used https://www.branah.com/unicode-converter to check for encoding. I copied word abc in first box named unicode text and i see abc⬠in box named utf-8 text.What are these special characters and how to use them in regex to filter these out?

Karanam Krishna
  • 365
  • 2
  • 16

1 Answers1

0

SET character_set_results = utf8 is probably not sufficient. Change to SET NAMES utf8mb4.

What do you mean by "pull the data"? Is it put into a database table? If so, please provide SHOW CREATE TABLE.

To investigate strange characters, do SELECT HEX(...) ... to see what is actually there. From that, we might be able to deduce what happened.

It looks like ⬠is part of the Mojibake for one of these. ⬀,⬁,⬂,⬃,⬄,⬅,⬆,⬇,⬈,⬉,⬊,⬋,⬌,⬍,⬎,⬏,⬐,⬑,⬒,⬓,⬔,⬕,⬖,⬗,⬘,⬙,⬚,⬛,⬜,⬝,⬞,⬟,⬠,⬡,⬢,⬣,⬤,⬥,⬦,⬧,⬨,⬩,⬪,⬫,⬬,⬭,⬮,⬯,⬰,⬱,⬲,⬳,⬴,⬵,⬶,⬷,⬸,⬹,⬺,⬻,⬼,⬽,⬾,⬿

â¬, when treated as latin1,        is hex E2AC
⬀ when treated as UTF-8 (utf8mb4), is hex E2AC80
⬁                                  is hex E2AC81,
etc

The causes of Mojibake are discussed here.

Instead of trying to filter them out, you should fix the code to preserve them.

Rick James
  • 135,179
  • 13
  • 127
  • 222