I am pulling data from database into R. I use the following commands to do this:
drv = dbDriver("MySQL")
con = dbConnect(drv,<credentials>)
dbSendQuery(con,"SET character_set_results = utf8;")
<code to pull data>
The data is stored in UTF-8 encoding in the database.I pull a dataframe with a column containing words. Once i pull the data, i convert the encoding to ASCII//TRANSLIT using iconv(x,"UTF-8","ASCII//TRANSLIT")
Everything is working fine except that for few words i see an extra character appearing after i change the encoding. For example when u look in database you see abc
and when u import, you get abc
. But once you change the encoding to ASCII//TRANSLIT this word changes to abc?
. I used https://www.branah.com/unicode-converter to check for encoding. I copied word abc in first box named unicode text and i see abc⬠in box named utf-8 text.What are these special characters and how to use them in regex to filter these out?