Replace accented characters in R with non-accented counterpart (UTF-8 encoding)

Question

I have some strings in R in UTF-8 encoding that contain accents. E.g. string="Hølmer" or string="Elizalde-González"

Is there a nice function in R to replace the accented characters in these strings with their unaccented counterparts? I saw some solutions in PHP here, but how do I do this in R?

E.g. the PHP code

$unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );

seemed quite nice - but how would I do this in R?

Thomas · Accepted Answer · 2015-05-15T20:49:58.553

65

The approaches below are largely taken from other threads. The key is getting your unwanted_array into the right format. You might want it as a list:

unwanted_array = list(    'Š'='S', 'š'='s', 'Ž'='Z', 'ž'='z', 'À'='A', 'Á'='A', 'Â'='A', 'Ã'='A', 'Ä'='A', 'Å'='A', 'Æ'='A', 'Ç'='C', 'È'='E', 'É'='E',
                            'Ê'='E', 'Ë'='E', 'Ì'='I', 'Í'='I', 'Î'='I', 'Ï'='I', 'Ñ'='N', 'Ò'='O', 'Ó'='O', 'Ô'='O', 'Õ'='O', 'Ö'='O', 'Ø'='O', 'Ù'='U',
                            'Ú'='U', 'Û'='U', 'Ü'='U', 'Ý'='Y', 'Þ'='B', 'ß'='Ss', 'à'='a', 'á'='a', 'â'='a', 'ã'='a', 'ä'='a', 'å'='a', 'æ'='a', 'ç'='c',
                            'è'='e', 'é'='e', 'ê'='e', 'ë'='e', 'ì'='i', 'í'='i', 'î'='i', 'ï'='i', 'ð'='o', 'ñ'='n', 'ò'='o', 'ó'='o', 'ô'='o', 'õ'='o',
                            'ö'='o', 'ø'='o', 'ù'='u', 'ú'='u', 'û'='u', 'ý'='y', 'ý'='y', 'þ'='b', 'ÿ'='y' )

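You can do this easily with iconv (which needs no lookup table, although what //TRANSLIT produces depends on the platform's iconv implementation) or with chartr: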

> iconv(string, to='ASCII//TRANSLIT')
[1] "Holmer"

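> # chartr() translates one character at a time, so drop entries such as
> # 'ß'='Ss' whose replacement is longer than one character
> one_to_one <- unlist(unwanted_array)
> one_to_one <- one_to_one[nchar(one_to_one) == 1]
> chartr(paste(names(one_to_one), collapse=''),
         paste(one_to_one, collapse=''),
         string)
[1] "Holmer"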

Otherwise you have to loop through all of the replacements, because mapply or similar wouldn't account for symbols already replaced by previous gsub operations:

# the loop: apply each replacement in turn to the running result
out <- string
for(i in seq_along(unwanted_array))
    out <- gsub(names(unwanted_array)[i], unwanted_array[[i]], out)

The result:

> out
[1] "Holmer"
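The chartr and loop approaches above can be combined into one helper. The following is only a sketch: the name remove_accents and the shortened five-entry table are made up for illustration and are not part of the original answer. The table is written with \u escapes so that it does not depend on the encoding the script file is saved in, and gsub is called with fixed = TRUE so that each pattern is matched literally.

```r
# Illustrative lookup table (hypothetical, deliberately short).
accent_from <- c("\u00e1", "\u00e9", "\u00f8", "\u00fc", "\u00df")  # á é ø ü ß
accent_to   <- c("a",      "e",      "o",      "u",      "ss")

# Hypothetical helper: strips accents from a character vector x.
remove_accents <- function(x, from = accent_from, to = accent_to) {
  stopifnot(length(from) == length(to))
  single <- nchar(to) == 1
  # One-to-one replacements can go through chartr() in a single pass.
  if (any(single)) {
    x <- chartr(paste(from[single], collapse = ""),
                paste(to[single], collapse = ""), x)
  }
  # Replacements longer than one character need gsub(); fixed = TRUE
  # treats the pattern as a literal string, not a regular expression.
  for (i in which(!single)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

remove_accents(c("H\u00f8lmer", "Elizalde-Gonz\u00e1lez", "Stra\u00dfe"))
```

In a UTF-8 session the last call should return "Holmer", "Elizalde-Gonzalez" and "Strasse". Extending the two vectors with the remaining entries of unwanted_array is left to the reader.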

edited May 15 '15 at 20:49

answered Dec 10 '13 at 13:28

Thomas



Ha many thanks! I hadn't seen the other threads - sorry for that... iconv(string, from="UTF-8", to="ASCII//TRANSLIT") did the trick for me! Thanks so much! – Tom Wenseleers Dec 10 '13 at 13:35
Note: unwanted_array above is missing 'ü'='u'. – Animik Apr 04 '18 at 15:10
Hi, this solution has been working fine for me for quite some time, but now it just stopped working, and I don't know why. I created a file called "cleanWord.r" where I included @Thomas's function. I used to call it like: cleanWord("ááb") and the result was "aab". Now, I'm getting an error saying that chartr's argument "old" is longer than "new". I printed both of these arguments from the chartr function, and they are indeed different. The "new" argument is fine, but the "old" has some encoding problems. For instance, it starts like this: "Ã€Ã\u0081". Any ideas on how to fix this? Thanks a lot – rt.l Aug 02 '20 at 10:59

The [solution](https://stackoverflow.com/a/38171652/3277050) on the linked question that uses stringi worked great for me and will be better for tidyverse folks since you probably already have the package – see24 Mar 07 '22 at 16:40
I agree with see24, `stringi::stri_trans_general(string, "Latin-ASCII")` works much better! – Lennart Aug 21 '23 at 15:19
Also see: https://github.com/tidyverse/stringr/issues/149#issuecomment-289151373 – Lennart Aug 21 '23 at 15:20
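For readers who want to try the stringi call mentioned in the comments above, a minimal sketch follows. It assumes the stringi package is installed; the input strings are the two examples from the question, written with \u escapes so the snippet does not depend on the file encoding.

```r
# Assumes stringi is available: install.packages("stringi")
library(stringi)

x <- c("H\u00f8lmer", "Elizalde-Gonz\u00e1lez")  # "Hølmer", "Elizalde-González"

# "Latin-ASCII" is an ICU transliterator id that maps Latin letters with
# diacritics (and letters such as ø) to plain ASCII equivalents.
stri_trans_general(x, "Latin-ASCII")
```

Unlike the lookup-table approaches, this needs no hand-maintained list of characters, at the cost of an extra package dependency.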

score 9 · Answer 2 · answered Dec 10 '13 at 13:47


Another option is to use the gsubfn package:

library(gsubfn)
string="Hølmer"
gsubfn(paste(names(unwanted_array),collapse='|'), unwanted_array,string)
[1] "Holmer"

answered Dec 10 '13 at 13:47

agstudy


Some performance issues with this solution – s_baldur Aug 16 '17 at 17:17
