
I have some strings in R in UTF-8 encoding that contain accents. E.g. string="Hølmer" or string="Elizalde-González"

Is there any nice function in R to replace the accented characters in these strings by their unaccented counterpart? I saw some solutions in PHP here, but how do I do this in R?

E.g. the PHP code

$unwanted_array = array(    'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                            'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                            'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                            'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                            'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );

seemed quite nice - but how would I do this in R?

Martijn Pieters
Tom Wenseleers

2 Answers

65

The approaches below are basically taken from other threads. The key is getting your unwanted_array into the right format. You might want it as a list:

unwanted_array = list(    'Š'='S', 'š'='s', 'Ž'='Z', 'ž'='z', 'À'='A', 'Á'='A', 'Â'='A', 'Ã'='A', 'Ä'='A', 'Å'='A', 'Æ'='A', 'Ç'='C', 'È'='E', 'É'='E',
                            'Ê'='E', 'Ë'='E', 'Ì'='I', 'Í'='I', 'Î'='I', 'Ï'='I', 'Ñ'='N', 'Ò'='O', 'Ó'='O', 'Ô'='O', 'Õ'='O', 'Ö'='O', 'Ø'='O', 'Ù'='U',
                            'Ú'='U', 'Û'='U', 'Ü'='U', 'Ý'='Y', 'Þ'='B', 'ß'='Ss', 'à'='a', 'á'='a', 'â'='a', 'ã'='a', 'ä'='a', 'å'='a', 'æ'='a', 'ç'='c',
                            'è'='e', 'é'='e', 'ê'='e', 'ë'='e', 'ì'='i', 'í'='i', 'î'='i', 'ï'='i', 'ð'='o', 'ñ'='n', 'ò'='o', 'ó'='o', 'ô'='o', 'õ'='o',
                            'ö'='o', 'ø'='o', 'ù'='u', 'ú'='u', 'û'='u', 'ý'='y', 'ý'='y', 'þ'='b', 'ÿ'='y' )

You can do this easily with iconv or chartr:

> iconv(string, to='ASCII//TRANSLIT')
[1] "Holmer"

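> # chartr maps character by character, so drop multi-character replacements such as 'ß'='Ss'
> one_to_one <- unlist(unwanted_array)
> one_to_one <- one_to_one[nchar(one_to_one) == 1]
> chartr(paste(names(one_to_one), collapse=''),
         paste(one_to_one, collapse=''),
         string)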
[1] "Holmer"
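
A small illustration of why the shape of the replacement table matters for chartr: it pairs the i-th character of `old` with the i-th character of `new`, so a two-character replacement such as 'ß'='Ss' shifts every pair after it by one. The strings below are made up for this sketch and written with \u escapes so the file encoding does not matter.

```r
# chartr() maps the i-th character of `old` to the i-th character of `new`.
old <- "\u00e1\u00e9\u00f8"   # "áéø"
new <- "aeo"
plain <- chartr(old, new, "H\u00f8lmer")
plain   # "Holmer"

# A two-character replacement for the first entry shifts the pairs that follow:
# "ß" -> "S", "á" -> "s", "é" -> "a".
shifted_old <- "\u00df\u00e1\u00e9"   # "ßáé"
shifted_new <- "Ssae"
wrong <- chartr(shifted_old, shifted_new, "\u00e9")
wrong   # "a", not "e"
```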

Otherwise you have to loop through all of the replacements, because mapply or similar would not account for characters already replaced by previous gsub operations:

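# the loop: apply each replacement in turn to the running result
out <- string
for(i in seq_along(unwanted_array))
    # fixed = TRUE treats each name as a literal string, not a regular expression
    out <- gsub(names(unwanted_array)[i], unwanted_array[[i]], out, fixed = TRUE)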

The result:

> out
[1] "Holmer"
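
The loop can also be wrapped in a small function. This is only a sketch: `remove_accents` and the three-entry `replacements` table are hypothetical names introduced here, not part of any package. Unlike chartr, this form copes with multi-character replacements such as "ß" to "ss".

```r
# Hypothetical helper: apply each literal replacement in turn.
remove_accents <- function(x, replacements) {
  for (i in seq_along(replacements)) {
    # fixed = TRUE matches the name as a literal string rather than a regex
    x <- gsub(names(replacements)[i], replacements[[i]], x, fixed = TRUE)
  }
  x
}

# Illustrative table only (ø, á, ß); extend it with the pairs you need.
replacements <- setNames(c("o", "a", "ss"),
                         c("\u00f8", "\u00e1", "\u00df"))

cleaned <- remove_accents(c("H\u00f8lmer", "Stra\u00dfe"), replacements)
cleaned   # "Holmer" "Strasse"
```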
Thomas
  • Ha many thanks! I hadn't seen the other threads - sorry for that... iconv(string, from="UTF-8", to="ASCII//TRANSLIT") did the trick for me! Thanks so much! – Tom Wenseleers Dec 10 '13 at 13:35
  • Note: unwanted_array above is missing 'ü'='u'. – Animik Apr 04 '18 at 15:10
  • Hi, this solution worked fine for me for quite some time, but now it has stopped working and I don't know why. I created a function in "cleanWord.r" that wraps @Thomas's code. I used to call it like cleanWord("ááb") and the result was "aab". Now I get an error saying that chartr's argument "old" is longer than "new". I printed both arguments of the chartr call, and they do differ: "new" is fine, but "old" has encoding problems. For instance, it starts like this: "ÀÃ\u0081". Any ideas on how to fix this? Thanks a lot – rt.l Aug 02 '20 at 10:59
  • The [solution](https://stackoverflow.com/a/38171652/3277050) on the linked question that uses stringi worked great for me and will be better for tidyverse folks since you probably already have the package – see24 Mar 07 '22 at 16:40
  • I agree with see24, `stringi::stri_trans_general(string, "Latin-ASCII")` works much better! – Lennart Aug 21 '23 at 15:19
  • Also see: https://github.com/tidyverse/stringr/issues/149#issuecomment-289151373 – Lennart Aug 21 '23 at 15:20
9

Another option is to use the gsubfn package:

library(gsubfn)
string="Hølmer"
gsubfn(paste(names(unwanted_array),collapse='|'), unwanted_array,string)
[1] "Holmer"
agstudy