What is the optimal way to remove German (or French) accents from a vector of 16 million strings?

For example, convert 'Sjögren's syndrome' into 'Sjogren's syndrome'.

Converting each accented character into a single unaccented character is preferable to transliteration such as

ä => ae, ö => oe, ü => ue.

Using regular expressions, as in the line below, would be one option, but is there something better (e.g., an R package for this)?

gsub('ü','u',gsub('ö','o',"Sjögren's syndrome ( über) "))
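For a fixed set of one-to-one replacements, the nested gsub calls can be collapsed into a single base R `chartr` call, which maps each character in its first argument to the character at the same position in its second. This is only a sketch: the helper name `to_plain` and the character set it covers are assumptions for illustration, and any accent not listed is left unchanged. The accented letters are written as `\u` escapes so the snippet does not depend on the encoding of the source file.

```r
# Hypothetical helper: map the German umlauts to their base letters in one pass.
# "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc" is "äöüÄÖÜ"; old and new must have equal length.
to_plain <- function(x) {
  chartr("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc", "aouAOU", x)
}

x <- c("Sj\u00f6gren's syndrome ( \u00fcber) ", "K\u00f6ln")
out <- to_plain(x)
print(out)
# [1] "Sjogren's syndrome ( uber) " "Koln"
```

`chartr` is vectorized over `x`, so the whole vector can be passed in a single call.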

There are SO solutions for non-R platforms, but not a good one for R.

userJT
    See the answer to this post: http://stackoverflow.com/questions/23699271/force-character-vector-encoding-from-unknown-to-utf-8-in-r – Alex Ioannides Oct 16 '14 at 12:59

2 Answers

Use iconv to convert to ASCII with transliteration (if supported):

iconv(c("über","Sjögren's"),to="ASCII//TRANSLIT")
[1] "uber"      "Sjogren's"
James
    For accented characters, e.g. `é`, this will result in something that looks like `'e`. To strip the extra apostrophes, run this command over the output vector of the operation above: `out <- gsub("\\'", '', out)` – aaron Apr 27 '16 at 17:53
One of the linked answers suggests:

library(stringi)
stri_trans_general("Zażółć gęślą jaźń", "Latin-ASCII")

[1] "Zazolc gesla jazn"
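The same call works on a whole character vector, which is the case the question asks about. A minimal sketch on the question's own examples, assuming the stringi package is installed (the guard simply skips the example if it is not); the variable names `x` and `plain` are illustrative only:

```r
# Assumes stringi is installed; the block is skipped otherwise.
if (requireNamespace("stringi", quietly = TRUE)) {
  # "Sjögren's syndrome", "über", "café", written with \u escapes
  x <- c("Sj\u00f6gren's syndrome", "\u00fcber", "caf\u00e9")
  # "Latin-ASCII" strips diacritics character by character (ö -> o, not oe)
  plain <- stringi::stri_trans_general(x, "Latin-ASCII")
  print(plain)
  # [1] "Sjogren's syndrome" "uber"               "cafe"
}
```

The genuine apostrophe in "Sjogren's" is preserved, and no extra quote characters are introduced for `é`.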
userJT