Replace accents from lists of foreign words

Question

Do you know of any Linux programs that remove accents from lists of foreign words (in UTF-8), such as Spanish, Czech, or French? For instance:

administrátoři (czech) administratori
français (french) francais
niñez (spanish) ninez etc.

I know I could do it manually with sed, but it's relatively time-consuming considering that I'm working on a lot of languages. I thought a program that could do just that might exist already.

score 2 · Accepted Answer · edited May 23 '17 at 10:24

What you want is called Unicode decomposition -- the reverse of Unicode composition (where a base character is combined with a diacritic). Once a character is decomposed into its base letter plus combining marks, the marks can be dropped. There are a number of related SO questions on this, which you can use as a starting point.

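Python's standard library has unicodedata.decomposition, which returns the decomposition mapping of a single character as a string of hexadecimal code points (empty if the character has no decomposition).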
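As a rough sketch of that approach (not a complete transliteration tool), the following uses unicodedata.normalize to decompose the text and then drops the combining marks. The function name strip_accents is only an illustrative choice, and the sample words are the ones from the question.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose text to NFD, then drop the combining marks (the diacritics)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ["administrátoři", "français", "niñez"]:
        print(strip_accents(word))
```

For the three sample words this prints administratori, francais and ninez. Letters that have no canonical decomposition, such as ø or ł, pass through unchanged, so some languages would still need an extra mapping table.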

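Your system probably also has iconv, and with suitable options (for example GNU iconv's //TRANSLIT suffix on the target encoding) it may get you there too, though the result depends on the locale.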

score 0 · Answer 2 · answered Feb 27 '15 at 14:19

Did you try recode (at https://github.com/pinard/Recode/)? It removes accents while trying hard to preserve information, and it can also produce xlat tables expressed in C.

$ cat testfile 
administrátoři (czech) administratori
français (french) francais
niñez (spanish) ninez etc.
$ LANG= recode -f UTF-8..texte <testfile 
administrtori (czech) administratori
franc,ais (french) francais
niez (spanish) ninez etc.
