
Do you know of any Linux programs that remove accents from lists of foreign words (in UTF-8), for languages such as Spanish, Czech, and French? For instance:

administrátoři (Czech): administratori
français (French): francais
niñez (Spanish): ninez
etc.

I know I could do it manually with sed, but it's relatively time-consuming considering that I'm working on a lot of languages. I thought a program that could do just that might exist already.

Kara
bobylapointe

2 Answers


What you want is called Unicode decomposition, the reverse of Unicode composition (where a base character is combined with a diacritic). Once the text is decomposed, the combining marks can be dropped, leaving only the base characters. There are a number of related SO questions using:

  1. JavaScript
  2. ActionScript
  3. Python

which you can use as a starting point.

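Python's standard library has unicodedata.decomposition, which returns the decomposition mapping of a single character, and unicodedata.normalize, which decomposes whole strings (form "NFD").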
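As a rough illustration of the decomposition approach, here is a minimal Python sketch. The function name strip_accents is made up for this example and is not part of any library; only the unicodedata calls come from the standard library. It decomposes each string to NFD and drops the combining marks. Letters that have no canonical decomposition (for example ø or ł) pass through unchanged, so this is a starting point rather than a complete transliterator.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Decomposition mapping of a single character: base letter + combining tilde.
    print(unicodedata.decomposition("ñ"))  # 006E 0303

    for word in ("administrátoři", "français", "niñez"):
        print(word, "->", strip_accents(word))
```

To process a word list, the same function could be applied line by line to a file opened with encoding="utf-8".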

Your system probably also has iconv, and with suitable normalization or transliteration (for example a //TRANSLIT target, where the implementation supports it) it may get you there too.

Community
dirkgently

Have you tried recode (https://github.com/pinard/Recode/)? It removes accents while trying hard to preserve information, and it can also produce xlat tables expressed in C.

$ cat testfile 
administrátoři (czech) administratori
français (french) francais
niñez (spanish) ninez etc.
$ LANG= recode -f UTF-8..texte <testfile 
administrtori (czech) administratori
franc,ais (french) francais
niez (spanish) ninez etc.
natmaka