Remove accents from accented characters

Question

I am looking for advice on which library and/or function I should use to convert international text to its plain English-character equivalent.

For example,

Vous avez aimé l'épée offerte par les elfes à Frodon

should be converted into

Vous avez aime l'epee offerte par les elfes a Frodon

@Janos - I am using `unidecode` now, but I got wrong results. For example `Etüde` is for some reason converted into `EtA1_4de` — Ωmega, Jul 10 '13 at 03:19
I see what you mean. You should add that in your question. Btw I cannot reproduce your issue, unidecode does work for me. — janos, Jul 10 '13 at 03:32
@Ωmega, You did `unidecode(encode_utf8("Et\N{LATIN SMALL LETTER U WITH DIAERESIS}de"))` instead of `unidecode("Et\N{LATIN SMALL LETTER U WITH DIAERESIS}de")`. — ikegami, Jul 10 '13 at 04:45
Trying to remove accents is almost always the wrong thing to do. I guess you want to: [How to match string with diacritic in perl?](http://stackoverflow.com/q/7429964) — daxim, Jul 10 '13 at 06:58
@ikegami - I don't use `encode_utf8`, but it may come already encoded. Should I decode it somehow before `unidecode` is used? — Ωmega, Jul 10 '13 at 11:58
@Ωmega, and I'm sure you don't use `\N{}` either. You missed the point. You did the equivalent of those. Yes, you should. `unidecode` expects text, not UTF-8. — ikegami, Jul 10 '13 at 13:14

score 16 · Accepted Answer · answered Jul 10 '13 at 03:18

First, decompose the characters with Unicode::Normalize; then use a simple regex to delete the diacritical marks. (I think simply removing all the nonspacing mark characters should do it, but there might be an obscure exception or two.)

Here's an example:

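use strict;
use warnings;
use utf8;    # the source file itself contains UTF-8 literals

use Unicode::Normalize;

my $test = "Vous avez aimé l'épée offerte par les elfes à Frodon";

# Decompose each accented letter into a base letter plus combining marks
my $decomposed = NFKD( $test );
# Delete the combining (nonspacing) marks, leaving only the base letters
$decomposed =~ s/\p{NonspacingMark}//g;

print "$decomposed\n";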

Output:

Vous avez aime l'epee offerte par les elfes a Frodon
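
If your input arrives as raw UTF-8 bytes rather than decoded text (which, as noted in the comments, is the likely cause of the `Etüde` garbage), decode it first. Here's a minimal sketch, assuming the bytes are valid UTF-8; `strip_accents_from_bytes` is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 literals

use Encode qw(decode encode);
use Unicode::Normalize qw(NFKD);

# Hypothetical helper (not from the original answer): takes raw UTF-8
# bytes and returns a character string with combining marks removed.
sub strip_accents_from_bytes {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);        # bytes -> Perl character string
    my $decomposed = NFKD($text);              # split base letters from marks
    $decomposed =~ s/\p{NonspacingMark}//g;    # drop the combining marks
    return $decomposed;
}

# Simulate input that arrives already encoded, e.g. read from a file
# or socket without a decoding layer.
my $bytes = encode('UTF-8', "Etüde");

print strip_accents_from_bytes($bytes), "\n";
```

In real code you would more likely decode at the I/O boundary, for example by opening the file with an `:encoding(UTF-8)` layer, instead of calling `decode` by hand.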

answered Jul 10 '13 at 03:18 by friedo

`Etüde` >> `EtA1_4de` :-/ – Ωmega Jul 10 '13 at 03:25
Interesting, never thought about the `NonspacingMark`. The usual approach is the 'dictionary attack'... does this not presuppose that all such characters are composed though? Maybe I need to look at what `NFKD` does. – icedwater Jul 10 '13 at 03:25
@Ωmega, I don't get that result for Etüde. Are you `encode`ing your output? – friedo Jul 10 '13 at 03:37
@Ωmega, You changed something else too, because you passed the UTF-8 of `Etüde` rather than `Etüde`. – ikegami Jul 10 '13 at 03:40
I get an extra "%" at the end of the string. – frhack Apr 18 '17 at 09:54
