7

I am looking for advice on which library and/or function I should use to convert international text to its English-character equivalent.

For example

Vous avez aimé l'épée offerte par les elfes à Frodon 

should be converted into

Vous avez aime l'epee offerte par les elfes a Frodon 
icedwater
Ωmega
  • @Janos - I am using `unidecode` now, but I got wrong results. For example `Etüde` is for some reason converted into `EtA1_4de` – Ωmega Jul 10 '13 at 03:19
  • I see what you mean. You should add that in your question. Btw I cannot reproduce your issue, unidecode does work for me. – janos Jul 10 '13 at 03:32
  • @Ωmega, You did `unidecode(encode_utf8("Et\N{LATIN SMALL LETTER U WITH DIAERESIS}de"))` instead of `unidecode("Et\N{LATIN SMALL LETTER U WITH DIAERESIS}de")`. – ikegami Jul 10 '13 at 04:45
  • Trying to remove accents is almost always the wrong thing to do. I guess you want to: [How to match string with diacritic in perl?](http://stackoverflow.com/q/7429964) – daxim Jul 10 '13 at 06:58
  • @ikegami - I don't use `encode_utf8`, but it may come already encoded. Should I decode it somehow before `unidecode` is used? – Ωmega Jul 10 '13 at 11:58
  • @Ωmega, and I'm sure you don't use `\N{}` either. You missed the point. You did the equivalent of those. Yes, you should. `unidecode` expects text, not UTF-8. – ikegami Jul 10 '13 at 13:14

1 Answer

16

First, decompose the characters using Unicode::Normalize; then use a simple regex to delete all the diacritical marks. (I think simply removing all the non-spacing mark characters should do it, but there might be an obscure exception or two.)

Here's an example:

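use strict;
use warnings;
use utf8;    # the source file is UTF-8, so the literal below is decoded text

use Unicode::Normalize;    # exports NFKD by default

my $test = "Vous avez aimé l'épée offerte par les elfes à Frodon";

# Decompose each accented character into a base letter plus combining marks
my $decomposed = NFKD( $test );
# Delete the combining (non-spacing) marks, leaving only the base letters
$decomposed =~ s/\p{NonspacingMark}//g;

print $decomposed, "\n";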

Output:

Vous avez aime l'epee offerte par les elfes a Frodon
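As the comments on the question point out, this only works on decoded text, not on raw UTF-8 bytes. Here is a minimal sketch of the same approach for input that arrives already encoded. It assumes the input really is UTF-8, and the helper name `strip_diacritics_from_utf8` is made up for illustration:

```perl
use strict;
use warnings;

use Encode qw(decode);
use Unicode::Normalize qw(NFKD);

# Hypothetical helper (not part of any module): takes UTF-8 encoded bytes
# and returns decoded text with the combining marks removed.
sub strip_diacritics_from_utf8 {
    my ($bytes) = @_;

    # Turn the raw bytes into Perl characters first (assumes UTF-8 input)
    my $text = decode('UTF-8', $bytes);

    # Split accented characters into base letter + combining marks
    my $decomposed = NFKD($text);

    # Drop the combining (non-spacing) marks
    $decomposed =~ s/\p{NonspacingMark}//g;

    return $decomposed;
}

# Simulate input that is still encoded, e.g. read from a file or socket
# without a decoding layer: "\xC3\xBC" is the UTF-8 encoding of "ü".
my $bytes = "Et\xC3\xBCde";

print strip_diacritics_from_utf8($bytes), "\n";
```

If your data is already decoded (for example, read through an `:encoding(UTF-8)` layer), skip the `decode` step, since decoding twice would cause its own problems. Also note that letters with no Unicode decomposition, such as `ø`, `ß` or `æ`, pass through this unchanged, so they would need separate handling.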
friedo