16

I'm writing a Perl program that works with documents, and many of them contain characters such as ä, ö, ü, é, etc. (both capital and lowercase). I'd like to replace them with their ASCII counterparts a, o, u, e, etc. How would I do that in Perl?

One solution I thought of is a hash whose keys are the umlaut and accent characters and whose values are the ASCII counterparts. That requires a list of all such characters, which I don't have. If I built the list myself, I'd certainly miss many, as I'm unfamiliar with all the characters that can carry umlauts, accents and other diacritics.

bodacydo
  • 3
    Trying to remove accents is almost always the wrong thing to do. I guess you want to: [How to match string with diacritic in perl?](http://stackoverflow.com/q/7429964) – daxim Jun 15 '12 at 21:17
  • 1
    If not: [How can I substitute Unicode characters with ASCII in Perl?](http://stackoverflow.com/q/2309215) [How can I change extended latin characters to their unaccented ASCII equivalents?](http://stackoverflow.com/q/450026) – daxim Jun 15 '12 at 21:20
  • I think the first paragraph of the Text::Unidecode module description defines the potential use cases well enough. It's not only about collation. – raina77ow Jun 15 '12 at 21:25
  • Thanks for all the answers. Text::Unidecode is exactly what I'm looking for! – bodacydo Jun 15 '12 at 21:30
  • This is going to fail on Greek. –  Jul 11 '21 at 16:47

4 Answers

28

As usual, if you have a problem that is almost certainly not unique to you, there's already a solution on CPAN. In this case it's called Text::Unidecode:

use warnings;
use strict;
use utf8;
use Text::Unidecode;
print unidecode('ä, ö, ü, é'); # will print 'a, o, u, e'
raina77ow
3

Text::Unidecode

See the many disclaimers, but it's probably just what you need if you just have Latin text with diacritics.
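
For the "just Latin text with diacritics" case, here is a rough sketch of a different technique that needs no hand-built character list. It does not use Text::Unidecode: it relies on the core Unicode::Normalize module to decompose each character and then drops the combining marks. The helper name strip_marks is made up for illustration. Characters without a canonical decomposition (ß, æ, ø) pass through unchanged, and non-Latin scripts stay non-ASCII, so treat this as a starting point only.

```perl
use strict;
use warnings;
use utf8;                          # the string literals below are UTF-8
use Unicode::Normalize qw(NFD);    # core module

# Hypothetical helper: decompose, then remove the combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "ä" becomes "a" followed by U+0308
    $decomposed =~ s/\p{Mn}//g;    # drop nonspacing marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks('ä, ö, ü, é'), "\n";   # prints: a, o, u, e
print strip_marks('ß æ ø'), "\n";        # unchanged: no canonical decomposition
```

The input has to be a decoded character string; raw UTF-8 bytes read from a file would need to be decoded first.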

mob
1

Use s/// (search and replace) instead of m// (match), and write the code point in braces so that all four hex digits are read:

e.g. $name =~ s/\x{00c0}/A/g;
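
A small sketch of why the braces matter: without them, \x reads at most two hex digits, so \x00c0 means the NUL character followed by the literal text "c0". The sample string and variable names are only for illustration.

```perl
use strict;
use warnings;

my $name = "\x{c0} la carte";   # "À la carte" as a character string

# Unbraced: matches U+0000 followed by "c0", which is not in the string.
(my $unbraced = $name) =~ s/\x00c0/A/g;

# Braced: matches the single character U+00C0.
(my $braced = $name) =~ s/\x{00c0}/A/g;

print $unbraced eq $name ? "unbraced: no change\n" : "unbraced: changed\n";
print "braced: $braced\n";      # prints: braced: A la carte
```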

GuZ
0

I wrote this subroutine and feed each word through it. This could be slow.

sub store_utf82_encoding{
##see file UTF8vowels.txt
#converts UTF8 Euro vowels to nearest English equivalent
#expects a decoded character string, not raw UTF-8 bytes

  my $name=$_[0];
  $name =~ s/\x{00c0}/A/g; # Agrave
  $name =~ s/\x{00c1}/A/g; # Aacute
  $name =~ s/\x{00c2}/A/g; # Acirc
  $name =~ s/\x{00c3}/A/g; # Atilde
  $name =~ s/\x{00c4}/A/g; # Auml
  $name =~ s/\x{00c5}/A/g; # Aring
  $name =~ s/\x{00c6}/AE/g; # AE
  $name =~ s/\x{00c7}/Ch/g; # Ccedilla
  $name =~ s/\x{00c8}/E/g; # Egrave
  $name =~ s/\x{00c9}/E/g; # Eacute
  $name =~ s/\x{00ca}/E/g; # Ecirc
  $name =~ s/\x{00cb}/E/g; # Euml
  $name =~ s/\x{00cc}/I/g; # Igrave
  $name =~ s/\x{00cd}/I/g; # Iacute
  $name =~ s/\x{00ce}/I/g; # Icirc
  $name =~ s/\x{00cf}/I/g; # Iuml
  $name =~ s/\x{00d0}/Th/g; # capital Eth
  $name =~ s/\x{00d1}/NY/g; # Ntilde
  $name =~ s/\x{00d2}/O/g; # Ograve
  $name =~ s/\x{00d3}/O/g; # Oacute
  $name =~ s/\x{00d4}/O/g; # Ocirc
  $name =~ s/\x{00d5}/O/g; # Otilde
  $name =~ s/\x{00d6}/O/g; # Ouml
  $name =~ s/\x{00d8}/O/g; # Ostroke
  $name =~ s/\x{00d9}/U/g; # Ugrave
  $name =~ s/\x{00da}/U/g; # Uacute
  $name =~ s/\x{00db}/U/g; # Ucirc
  $name =~ s/\x{00dc}/U/g; # Uuml
  $name =~ s/\x{00dd}/Y/g; # Yacute
  $name =~ s/\x{00de}/Th/g; # capital Thorn
  $name =~ s/\x{00df}/ss/g; # German sharp s (Eszett), a lowercase letter
  $name =~ s/\x{00e0}/a/g; # agrave
  $name =~ s/\x{00e1}/a/g; # aacute
  $name =~ s/\x{00e2}/a/g; # acirc
  $name =~ s/\x{00e3}/a/g; # atilde
  $name =~ s/\x{00e4}/a/g; # auml
  $name =~ s/\x{00e5}/a/g; # aring
  $name =~ s/\x{00e6}/ae/g; # ae
  $name =~ s/\x{00e7}/ch/g; # ccedilla
  $name =~ s/\x{00e8}/e/g; # egrave
  $name =~ s/\x{00e9}/e/g; # eacute
  $name =~ s/\x{00ea}/e/g; # ecirc
  $name =~ s/\x{00eb}/e/g; # euml
  $name =~ s/\x{00ec}/i/g; # igrave
  $name =~ s/\x{00ed}/i/g; # iacute
  $name =~ s/\x{00ee}/i/g; # icirc
  $name =~ s/\x{00ef}/i/g; # iuml
  $name =~ s/\x{00f0}/th/g; # lowercase eth
  $name =~ s/\x{00f1}/ny/g; # ntilde
  $name =~ s/\x{00f2}/o/g; # ograve
  $name =~ s/\x{00f3}/o/g; # oacute
  $name =~ s/\x{00f4}/o/g; # ocirc
  $name =~ s/\x{00f5}/o/g; # otilde
  $name =~ s/\x{00f6}/o/g; # ouml
  $name =~ s/\x{00f8}/o/g; # ostroke
  $name =~ s/\x{00f9}/u/g; # ugrave
  $name =~ s/\x{00fa}/u/g; # uacute
  $name =~ s/\x{00fb}/u/g; # ucirc
  $name =~ s/\x{00fc}/u/g; # uuml
  $name =~ s/\x{00fe}/th/g; # lowercase thorn
  $name =~ s/\x{00fd}/y/g; # yacute
  $name =~ s/\x{00ff}/y/g; # yuml

return $name;

} #endsub store_utf82_encoding
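
On the speed concern, one possible alternative is to keep the mapping in a hash and make a single pass over the string instead of one substitution per character. This is only a sketch: the names %ascii_for and to_ascii are made up for illustration, and the table lists just a handful of entries.

```perl
use strict;
use warnings;

# Hypothetical lookup table; only a few of the mappings are shown.
my %ascii_for = (
    "\x{c4}" => 'A',  "\x{e4}" => 'a',    # A/a with umlaut
    "\x{d6}" => 'O',  "\x{f6}" => 'o',    # O/o with umlaut
    "\x{dc}" => 'U',  "\x{fc}" => 'u',    # U/u with umlaut
    "\x{e9}" => 'e',                      # e with acute
    "\x{c6}" => 'AE', "\x{e6}" => 'ae',   # AE ligature
);

sub to_ascii {
    my ($name) = @_;
    # Visit each non-ASCII character once; leave unknown ones untouched.
    $name =~ s/([^\x00-\x7f])/$ascii_for{$1} \/\/ $1/ge;
    return $name;
}

print to_ascii("M\x{fc}ller, Caf\x{e9}"), "\n";   # prints: Muller, Cafe
```

Adding a character then means adding one hash entry rather than another substitution line. The defined-or operator used here needs Perl 5.10 or later.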