Regular Expression To Anglicize String Characters?

Question

Is there a common regular expression that replaces all known special characters in non-English languages:

é, ô, ç, etc.

with English characters:

e, o, c, etc.

possible duplicate of http://stackoverflow.com/questions/930303/python-string-cleanup-manipulation-accented-characters/930316#930316 — Bobby, Nov 13 '10 at 18:40
Regular expressions describe regular languages. They don’t do anything else. — Gumbo, Nov 13 '10 at 18:44

tchrist · Answer 1 · 2010-11-13T22:27:40.037

¡⅁uoɹʍ puɐ ⅂IɅƎ

This cannot be done, and you should not want to do it! It’s offensive to the whole world, and it’s naïve to the point of ignorance to believe that façade rhymes with arcade, or that Cañon City, Colorado falls under canon law.

You could run the string through Unicode Normalization Form D and discard mark characters, but I am certainly not going to tell you how because it is evil and wrong. It is evil for reasons already outlined, and it is wrong because there are a zillion cases it doesn’t address at all.

Study Material

Here is what you need to read up on:

Unicode Normalization Forms - UAX #15: This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation. This annex also provides examples, additional specifications regarding normalization of Unicode text, and information about conformance testing for Unicode normalization forms.
Canonical Equivalence in Applications - UTN #5: This document describes methods and formats for efficient processing of text under canonical equivalence, as defined in UAX #15 Unicode Normalization Forms [UAX15].
Unicode Collation Algorithm - UTS #10: This report is the specification of the Unicode Collation Algorithm (UCA), which details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET) as the data specifying the default collation order for all Unicode characters.

You MUST learn how to compare strings in a way that makes sense, and mutilating them simply never makes any sense whatso [pəʇələp] ever.

You must never just compare unnormalized strings code point by code point, and if possible you need to take the language into account, since rules differ between them.
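
To make the point about unnormalized comparison concrete, here is a minimal sketch using the core Unicode::Normalize module. The variable names are illustrative only; the two strings are the precomposed and decomposed spellings of "é", which render identically but differ code point by code point.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Two spellings of "é" that look the same on screen:
my $composed   = "\x{E9}";      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# A raw comparison goes code point by code point and sees two different strings.
my $raw_equal = ( $composed eq $decomposed ) ? 1 : 0;

# Normalizing both sides to the same form first makes them compare equal.
my $nfc_equal = ( NFC($composed) eq NFC($decomposed) ) ? 1 : 0;
my $nfd_equal = ( NFD($composed) eq NFD($decomposed) ) ? 1 : 0;

print "raw: $raw_equal, NFC: $nfc_equal, NFD: $nfd_equal\n";
```

Normalization only settles canonical equivalence. It does not make "e" match "é" or handle language-specific rules, which is what collation is for.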

Practical Examples

No matter the programming language you’re using, it may also help you to read the documentation for Perl’s Unicode::Normalize, Unicode::Collate, and Unicode::Collate::Locale modules.

For example, to search for "MÜSS" in a text that has "muß" in it, you would do this:

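  use utf8;               # the string literals below are UTF-8
  use Unicode::Collate;

  # (normalization => undef) is REQUIRED for index().
  my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
  my $str = "Ich muß studieren Perl.";
  my $sub = "MÜSS";
  my $match;
  # index() returns the position and length of the first match at this level.
  if (my($pos,$len) = $Collator->index($str, $sub)) {
      $match = substr($str, $pos, $len);
  }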

That will put "muß" into $match.
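
The same level-1 collator can also compare whole strings rather than search inside them. The following is a small sketch, assuming the default collation table, in which the collator's eq method treats strings that differ only in accents or case as equal while leaving the stored text untouched. The sample strings are illustrative.

```perl
use strict;
use warnings;
use utf8;                      # the source below contains literal non-ASCII text
use Unicode::Collate;

# level => 1 compares base letters only, ignoring accents and case.
my $Collator = Unicode::Collate->new( level => 1 );

my $same_name = $Collator->eq( "Rene-Levesque", "René-Lévesque" ) ? 1 : 0;
my $same_city = $Collator->eq( "ZURICH", "Zürich" ) ? 1 : 0;
my $different = $Collator->eq( "Zürich", "Zagreb" ) ? 1 : 0;

print "$same_name $same_city $different\n";
```

Nothing is stripped from the data here: the original spelling stays intact, and only the comparison is made accent-insensitive.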

The Unicode::Collate::Locale module has support for tailoring to these locales:

 af                Afrikaans
 ar                Arabic
 az                Azerbaijani (Azeri)
 be                Belarusian
 bg                Bulgarian
 ca                Catalan
 cs                Czech
 cy                Welsh
 da                Danish
 de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
 eo                Esperanto
 es                Spanish
 es__traditional   Spanish ('ch' and 'll' as a grapheme)
 et                Estonian
 fi                Finnish
 fil               Filipino
 fo                Faroese
 fr                French
 ha                Hausa
 haw               Hawaiian
 hr                Croatian
 hu                Hungarian
 hy                Armenian
 ig                Igbo
 is                Icelandic
 ja                Japanese [1]
 kk                Kazakh
 kl                Kalaallisut
 ko                Korean [2]
 lt                Lithuanian
 lv                Latvian
 mk                Macedonian
 mt                Maltese
 nb                Norwegian Bokmal
 nn                Norwegian Nynorsk
 nso               Northern Sotho
 om                Oromo
 pl                Polish
 ro                Romanian
 ru                Russian
 se                Northern Sami
 sk                Slovak
 sl                Slovenian
 sq                Albanian
 sr                Serbian
 sv                Swedish
 sw                Swahili
 tn                Tswana
 to                Tonga
 tr                Turkish
 uk                Ukrainian
 vi                Vietnamese
 wo                Wolof
 yo                Yoruba
 zh                Chinese
 zh__big5han       Chinese (ideographs: big5 order)
 zh__gb2312han     Chinese (ideographs: GB-2312 order)
 zh__pinyin        Chinese (ideographs: pinyin order)
 zh__stroke        Chinese (ideographs: stroke order)
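
As a rough illustration of what tailoring changes, the sketch below sorts the same three words with the default collator and with the Swedish ('sv') tailoring from the list above. The word list is made up for the example; in Swedish, "ö" is a separate letter that sorts after "z", whereas the default order files it with "o".

```perl
use strict;
use warnings;
use utf8;                      # the word list contains literal non-ASCII text
use Unicode::Collate;
use Unicode::Collate::Locale;

my @words = ( "zebra", "öl", "apa" );

# Default (untailored) order: "ö" sorts alongside "o".
my $default       = Unicode::Collate->new;
my @default_order = $default->sort(@words);

# Swedish tailoring: "ö" comes after "z".
my $swedish       = Unicode::Collate::Locale->new( locale => 'sv' );
my @swedish_order = $swedish->sort(@words);

binmode STDOUT, ':encoding(UTF-8)';
print "default: @default_order\n";
print "sv:      @swedish_order\n";
```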

You have a choice: you can do this right, or you can not do it at all. No one will thank you if you do it wrong.

Doing it right means taking UAX#15 and UTS#10 into account.

Nothing less is acceptable in this day and age. It’s not the 1960s any more, you know!

"Rene-Levesque" not finding "René-Lévesque" in my french database is far more evil. — Chunky Chunk, Nov 13 '10 at 18:52
Then the problem is with your search, and you should instead be asking how to perform a search which is more intelligent about its character matching — Gareth, Nov 13 '10 at 19:04
@TheDark: Voilà, in my latest edit I have done just as you asked: given you information to use in a search method. ¡λoſ̣uƎ — tchrist, Nov 13 '10 at 22:30
"You have a choice: you can do this right, or you can not do it at all. No one will thank you if you do it wrong." That seems like an overly simplified bifurcation, and an exceedingly broad and inaccurate generalization. — WouldRatherBuildAMotor, Sep 24 '18 at 14:08
Situation: you are forced to match the *correct* version of a string against a vendor's database of incorrect addresses with, say, both "Zürich" and "Zurich" present in them. Also, this is the best available database for the problem at hand and you are timeboxed. Finally, the abuses are consistently of the sort that map characters with diacritics to the most similar-looking English ones ('60s-style), and you are incentivized to get as many matches correct (in a conceptual sense) as you can. --- While wrong, it's useful to have an "Anglicization mapping" in this case, as heinous as it is. — Salmonstrikes, Dec 08 '21 at 11:21
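
For the vendor-database situation described in the last comment, the approach the answer names but declines to show (Normalization Form D, then discard mark characters) can be sketched as follows. This is a lossy fallback under stated assumptions, not a recommendation: strip_marks is a hypothetical helper name, and the technique only affects characters that have a canonical decomposition, so letters such as "ß" pass through unchanged.

```perl
use strict;
use warnings;
use utf8;                      # the sample strings contain literal non-ASCII text
use Unicode::Normalize qw(NFD);

# strip_marks() is a hypothetical helper, not part of any module.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);     # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{M}//g;       # delete the combining marks
    return $decomposed;
}

my $zurich = strip_marks("Zürich");
my $name   = strip_marks("René-Lévesque");
my $muss   = strip_marks("muß");     # "ß" has no decomposition, so it survives

binmode STDOUT, ':encoding(UTF-8)';
print "$zurich | $name | $muss\n";
```

Where the data allows it, a level-1 collator comparison reaches the same matches without rewriting the text.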

score 3 · Accepted Answer · answered Nov 13 '10 at 18:43

No, there is no such regex. Note that with a regex you "describe" a specific piece of text.

A given regex implementation might let you do replacements, but each replacement usually substitutes a single fixed string for whatever matched: it does not map a to a', b to b', and so on.

Perhaps the language you're working with has a method in its API to perform this kind of replacement, but it won't be using regex.

score 0 · Answer 3 · answered Nov 13 '10 at 18:42

This task is what the iconv library is for. Find out how to use it in whichever language you're developing in.

Chances are your language already has a binding for it.

Gareth
