Is there a common regular expression that replaces all known special characters in non-English languages:
é, ô, ç, etc.
with English characters:
e, o, c, etc.
Is there a common regular expression that replaces all known special characters in non-English languages:
é, ô, ç, etc.
with English characters:
e, o, c, etc.
This cannot be done, and you should not want to do it! It’s offensive to the whole world, and it’s naïve to the point of ignorance to believe that façade rhymes with arcade, or that Cañon City, Colorado falls under canon law.
You could run the string through Unicode Normalization Form D and discard mark characters, but I am certainly not going to tell you how because it is evil and wrong. It is evil for reasons already outlined, and it is wrong because there are zillion cases it doesn’t address at all.
Here are what you need to read up on:
You MUST learn how to compare strings in a way that makes sense, and mutilating them simply never makes any sense whatso [pəʇələp] ever.
You must never just compare unnormalized strings code point by code point, and if possible you need to take the language into account, since rules differ between them.
No matter the programming language you’re using, it may also help you to read the documentation for Perl’s Unicode::Normalize, Unicode::Collate, and Unicode::Collate::Locale modules.
For example, to search for "MÜSS"
in a text that has "muß"
in it, you would do this:
my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );
# (normalization => undef) is REQUIRED.
my $str = "Ich muß studieren Perl.";
my $sub = "MÜSS";
my $match;
if (my($pos,$len) = $Collator->index($str, $sub)) {
$match = substr($str, $pos, $len);
}
That will put "muß"
into $match
.
The Unicode::Collate::Module
has support for tailoring to these locales:
af Afrikaans
ar Arabic
az Azerbaijani (Azeri)
be Belarusian
bg Bulgarian
ca Catalan
cs Czech
cy Welsh
da Danish
de__phonebook German (umlaut as 'ae', 'oe', 'ue')
eo Esperanto
es Spanish
es__traditional Spanish ('ch' and 'll' as a grapheme)
et Estonian
fi Finnish
fil Filipino
fo Faroese
fr French
ha Hausa
haw Hawaiian
hr Croatian
hu Hungarian
hy Armenian
ig Igbo
is Icelandic
ja Japanese [1]
kk Kazakh
kl Kalaallisut
ko Korean [2]
lt Lithuanian
lv Latvian
mk Macedonian
mt Maltese
nb Norwegian Bokmal
nn Norwegian Nynorsk
nso Northern Sotho
om Oromo
pl Polish
ro Romanian
ru Russian
se Northern Sami
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sv Swedish
sw Swahili
tn Tswana
to Tonga
tr Turkish
uk Ukrainian
vi Vietnamese
wo Wolof
yo Yoruba
zh Chinese
zh__big5han Chinese (ideographs: big5 order)
zh__gb2312han Chinese (ideographs: GB-2312 order)
zh__pinyin Chinese (ideographs: pinyin order)
zh__stroke Chinese (ideographs: stroke order)
You have a choice: you can do this right, or you can not do it at all. No one will thank you if you do it wrong.
Doing it right means taking UAX#15 and UTS#10 into account.
Nothing less is acceptable in this day and age. It’s not the 1960s any more, you know!
No, there is no such regex. Note that with a regex you "describe" a specific piece of text.
A certain regex implementation might provide the possibility to do replacements using regex, but these replacements are usually only performed by a single replacement: not replace a
with a'
and b
with b'
etc.
Perhaps the language you're working with has a method in its API to perform this kind of replacements, but it won't be using regex.