Regex for replacing à,Á,Ä etc. -> a, Õ,ò, etc. -> o

Question

The Western Latin character set contains characters such as À Á Â Ã Ä Å, which all share the same standard character 'a' as their 'radix'. The same happens with e, i, o, etc. Is there a regex for replacing these variants with their 'radix' characters?

This would be used, among other things, to create an SEO-friendly URL from a piece of text:

Example: La cena è pronta => La cena e pronta

Regexes are probably not the best tool for that. It would be easier to normalize to NFKD and then remove all non-spacing marks from the result. (But actually, what is it you want to _achieve_?) – hmakholm left over Monica, Aug 29 '11 at 19:51
@Daniel A. White: my question is pretty clear in its scope! I do not see what is unclear about it. – ʞᴉɯ, Aug 29 '11 at 20:01
@Henning Makholm: thanks, I did not know about NFKD; I will look into it now. – ʞᴉɯ, Aug 29 '11 at 20:02
@Valerio The question is *not* clear. You only say what you *think* you need. You do not say what you want to do. You may very well be wrong about what you think you need. – Tomalak, Aug 29 '11 at 20:06
@Tomalak I do not think I am so stupid as to not know what I want to do. As stated in the question, I want to replace À Á Â Ã Ä Å with a, È É Ê Ë with e, etc. – ʞᴉɯ, Aug 29 '11 at 20:11
I think that Henning Makholm has answered my question correctly. Thanks. If you post it as an answer, I will accept it. – ʞᴉɯ, Aug 29 '11 at 20:21
possible duplicate of [How do I remove diacritics (accents) from a string in .NET?](http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net) – BalusC, Aug 29 '11 at 20:23
Don't use regex. You ain't ever going to reliably cover all diacritics in a single regex. Use a normalizer. See the possible duplicate link. If you are interested in Java, see [this answer](http://stackoverflow.com/questions/3658991/how-to-translate-lorem-3-ipsum-dolor-sit-amet-into-seo-friendly-lorem-3-ipsum/3659154#3659154). – BalusC, Aug 29 '11 at 20:24
@BalusC Technically you can use Regex for Phase 2 :-) Yeah, it's probably overkill, but it's a few fewer lines of code :-) :-) – xanatos, Aug 29 '11 at 20:31

xanatos · Accepted Answer · 2011-08-30T07:54:59.233

2

Try this:

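using System.Text;                    // NormalizationForm
using System.Text.RegularExpressions; // Regex

string str = "La cena è pronta àèéìòùçæÀÈÉÌÒÙÇÆ";
str = str.Normalize(NormalizationForm.FormD); // Decompose accented letters into base letter + combining marks (or use NormalizationForm.FormKD)
str = Regex.Replace(str, @"\p{Mn}", "");      // Remove the non-spacing marks (Unicode category Mn)
// Result: La cena e pronta aeeioucæAEEIOUCÆ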

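But note that Æ remains Æ: it is a letter in its own right with no Unicode decomposition, so normalization leaves it unchanged.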
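For the URL use case from the question, the same two steps can be extended into a small slug helper. The following is only a sketch under stated assumptions: the names `SlugSketch` and `ToSlug` are hypothetical, the hand-written mapping for Æ/æ is an illustrative and incomplete list (other letters without a decomposition, such as Ø or ß, would need their own entries), and the final step follows the ASCII-only idea raised in the comments rather than anything in the original snippet.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class SlugSketch
{
    // Hypothetical helper (not part of the original answer): builds a
    // lowercase, hyphen-separated, ASCII-only slug from arbitrary text.
    public static string ToSlug(string text)
    {
        // Assumption: letters with no Unicode decomposition are mapped by hand.
        // Only Æ/æ are handled here; extend the list as needed.
        string s = text.Replace("Æ", "AE").Replace("æ", "ae");

        // Same two steps as above: decompose, then strip the combining marks.
        s = s.Normalize(NormalizationForm.FormD);
        s = Regex.Replace(s, @"\p{Mn}", "");

        // Keep only ASCII letters and digits; every other run of characters
        // (spaces, punctuation, unmapped letters) collapses to one hyphen.
        s = s.ToLowerInvariant();
        s = Regex.Replace(s, @"[^a-z0-9]+", "-");

        // Drop hyphens left over at either end.
        return s.Trim('-');
    }
}
```

With this sketch, `SlugSketch.ToSlug("La cena è pronta")` returns `la-cena-e-pronta`. Any character that is still non-ASCII after the mapping and normalization steps is turned into a separator rather than transliterated, which may or may not be what you want for a given language.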

edited Aug 30 '11 at 07:54

answered Aug 29 '11 at 20:26


For URL generation, it would probably be better to keep only the ASCII letters from the normalized string. – hmakholm left over Monica Aug 29 '11 at 20:45
Why FormD instead of the initially suggested FormKD? – ʞᴉɯ Aug 30 '11 at 07:43
@Valerio True... We are already taking away the marks... We could use KD. The difference between the two is that `FormD` preserves formatting information, while `FormKD` doesn't. But, for example, in http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net/249126#249126 they use `FormD` – xanatos Aug 30 '11 at 07:51
