The Western Latin character set contains characters such as À Á Â Ã Ä Å, which all share the same standard character 'a' as their 'radix'. The same applies to e, i, o, and so on. Is there a regex for replacing these variants with their 'radix' characters?

This would be used, among other things, to create an SEO-friendly URL from a piece of text:

Example: La cena è pronta => La cena e pronta

Keng
ʞᴉɯ
  • Regexes are probably not the best tool to use for that. It would be easier to normalize to NFKD and then remove all non-spacing modifiers from the result. (But actually, what is it you want to _achieve_?) – hmakholm left over Monica Aug 29 '11 at 19:51
  • @Daniel A. White: my question is pretty clear in its scope! I do not see what is unclear about it. – ʞᴉɯ Aug 29 '11 at 20:01
  • @Henning Makholm: thanks, I did not know about NFKD; I will look into it now. – ʞᴉɯ Aug 29 '11 at 20:02
  • Please show a clearer example. – Daniel A. White Aug 29 '11 at 20:02
  • @Valerio The question is *not* clear. You only say what you *think* you need. You do not say what you want to do. You may very well be wrong about what you think you need. – Tomalak Aug 29 '11 at 20:06
  • @Tomalak I do not think I am so stupid as to not know what I want to do. As stated in the question, I want to replace À Á Â Ã Ä Å with a, È É Ê Ë with e, etc. – ʞᴉɯ Aug 29 '11 at 20:11
  • I think that Henning Makholm has correctly replied to my question. Thanks. If you post it as an answer, I will vote for it as the accepted solution. – ʞᴉɯ Aug 29 '11 at 20:21
  • possible duplicate of [How do I remove diacritics (accents) from a string in .NET?](http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net) – BalusC Aug 29 '11 at 20:23
  • Don't use regex. You ain't ever going to reliably cover all diacritics in a single regex. Use a normalizer. See the possible duplicate link. If you are interested in Java, see [this answer](http://stackoverflow.com/questions/3658991/how-to-translate-lorem-3-ipsum-dolor-sit-amet-into-seo-friendly-lorem-3-ipsum/3659154#3659154). – BalusC Aug 29 '11 at 20:24
  • @BalusC Technically you can use Regex for Phase 2 :-) Yeah it's probably overkill, but it's a few fewer lines of code :-) :-) – xanatos Aug 29 '11 at 20:31

1 Answer

Try this:

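using System.Text;                     // NormalizationForm
using System.Text.RegularExpressions;  // Regex
string str = "La cena è pronta àèéìòùçæÀÈÉÌÒÙÇÆ";
// Decompose each accented letter into its base letter plus combining marks.
str = str.Normalize(NormalizationForm.FormD); // Or use NormalizationForm.FormKD
// Remove the combining marks (Unicode category Mn, non-spacing marks).
str = Regex.Replace(str, @"\p{Mn}", "");
// Result: La cena e pronta aeeioucæAEEIOUCÆ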

But note that Æ remains Æ, because it has no Unicode decomposition into a base letter plus a mark.
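
For the URL use case mentioned in the question, the same two steps could be extended into a small slug helper. This is only a rough sketch under a few assumptions: `SlugSketch` and `ToSlug` are hypothetical names, mapping Æ/æ to "ae" is one possible choice rather than a standard, and everything that is not an ASCII letter or digit is collapsed into a hyphen.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SlugSketch
{
    // Hypothetical helper for illustration; not part of any library.
    public static string ToSlug(string text)
    {
        // Decompose: "è" becomes "e" followed by U+0300 (combining grave accent).
        string decomposed = text.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Skip combining marks (the same category that \p{Mn} matches).
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Letters without a decomposition need an explicit mapping (assumed choice).
            if (c == 'Æ' || c == 'æ')
            {
                sb.Append("ae");
                continue;
            }

            char lower = char.ToLowerInvariant(c);
            if ((lower >= 'a' && lower <= 'z') || (lower >= '0' && lower <= '9'))
                sb.Append(lower);
            else if (sb.Length > 0 && sb[sb.Length - 1] != '-')
                sb.Append('-'); // collapse any other run of characters into one hyphen
        }

        // Drop a trailing hyphen left by trailing punctuation or whitespace.
        return sb.ToString().TrimEnd('-');
    }
}
```

With this sketch, `SlugSketch.ToSlug("La cena è pronta")` gives `la-cena-e-pronta`. Other letters that do not decompose (for example ø or ß) would be turned into hyphens here, so they would need their own mappings if you care about them.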

xanatos
  • For URL generation, it would probably be better to select specifically the ASCII letters in the normalized string. – hmakholm left over Monica Aug 29 '11 at 20:45
  • Why FormD instead of the initially suggested FormKD? – ʞᴉɯ Aug 30 '11 at 07:43
  • @Valerio True... We are already taking away the marks... We could use the KD. The difference between the two is that `FormD` preserves formatting information, `FormKD` doesn't. But for example in http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net/249126#249126 they use `FormD` – xanatos Aug 30 '11 at 07:51