
I want to replace accented letters with equivalent basic letters using regular expressions.

Example: Û --> U

I have seen solutions that search for ALL accented letters explicitly, but I'm looking for a better and more direct solution.

Someone suggested this in JavaScript (the `/gu` flags are JavaScript regex-literal syntax):

    /\p{Diacritic}/gu

or this in Python:

    import unicodedata

    def remove_diacritics(text):
        """
        Returns a string with all diacritics (aka non-spacing marks) removed.
        For example "Héllô" will become "Hello".
        Useful for comparing strings in an accent-insensitive fashion.
        """
        # Decompose each character into a base letter plus combining marks.
        normalized = unicodedata.normalize("NFKD", text)
        # Keep everything except non-spacing marks (Unicode category "Mn").
        return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

But I'm looking for a way to do this without going through normalization.

  • Does this not do what you want? https://stackoverflow.com/questions/35783135/regex-match-a-character-and-all-its-diacritic-variations-aka-accent-insensiti – lemonhead Sep 27 '21 at 23:38
  • Can you explain why normalization is not 'direct' enough? – Alex Hall Sep 27 '21 at 23:49
  • My question was how to do this with regular expressions. – Imene KOLLI Sep 27 '21 at 23:51
  • Can you explain why you want to do this with regular expressions? – Alex Hall Sep 27 '21 at 23:56
  • I just want to know if there is a way to do it with just a regular expression. – Imene KOLLI Sep 28 '21 at 00:05
  • I also tried the normalization way. Somehow the Û always stays Û. – Imene KOLLI Sep 28 '21 at 00:06
  • `remove_diacritics("Û")` gives me `'U'`. – Alex Hall Sep 28 '21 at 00:09
  • There was a mistake in my code. It works now so THANK YOU .. I appreciate your help – Imene KOLLI Sep 28 '21 at 00:14
  • The normalization is necessary to make the problem tractable. Without normalization, the same grapheme could be represented in multiple ways, and trying to handle it with regex would require handling *all* the different representations. Normalization simplifies to handling a specific representation, not all possible representations. I'll also note, the set of all diacritic marked characters is huge, so unless you're limiting yourself to just latin-1, an exhaustive regex gets ugly quickly. – ShadowRanger Sep 28 '21 at 00:20
  • I understand .. thanks – Imene KOLLI Sep 28 '21 at 00:31
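
The point made in the comments about multiple representations can be illustrated with a minimal sketch. It assumes that only the Combining Diacritical Marks block (U+0300 to U+036F) needs to be matched, which does not cover every combining mark in Unicode. The helper names `strip_marks_regex_only` and `strip_marks_normalized` are hypothetical and do not come from the question. The same regular expression is applied with and without a prior NFD normalization step:

```python
import re
import unicodedata

# Assumption: only the Combining Diacritical Marks block (U+0300 to U+036F)
# is targeted; combining marks in other blocks are ignored in this sketch.
COMBINING_MARKS = re.compile(r"[\u0300-\u036f]")


def strip_marks_regex_only(text: str) -> str:
    """Remove combining marks with a regex, without normalizing first."""
    return COMBINING_MARKS.sub("", text)


def strip_marks_normalized(text: str) -> str:
    """Decompose with NFD first, then remove combining marks with the same regex."""
    return COMBINING_MARKS.sub("", unicodedata.normalize("NFD", text))


precomposed = "\u00db"  # "Û" stored as a single code point
decomposed = "U\u0302"  # "U" followed by COMBINING CIRCUMFLEX ACCENT

print(strip_marks_regex_only(decomposed) == "U")   # True: the mark is a separate code point
print(strip_marks_regex_only(precomposed) == "U")  # False: there is no separate mark to match
print(strip_marks_normalized(precomposed) == "U")  # True: NFD splits the letter from its mark
```

The regex alone only helps when the input already happens to be in decomposed form. Handling precomposed characters without normalization would require an explicit mapping for each accented letter, which is the exhaustive approach the question hoped to avoid.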
