How to replace diacritics with basic letters using regular expressions?

Asked Sep 27 '21 at 23:33

Active Sep 28 '21 at 00:09

Viewed 141 times

I want to replace accented letters with equivalent basic letters using regular expressions.

Example: Û --> U

I have seen solutions that match every accented letter explicitly, but I'm looking for a better and more direct solution.

Someone suggested this in JavaScript:

/\p{Diacritic}/gu

or this in Python:

import unicodedata

def remove_diacritics(text):
    """
    Returns a string with all diacritics (aka non-spacing marks) removed.
    For example "Héllô" will become "Hello".
    Useful for comparing strings in an accent-insensitive fashion.
    """
    # Decompose each character into its base letter plus combining marks
    normalized = unicodedata.normalize("NFKD", text)
    # Drop the non-spacing marks (category "Mn"), keeping the base letters
    return "".join(c for c in normalized if unicodedata.category(c) != "Mn")

But I'm looking for a way to do this without having to go through normalization.
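For reference, here is a minimal sketch of how a regular expression can take over the stripping step. It still normalizes first. Python's built-in `re` module does not support `\p{...}` property escapes, so this sketch assumes that the Combining Diacritical Marks block (U+0300 to U+036F) is enough coverage. The name `strip_diacritics_regex` is only illustrative.

```python
import re
import unicodedata

# Assumption: only marks in the Combining Diacritical Marks block
# (U+0300 to U+036F) need to be removed.
COMBINING_MARKS = re.compile(r"[\u0300-\u036f]")

def strip_diacritics_regex(text):
    # Split each accented letter into base letter + combining mark(s)
    decomposed = unicodedata.normalize("NFD", text)
    # Let the regex delete the combining marks
    return COMBINING_MARKS.sub("", decomposed)

print(strip_diacritics_regex("Û"))
print(strip_diacritics_regex("Héllô"))
```

Marks outside that block (for example, those used by non-Latin scripts) would pass through untouched, which is why the category-based filter above is the more general choice.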

edited Sep 28 '21 at 00:09

eyllanesc

asked Sep 27 '21 at 23:33

Imene KOLLI

Does this not do what you want? https://stackoverflow.com/questions/35783135/regex-match-a-character-and-all-its-diacritic-variations-aka-accent-insensiti – lemonhead Sep 27 '21 at 23:38
Can you explain why normalization is not 'direct' enough? – Alex Hall Sep 27 '21 at 23:49
My question was how to do this with regular expressions. – Imene KOLLI Sep 27 '21 at 23:51
Can you explain why you want to do this with regular expressions? – Alex Hall Sep 27 '21 at 23:56
I just want to know if there is a way to do it with just regular expressions. – Imene KOLLI Sep 28 '21 at 00:05
I also tried the normalization way. Somehow the Û always stays Û. – Imene KOLLI Sep 28 '21 at 00:06
`remove_diacritics("Û")` gives me `'U'`. – Alex Hall Sep 28 '21 at 00:09
There was a mistake in my code. It works now, so THANK YOU. I appreciate your help. – Imene KOLLI Sep 28 '21 at 00:14
Normalization is necessary to make the problem tractable. Without normalization, the same grapheme could be represented in multiple ways, and trying to handle it with regex would require handling *all* the different representations. Normalization reduces the problem to handling one specific representation, not all possible representations. I'll also note that the set of all diacritic-marked characters is huge, so unless you're limiting yourself to just Latin-1, an exhaustive regex gets ugly quickly. – ShadowRanger Sep 28 '21 at 00:20
I understand. Thanks. – Imene KOLLI Sep 28 '21 at 00:31
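The point in the comments about multiple representations can be illustrated with a small sketch. The mapping table and the function `replace_without_normalizing` are hypothetical, covering only three precomposed letters. They stand in for any hand-written regex that skips normalization.

```python
import re
import unicodedata

precomposed = "\u00db"   # Û stored as a single code point
decomposed = "U\u0302"   # U followed by COMBINING CIRCUMFLEX ACCENT

# Hypothetical hand-written table: only a few precomposed letters
TABLE = {"\u00db": "U", "\u00e9": "e", "\u00f4": "o"}
PATTERN = re.compile("[" + "".join(TABLE) + "]")

def replace_without_normalizing(text):
    # Swap each listed precomposed letter for its base letter
    return PATTERN.sub(lambda m: TABLE[m.group()], text)

# The two strings render the same but are different code point sequences
print(precomposed == decomposed)                                # False
print(replace_without_normalizing(precomposed) == "U")          # True
# The decomposed form slips past the regex: the combining mark remains
print(replace_without_normalizing(decomposed) == "U")           # False
# Normalizing first brings both spellings to one representation
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A regex-only approach would need both the precomposed letters and every combining-mark sequence in its pattern, whereas normalizing first leaves a single form to handle.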

0 Answers