Remove accent marks from characters while preserving other diacritics

Question

In a few Slavic languages, written in both Latin and Cyrillic scripts, rising and falling accent marks are used only for disambiguation in context (i.e. inconsistently), and only on vowels.

I would like Python code or a library to remove acute and grave accents from vowels, while preserving other diacritics.

For example:
жѝзнеспосо́бный -> жизнеспособный
сè се фаќа -> се се фаќа
kȕćica -> kućica

If it's any help, here is a complete list of all the actual (ie unaccented) letters in Cyrillic alphabets for Slavic languages, including those with diacritics:

абвгдежзиклмнпорстуфхцшєґіїёыіўщъьюяйјњљџђћз́с́ќѓѕ

Note:

їёыіўй are vowels that should keep their diacritics even when acute and grave accent marks are stripped away. But an extra accent on them is very rare or perhaps impossible, so we can ignore that case.
з́с́ќѓ are consonants, like Latin ćǵśź. They should keep their acute accent marks - they will not have any added for pronunciation or disambiguation purposes.
In the alphabets in which precise formal mappings are official, the Cyrillic equivalent of a Latin consonant with an acute accent will not necessarily have an acute accent. (Perhaps it is helpful.)
Double acute and double grave are a low priority.

Background reading on these characters:
https://en.wikipedia.org/wiki/I_with_grave_(Cyrillic)#East_Slavic_languages
https://en.wikipedia.org/wiki/Shtokavian#Accentuation
https://en.wikipedia.org/wiki/Pitch_accent#Serbo-Croatian
https://en.wikipedia.org/wiki/Bulgarian_alphabet#.D0.8D
https://en.wikipedia.org/wiki/Macedonian_alphabet#Accented_letters

Similar questions:
Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)
How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

I think the translate method in the linked question "How to remove accent in Python 3.5..." will be the simplest for your requirements. You just need to define what characters you want to translate to what others. — neil, Mar 11 '16 at 14:10
`str.translate` doesn't work for this, because `о́` is not/does not have a single code point. — Antti Haapala -- Слава Україні, May 13 '16 at 14:06
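
To see the point made in the comment above, here is a small check of how many code points each accented letter has after NFC normalization. The helper name `code_point_names` is made up for this illustration; only `unicodedata.normalize` and `unicodedata.name` come from the standard library.

```python
import unicodedata

def code_point_names(text):
    """Return the Unicode names of the code points in the NFC form of text."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFC", text)]

# и + combining grave composes into a single precomposed letter
print(code_point_names("\u0438\u0300"))
# ['CYRILLIC SMALL LETTER I WITH GRAVE']

# о + combining acute has no precomposed form, so it stays two code points
print(code_point_names("\u043e\u0301"))
# ['CYRILLIC SMALL LETTER O', 'COMBINING ACUTE ACCENT']
```

So a one-code-point-to-one-code-point table cannot treat о́ as a single unit, while ѝ can be mapped directly.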

score 3 · Answer 1 · answered Mar 11 '16 at 14:44

No library required if you can table the corresponding pairs.

>>> unaccentify = {
...    'ѝ': 'и',
...    'о́': 'о'
... }

I was going to suggest str.translate for this, but unfortunately it wouldn't work because there's no single code point for о́. So instead we first make sure that the left-hand characters are NFKC-normalized:

>>> import unicodedata
>>> unaccentify = {unicodedata.normalize('NFKC', i):j for i, j in unaccentify.items()}

Then we make a regex of all possible replaced letters:

>>> import re
>>> pattern = re.compile('|'.join(unaccentify))

Then use pattern.sub to do the replacement, looking up the unaccented character in the table. But first we need to normalize the source string:

>>> def replacer(match):
...     return unaccentify[match.group(0)]
...
>>> source = unicodedata.normalize('NFKC', 'жѝзнеспосо́бный')
>>> pattern.sub(replacer, source)
'жизнеспособный'
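
For convenience, the steps above can be collected into one function. This is only a sketch under a few assumptions: the names `TABLE`, `PATTERN` and `unaccent` are made up here, the table holds just the two example pairs, and `re.escape` plus longest-first ordering are extra precautions that the session above does not need for plain letters.

```python
import re
import unicodedata

# Example pairs only (an assumption); extend with every accented vowel needed.
TABLE = {
    "\u045d": "\u0438",        # ѝ -> и
    "\u043e\u0301": "\u043e",  # о + combining acute -> о
}
# Normalize the keys the same way the input will be normalized.
TABLE = {unicodedata.normalize("NFKC", k): v for k, v in TABLE.items()}

# Longest keys first so multi-code-point sequences are tried before shorter ones.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(TABLE, key=len, reverse=True))
)

def unaccent(text):
    """Replace every accented letter listed in TABLE with its plain form."""
    text = unicodedata.normalize("NFKC", text)
    return PATTERN.sub(lambda m: TABLE[m.group(0)], text)
```

Anything not listed in the table, such as ќ or ѓ, is left untouched, because the pattern only matches the listed keys.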

Good point, there are only 5 real vowels. Another idea is to convert to Latin (since vowels can be converted reliably) with an existing lib (there are many), then remove diacritics with an existing lib (there are many), then convert back to Cyrillic with the first lib. I was hoping that the fact that there's no single code point (ie that the accents are encoded separately) would make it easy. — Adam Bittlingmayer, Mar 11 '16 at 17:23
You *can* separate the accents from the letters by doing `NFKD` normalization, but it would be just more work that way, because `ѓ` would turn into `г` + COMBINING ACUTE ACCENT. — Antti Haapala -- Слава Україні, Mar 11 '16 at 17:51
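
The decomposition route mentioned in this comment can still be made to work if the accent is dropped only when it follows a vowel, and the string is recomposed afterwards. A rough sketch, not a tested library: `STRIP_MARKS`, `VOWELS` and `strip_vowel_accents` are hypothetical names, the vowel list is an assumption to adjust per language, and NFD is used rather than NFKD to avoid compatibility foldings.

```python
import unicodedata

# Combining grave, acute, double acute, double grave (the last two are optional).
STRIP_MARKS = {"\u0300", "\u0301", "\u030b", "\u030f"}
# Base letters treated as vowels; an assumption, extend or trim as needed.
VOWELS = set("аеиоуыэюяіє" "aeiou")

def strip_vowel_accents(text):
    """Drop the marks in STRIP_MARKS when they sit on a vowel; keep all else."""
    out = []
    base = ""  # most recent non-combining character seen
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            base = ch
        elif ch in STRIP_MARKS and base.lower() in VOWELS:
            continue  # accent on a vowel: skip it
        out.append(ch)
    # Recompose so that ќ, ѓ, й, ї etc. come back as single code points.
    return unicodedata.normalize("NFC", "".join(out))
```

With this, г + COMBINING ACUTE ACCENT is kept because г is not in the vowel set, and NFC turns it back into ѓ. Note that in some Latin-script languages a vowel with an acute is a letter in its own right, so the Latin part of `VOWELS` should be dropped for such text.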

score 3 · Answer 2 · answered Jun 06 '20 at 08:25

This is inspired by the previous answer (the mapping dictionaries are compatible), but it is more complete and does not use regular expressions:

import unicodedata

ACCENT_MAPPING = {
    '́': '',
    '̀': '',
    'а́': 'а',
    'а̀': 'а',
    'е́': 'е',
    'ѐ': 'е',
    'и́': 'и',
    'ѝ': 'и',
    'о́': 'о',
    'о̀': 'о',
    'у́': 'у',
    'у̀': 'у',
    'ы́': 'ы',
    'ы̀': 'ы',
    'э́': 'э',
    'э̀': 'э',
    'ю́': 'ю',
    'ю̀': 'ю',
    'я́': 'я',
    'я̀': 'я',
}
ACCENT_MAPPING = {unicodedata.normalize('NFKC', i): j for i, j in ACCENT_MAPPING.items()}


def unaccentify(s):
    source = unicodedata.normalize('NFKC', s)
    for old, new in ACCENT_MAPPING.items():
        source = source.replace(old, new)
    return source

Note that speed was not a concern here.

I have not checked all characters, though. I will update the answer if anything odd turns up.
