Python: Translate UTF-8 characters as well as possible to Latin letters

Question

I have a text with many UTF-8 characters such as š, œ and Ó.

There are many questions on this platform about converting UTF-8 to ASCII, but they focus on the byte encoding.

I would like to know whether there is a method in Python to replace each of these characters with its most similar analog in the regex character class `[a-zA-Z0-9.,:?!@$€]` (or `[α-ωΑ-Ωa-zA-Z0-9.,:?!@$€]`), i.e. with Latin (or Greek) letters, digits and punctuation marks.

That would yield š -> s, œ -> oe, Ó -> O, but † -> <nothing>.

If no close match can be found, e.g. for symbols or smileys, the character should be deleted.

I know it is subjective which characters can be identified with each other, but maybe there is an approximate solution.
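To make the desired behaviour concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes that decomposing each character with `unicodedata.normalize("NFKD", ...)` and dropping everything outside the allowed class is acceptable. The names `to_latin`, `FALLBACKS` and `ALLOWED` are hypothetical, the fallback table is hand-written and far from complete, and keeping whitespace is an extra assumption that the character class above does not state.

```python
import re
import unicodedata

# Hypothetical, incomplete table for letters that NFKD does not decompose.
FALLBACKS = {
    "œ": "oe", "Œ": "OE",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
}

# Character class from the question, plus whitespace (an assumption).
ALLOWED = re.compile(r"[a-zA-Z0-9.,:?!@$€\s]")


def to_latin(text: str) -> str:
    """Map each character to an allowed approximation, or drop it."""
    out = []
    for ch in text:
        if ALLOWED.fullmatch(ch):
            out.append(ch)  # already in the target range
            continue
        # Try the manual table first, then Unicode decomposition.
        replacement = FALLBACKS.get(ch, ch)
        decomposed = unicodedata.normalize("NFKD", replacement)
        # Keep only base characters in the allowed class; combining
        # marks, symbols and emoji fall away here.
        out.extend(c for c in decomposed if ALLOWED.fullmatch(c))
    return "".join(out)


print(to_latin("Šœ Ó†!"))
```

With this sketch, `š`, `œ` and `Ó` become `s`, `oe` and `O`, while `†` is removed. Note that NFKD also expands compatibility characters (ligatures, fractions, superscripts), which may or may not be wanted, and that the Greek variant of the character class would need `ALLOWED` to be extended accordingly.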

The stdlib function `unicodedata.normalize` can help with some of that, but some of the characters you want won't be decomposed to ASCII counterparts. — jsbueno, Apr 04 '23 at 15:46
[This answer](https://stackoverflow.com/a/2633310) recommends something quite close to what you want (it translates `†` to `+`). If it is good enough for you, removing unwanted characters should be trivial. — InSync, Apr 04 '23 at 15:55
Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) — JosefZ, Apr 04 '23 at 20:45
