I have a text with many non-ASCII characters such as š, œ and Ó.

There are many questions on this platform about converting UTF-8 to ASCII, but they focus on the byte encoding.

I would like to know whether there is a method in Python to replace each of these characters with its most similar analog in the regex character class [a-zA-Z0-9.,:?!@$€] (or [α-ωΑ-Ωa-zA-Z0-9.,:?!@$€]), i.e. with Latin (or Greek) letters, digits and punctuation marks.

That would yield š -> s, œ -> oe, Ó -> O, but † -> <nothing>.

If no close relation can be found, e.g. for symbols or smileys, the character should be deleted.

I know it is subjective which characters can be identified with each other, but maybe there is an approximate solution.

markalex
Uwe.Schneider
The stdlib function `unicodedata.normalize` can help with some of that, but some of the characters you want won't be decomposed into ASCII counterparts. – jsbueno Apr 04 '23 at 15:46
[This answer](https://stackoverflow.com/a/2633310) recommends something quite close to what you want (it translates `†` to `+`). If it is good enough for you, removing unwanted characters should be trivial. – InSync Apr 04 '23 at 15:55
Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) – JosefZ Apr 04 '23 at 20:45
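
For illustration, here is a minimal sketch of the `unicodedata.normalize` route mentioned in the comments, combined with the character classes from the question. It is only an approximation, not a complete solution: the function name `to_basic`, the `keep_greek` flag and the small `FALLBACK` table are hypothetical and hand-written for this example, and keeping the space character is an added assumption (the question's pattern does not list it). Characters that NFKD does not decompose, such as œ, need an explicit entry in the table; everything that still falls outside the allowed class is deleted.

```python
import re
import unicodedata

# Hypothetical, hand-written fallbacks for characters that NFKD does not
# decompose into base letters. Extend as needed; the choices are subjective.
FALLBACK = {
    "œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE",
    "ß": "ss", "ø": "o", "Ø": "O", "ł": "l", "Ł": "L",
}

# Negated versions of the character classes from the question.
# Assumption: a plain space is also kept so that words stay separated.
DISALLOWED_LATIN = re.compile(r"[^a-zA-Z0-9.,:?!@$€ ]")
DISALLOWED_GREEK = re.compile(r"[^α-ωΑ-Ωa-zA-Z0-9.,:?!@$€ ]")


def to_basic(text, keep_greek=False):
    """Approximate each character by a basic letter, or drop it."""
    # 1. Apply the manual table for characters NFKD leaves untouched.
    text = "".join(FALLBACK.get(ch, ch) for ch in text)
    # 2. Decompose, e.g. "š" becomes "s" followed by a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    # 3. Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 4. Delete whatever is still outside the allowed class (e.g. "†").
    pattern = DISALLOWED_GREEK if keep_greek else DISALLOWED_LATIN
    return pattern.sub("", stripped)


print(to_basic("šœÓ†"))                    # soeO
print(to_basic("Ωμέγα", keep_greek=True))  # Ωμεγα
```

Note that NFKD is a compatibility decomposition, so it also rewrites some symbols into several characters (for example, ligatures into their component letters), which may or may not be wanted.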
