I have a text with many non-ASCII (UTF-8) characters such as š, œ and Ó.
There are many questions on this platform about how to transform UTF-8 to ASCII, but they focus on the byte encoding.
I would like to know whether there is a method in Python that replaces each of these characters with its most similar analog in the regex character class [a-zA-Z0-9.,:?!@$€] (or [α-ωΑ-Ωa-zA-Z0-9.,:?!@$€]), i.e. with Latin (or Greek) letters, digits and punctuation marks.
That would yield š -> s, œ -> oe, Ó -> O, but † -> <nothing>.
If no close relation can be found for a character, e.g. for symbols or smileys, it should be deleted.
I know it is subjective which characters can be identified with each other, but maybe there is an approximate solution.
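The desired behaviour can be sketched roughly as follows, using only the standard library. This is an approximation under several assumptions, not a definitive solution: the name `simplify` and the table `EXTRA` are hypothetical, `EXTRA` is a small hand-written list that would need extending, whitespace is kept even though it is not in the pattern above, and only the Latin variant of the pattern is handled. The idea is to decompose accented characters with `unicodedata.normalize("NFKD", ...)` and then delete everything that is still outside the allowed range.

```python
import re
import unicodedata

# Hypothetical hand-written table for characters that have no Unicode
# decomposition into a base letter plus accent (incomplete by design).
EXTRA = str.maketrans({
    "œ": "oe", "Œ": "OE",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
    "ø": "o", "Ø": "O",
})

# Everything outside the target range is deleted.
# Assumption: whitespace (\s) is kept so that words stay separated.
NOT_ALLOWED = re.compile(r"[^a-zA-Z0-9.,:?!@$€\s]")


def simplify(text):
    """Map text to the closest characters in the allowed range."""
    # 1. Apply the manual replacements (e.g. œ -> oe).
    text = text.translate(EXTRA)
    # 2. Split accented letters into base letter + combining mark (š -> s + ˇ).
    text = unicodedata.normalize("NFKD", text)
    # 3. Drop the combining marks and anything else without a close analog.
    return NOT_ALLOWED.sub("", text)


print(simplify("š, œ, Ó, †"))
```

Characters such as † or emoji have no decomposition and no entry in the table, so they are simply removed in the last step.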