I want to remove all punctuation, special letters such as "ū","ú","ǔ","ù","ǖ","ǘ","ǚ","ǜ","ü","û", symbols such as ▬▬▬▬▬▬▬▬◄, and any other characters, keeping only digits, Latin letters and Cyrillic letters.
The input string is encoded as UTF-8.
How can I do this?
Viewed 1,279 times
-1

yanachen
- What do you mean by removing "special letters" but not "latin letters"? Letters like "ú" are Latin. – jaboja May 28 '18 at 10:00
- If you just want to remove accents, see this answer: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string#518232 However, keep in mind that "latin letters" are not only the 26 letters of the English alphabet. There are also cases like the IJ digraph (from Dutch) or the letter Ł (from Polish). The same applies to the Cyrillic script: it is a lot more than just the Russian alphabet. – jaboja May 28 '18 at 10:09
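For reference, the accent-removal approach mentioned in the comment above can be sketched roughly as follows. This is only a minimal illustration, not the code from the linked answer; the function name `remove_accents` is made up for this example. Note that it would also change Cyrillic letters such as "й" (which decomposes into "и" plus a breve), while letters without a decomposition, such as "Ł", are left untouched.

```python
import unicodedata

def remove_accents(text):
    # NFD splits a letter like "ú" into "u" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining accents have the Unicode category "Mn" (nonspacing mark).
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```

With the letters from the question, `remove_accents("ūúǔùǖǘǚǜüû")` gives ten plain "u" characters.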
1 Answer
2
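from string import ascii_letters, digits, whitespace

# Russian alphabet only; extend this string for other Cyrillic-script languages.
cyrillic_letters = u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
# Build the lookup set once so every membership test is cheap.
allowed_chars = set(cyrillic_letters + ascii_letters + digits + whitespace)

def strip(text):
    # Keep only the allowed characters; everything else is dropped.
    return "".join(c for c in text if c in allowed_chars)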
edit: I'm not familiar with the Cyrillic alphabet, but this is how I managed to strip all characters from a string except the ones you specified: Cyrillic letters, Latin letters, digits and (I added this one) whitespace.
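If the Russian alphabet alone is too narrow, one possible variation is a regular expression over Unicode ranges. The sketch below assumes that "Cyrillic" means the basic Cyrillic block (U+0400 to U+04FF), which also contains a few non-letter signs, and that "Latin" means ASCII letters only. The function name `keep_latin_cyrillic_digits` is made up for this example, and since the question says the input is UTF-8, bytes are decoded first.

```python
import re

# Matches anything that is NOT an ASCII letter, an ASCII digit, whitespace,
# or a code point in the basic Cyrillic block (assumed range U+0400-U+04FF).
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s\u0400-\u04FF]")

def keep_latin_cyrillic_digits(data):
    # Accept either UTF-8 encoded bytes or an already decoded string.
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    # Delete every disallowed character.
    return _DISALLOWED.sub("", data)
```

For example, `keep_latin_cyrillic_digits("Привет, wörld! ▬▬◄ 123")` drops the comma, "ö", "!" and the symbols and keeps the rest, including the spaces.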

BARJ