I'm trying to store the frequencies of the words in a text in a Python dictionary. I apply some normalizations to the text to remove accent marks, symbols, punctuation, etc., but after all of this the text still contains some words that raise a UnicodeEncodeError if printed. An example could be '\xe2\x80\x9c'. How can I get rid of those words?
- Take a look at this answer: http://stackoverflow.com/a/3224300/2615940 – skrrgwasme May 11 '15 at 19:15
- Thanks. I was searching without luck, but it seems that I can use len(word.encode('ascii', 'ignore')) != 0 to know if it is a valid word. – David Moreno García May 11 '15 at 19:19
- @skrrgwasme Wait, no. 'España', for example, is a valid word (it is printable) and that method would remove it. – David Moreno García May 11 '15 at 19:21
- Given `'España'` as the input, what do you expect as the output? Nothing? `'Espana'`? – dawg May 12 '15 at 00:40
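For illustration, here is a minimal sketch of the check suggested in the comments and of the objection to it. Encoding with 'ignore' leaves nothing for a symbol-only token (the question's '\xe2\x80\x9c' is the UTF-8 encoding of U+201C, a curly quote), but it also silently drops legitimate non-ASCII letters:
>>> '\u201c'.encode('ascii', 'ignore')  # the curly quote from the question
b''
>>> 'España'.encode('ascii', 'ignore')  # the ñ is silently dropped
b'Espaa'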
1 Answer
You can use the regex module (pip3 install regex) to find all ASCII or non-ASCII characters:
>>> import regex
>>> s='España'
>>> s
'España'
>>> regex.findall(r'\p{ASCII}', s)
['E', 's', 'p', 'a', 'a']
>>> regex.findall(r'\P{ASCII}', s)
['ñ']
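The same pattern also works with regex.sub() if you want to strip the non-ASCII characters rather than list them; a short sketch using the same string:
>>> regex.sub(r'\P{ASCII}', '', s)
'Espaa'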
You can use a character class or negated character class:
>>> import re
>>> re.findall(r'[a-zA-Z]', s)
['E', 's', 'p', 'a', 'a']
>>> re.findall(r'[^a-zA-Z]', s)
['ñ']
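Likewise, re.sub() with the negated class removes everything outside it:
>>> re.sub(r'[^a-zA-Z]', '', s)
'Espaa'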
You can normalize the string and strip all diacritical marks:
>>> import unicodedata
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'Espana'
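If you need to apply this to every word while counting frequencies, it may help to wrap it in a helper; a small sketch (the name strip_marks is just for illustration):
>>> def strip_marks(word):
...     return ''.join(c for c in unicodedata.normalize('NFD', word)
...                    if unicodedata.category(c) != 'Mn')
...
>>> strip_marks('España')
'Espana'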
You can use the same methods with zalgo text:
>>> s2 = 'S̃̄ͧ̒͌͌͛̌͏̨͔̮͉͚̰̣͡o̡̯̫̩̰̖̿̊ͨͧͅm̋̇̌ͪ͡͏̦̰͙̠̩̦͈ê̥͚̺̬̅͋ͩ͋͌ ̶̸̪̮̹̔ͪ͟z̷̝̘̩͚͚̬̙̃ͦͩ͘à̴̘̻̦̲̙̈̈͢l̷̶̳̥̖̜̮̺͙̝̇͂̉̆̈̔̍͡g̰͈̏ͣͭ̌͋̑̓̏o̩͖͎̹̫̖̹ͫ̐ͤ͐̈́ͣ͐͞ ̛̘̗͖̮̗̇̿t̠̟̮̱̹̝̜̭ͦȩ̪̝̦̪̱͗̓̾ͮ͌̈́x̨̞͚͓̄́̍ͥͭͩ͘t̴͇̲̫͎͍̹͈ͬ̋̂̇'
>>> s2.encode('utf-8')
b'S\xcc\x83\xcc\x84\xcd\xa7\xcc\x92\xcd\x8c\xcd\x8c\xcd\x9b\xcc\x8c\xcd\x8f\xcc\xa8\xcd\xa1\xcd\x94\xcc\xae\xcd\x89\xcd\x9a\xcc\xb0\xcc\xa3o\xcc\xbf\xcc\x8a\xcd\xa8\xcd\xa7\xcc\xa1\xcc\xaf\xcc\xab\xcc\xa9\xcc\xb0\xcd\x85\xcc\x96m\xcc\x8b\xcc\x87\xcc\x8c\xcd\xaa\xcd\xa1\xcd\x8f\xcc\xa6\xcc\xb0\xcd\x99\xcc\xa0\xcc\xa9\xcc\xa6\xcd\x88e\xcc\x82\xcc\x85\xcd\x8b\xcd\xa9\xcd\x8b\xcd\x8c\xcc\xa5\xcd\x9a\xcc\xba\xcc\xac \xcc\x94\xcd\xaa\xcd\x9f\xcc\xb6\xcc\xb8\xcc\xaa\xcc\xae\xcc\xb9z\xcc\x83\xcd\xa6\xcd\xa9\xcc\xb7\xcd\x98\xcc\x9d\xcc\x98\xcc\xa9\xcd\x9a\xcd\x9a\xcc\xac\xcc\x99a\xcc\x80\xcc\x88\xcc\x88\xcd\xa2\xcc\xb4\xcc\x98\xcc\xbb\xcc\xa6\xcc\xb2\xcc\x99l\xcc\x87\xcd\x82\xcc\x89\xcc\x86\xcc\x88\xcc\x94\xcc\x8d\xcc\xb7\xcc\xb6\xcd\xa1\xcc\xb3\xcc\xa5\xcc\x96\xcc\x9c\xcc\xae\xcc\xba\xcd\x99\xcc\x9dg\xcc\x8f\xcd\xa3\xcd\xad\xcc\x8c\xcd\x8b\xcc\x91\xcd\x83\xcc\x8f\xcc\xb0\xcd\x88o\xcd\xab\xcc\x90\xcd\xa4\xcd\x90\xcd\x84\xcd\xa3\xcd\x90\xcd\x9e\xcc\xa9\xcd\x96\xcd\x8e\xcc\xb9\xcc\xab\xcc\x96\xcc\xb9 \xcc\x87\xcc\xbf\xcc\x9b\xcc\x98\xcc\x97\xcd\x96\xcc\xae\xcc\x97t\xcd\xa6\xcc\xa0\xcc\x9f\xcc\xae\xcc\xb1\xcc\xb9\xcc\x9d\xcc\x9c\xcc\xade\xcd\x97\xcd\x83\xcc\xbe\xcd\xae\xcd\x8c\xcd\x84\xcc\xa7\xcc\xaa\xcc\x9d\xcc\xa6\xcc\xaa\xcc\xb1x\xcc\x84\xcc\x81\xcc\x8d\xcd\xa5\xcd\xad\xcd\xa9\xcd\x98\xcc\xa8\xcc\x9e\xcd\x9a\xcd\x93t\xcd\xac\xcc\x8b\xcc\x82\xcc\x87\xcc\xb4\xcd\x87\xcc\xb2\xcc\xab\xcd\x8e\xcd\x8d\xcc\xb9\xcd\x88'
>>> ''.join(regex.findall(r'\p{ASCII}', s2))
'Some zalgo text'
>>> ''.join((c for c in unicodedata.normalize('NFD', s2) if unicodedata.category(c) != 'Mn'))
'Some zalgo text'
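Putting it together for the original word-frequency problem, a rough sketch (the sample text is illustrative): decompose each word, drop the combining marks, strip any remaining non-ASCII symbols such as the curly quotes, and count whatever is left with collections.Counter:
>>> from collections import Counter
>>> text = '\u201cEspaña\u201d es un país'
>>> words = []
>>> for w in text.split():
...     w = ''.join(c for c in unicodedata.normalize('NFD', w)
...                 if unicodedata.category(c) != 'Mn')
...     w = regex.sub(r'\P{ASCII}', '', w)  # removes leftover symbols like '\u201c'
...     if w:
...         words.append(w)
...
>>> Counter(words)
Counter({'Espana': 1, 'es': 1, 'un': 1, 'pais': 1})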