Remove non printable words

Question

I'm trying to store frequencies of words in a text in a Python dictionary. I apply some normalizations to the text to remove accent marks, symbols, punctuation, etc but after all of this the text still contains some words that raise an UnicodeEncodeError if printed. An example could be '\xe2\x80\x9c'. How can I get rid of those words?

Take a look at this answer: http://stackoverflow.com/a/3224300/2615940 — skrrgwasme, May 11 '15 at 19:15
Thanks. I was searching without luck but seems that I can use len(word.encode('ascii', 'ignore')) != 0 to know if is a valid word. — David Moreno García, May 11 '15 at 19:19
@skrrgwasme wait, no. 'España' for example is a valid word (is printable) and that method would remove it. — David Moreno García, May 11 '15 at 19:21
Given `'España'` as the input, what do you expect as the output? Nothing? `'Espana'`? — dawg, May 12 '15 at 00:40

score 1 · Answer 1 · edited May 23 '17 at 11:59

You can use the regex module (pip3 install regex) to find all ASCII or non ASCII letters:

>>> import regex
>>> s='España'
>>> s
'España'
>>> regex.findall(r'\p{ASCII}', s)
['E', 's', 'p', 'a', 'a']
>>> regex.findall(r'\P{ASCII}', s)
['ñ']

You can use a character class or negated character class:

>>> import re
>>> re.findall(r'[a-zA-Z]', s)
['E', 's', 'p', 'a', 'a']
>>> re.findall(r'[^a-zA-Z]', s)
['ñ']

You can normalize without any diacriticals:

>>> import unicodedata
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'Espana'

You can use the same methods with zalgo text

>>> s2.encode('utf-8')
b'S\xcc\x83\xcc\x84\xcd\xa7\xcc\x92\xcd\x8c\xcd\x8c\xcd\x9b\xcc\x8c\xcd\x8f\xcc\xa8\xcd\xa1\xcd\x94\xcc\xae\xcd\x89\xcd\x9a\xcc\xb0\xcc\xa3o\xcc\xbf\xcc\x8a\xcd\xa8\xcd\xa7\xcc\xa1\xcc\xaf\xcc\xab\xcc\xa9\xcc\xb0\xcd\x85\xcc\x96m\xcc\x8b\xcc\x87\xcc\x8c\xcd\xaa\xcd\xa1\xcd\x8f\xcc\xa6\xcc\xb0\xcd\x99\xcc\xa0\xcc\xa9\xcc\xa6\xcd\x88e\xcc\x82\xcc\x85\xcd\x8b\xcd\xa9\xcd\x8b\xcd\x8c\xcc\xa5\xcd\x9a\xcc\xba\xcc\xac \xcc\x94\xcd\xaa\xcd\x9f\xcc\xb6\xcc\xb8\xcc\xaa\xcc\xae\xcc\xb9z\xcc\x83\xcd\xa6\xcd\xa9\xcc\xb7\xcd\x98\xcc\x9d\xcc\x98\xcc\xa9\xcd\x9a\xcd\x9a\xcc\xac\xcc\x99a\xcc\x80\xcc\x88\xcc\x88\xcd\xa2\xcc\xb4\xcc\x98\xcc\xbb\xcc\xa6\xcc\xb2\xcc\x99l\xcc\x87\xcd\x82\xcc\x89\xcc\x86\xcc\x88\xcc\x94\xcc\x8d\xcc\xb7\xcc\xb6\xcd\xa1\xcc\xb3\xcc\xa5\xcc\x96\xcc\x9c\xcc\xae\xcc\xba\xcd\x99\xcc\x9dg\xcc\x8f\xcd\xa3\xcd\xad\xcc\x8c\xcd\x8b\xcc\x91\xcd\x83\xcc\x8f\xcc\xb0\xcd\x88o\xcd\xab\xcc\x90\xcd\xa4\xcd\x90\xcd\x84\xcd\xa3\xcd\x90\xcd\x9e\xcc\xa9\xcd\x96\xcd\x8e\xcc\xb9\xcc\xab\xcc\x96\xcc\xb9 \xcc\x87\xcc\xbf\xcc\x9b\xcc\x98\xcc\x97\xcd\x96\xcc\xae\xcc\x97t\xcd\xa6\xcc\xa0\xcc\x9f\xcc\xae\xcc\xb1\xcc\xb9\xcc\x9d\xcc\x9c\xcc\xade\xcd\x97\xcd\x83\xcc\xbe\xcd\xae\xcd\x8c\xcd\x84\xcc\xa7\xcc\xaa\xcc\x9d\xcc\xa6\xcc\xaa\xcc\xb1x\xcc\x84\xcc\x81\xcc\x8d\xcd\xa5\xcd\xad\xcd\xa9\xcd\x98\xcc\xa8\xcc\x9e\xcd\x9a\xcd\x93t\xcd\xac\xcc\x8b\xcc\x82\xcc\x87\xcc\xb4\xcd\x87\xcc\xb2\xcc\xab\xcd\x8e\xcd\x8d\xcc\xb9\xcd\x88'

>

S̃̄ͧ̒͌͌͛̌͏̨͔̮͉͚̰̣͡o̡̯̫̩̰̖̿̊ͨͧͅm̋̇̌ͪ͡͏̦̰͙̠̩̦͈ê̥͚̺̬̅͋ͩ͋͌ ̶̸̪̮̹̔ͪ͟z̷̝̘̩͚͚̬̙̃ͦͩ͘à̴̘̻̦̲̙̈̈͢l̷̶̳̥̖̜̮̺͙̝̇͂̉̆̈̔̍͡g̰͈̏ͣͭ̌͋̑̓̏o̩͖͎̹̫̖̹ͫ̐ͤ͐̈́ͣ͐͞ ̛̘̗͖̮̗̇̿t̠̟̮̱̹̝̜̭ͦȩ̪̝̦̪̱͗̓̾ͮ͌̈́x̨̞͚͓̄́̍ͥͭͩ͘t̴͇̲̫͎͍̹͈ͬ̋̂̇

.

>>> ''.join(regex.findall(r'\p{ASCII}', s2))
'Some zalgo text'
>>> ''.join((c for c in unicodedata.normalize('NFD', s2) if unicodedata.category(c) != 'Mn'))
'Some zalgo text'

Remove non printable words

1 Answers1