I'm trying to store the frequencies of the words in a text in a Python dictionary. I apply some normalizations to the text to remove accent marks, symbols, punctuation, etc., but after all of this the text still contains some words that raise a UnicodeEncodeError when printed. An example is '\xe2\x80\x9c'. How can I get rid of those words?

David Moreno García

1 Answer

You can use the third-party regex module (pip3 install regex) to find all ASCII or non-ASCII characters:

>>> import regex
>>> s='España'
>>> s
'España'
>>> regex.findall(r'\p{ASCII}', s)
['E', 's', 'p', 'a', 'a']
>>> regex.findall(r'\P{ASCII}', s)
['ñ']

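With the standard re module, you can use a character class or a negated character class. Note that [^a-zA-Z] matches anything that is not an ASCII letter, so it also matches digits, whitespace, and punctuation: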

>>> import re
>>> re.findall(r'[a-zA-Z]', s)
['E', 's', 'p', 'a', 'a']
>>> re.findall(r'[^a-zA-Z]', s)
['ñ']
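
To get rid of whole words rather than single characters, one option is to keep only the words made up entirely of ASCII characters. This is a minimal sketch using only the standard library. The names ASCII_ONLY and drop_non_ascii_words are made up for illustration, and it assumes that words are separated by whitespace:

```python
import re

# Hypothetical helper names; the pattern matches strings of ASCII characters only.
ASCII_ONLY = re.compile(r'[\x00-\x7F]+')

def drop_non_ascii_words(text):
    """Return the whitespace-separated words that contain only ASCII characters."""
    return [w for w in text.split() if ASCII_ONLY.fullmatch(w)]

# '\xe2\x80\x9c' from the question is the UTF-8 encoding of U+201C,
# the left double quotation mark.
quote = b'\xe2\x80\x9c'.decode('utf-8')
words = drop_non_ascii_words('hello ' + quote + ' world España')
print(words)
```

This discards 'España' as well, so if accented words should be kept, strip the diacritics before filtering.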

You can also strip the diacritics by decomposing the string (NFD) and dropping the combining marks (category Mn):

>>> import unicodedata
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'Espana'
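
Note that this only removes combining marks. A character with no decomposition, such as the curly quote '\u201c' from the question, passes through unchanged. A possible sketch that handles both cases is below. The helper names strip_diacritics and to_ascii are assumptions for illustration, and dropping every remaining non-ASCII character is a lossy choice that may not suit every text:

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks (category Mn)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

def to_ascii(text):
    """Strip diacritics, then silently drop whatever is still non-ASCII."""
    return strip_diacritics(text).encode('ascii', 'ignore').decode('ascii')

print(strip_diacritics('\u201cEspaña\u201d'))  # curly quotes survive
print(to_ascii('\u201cEspaña\u201d'))          # curly quotes are dropped
```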

The same methods also work on zalgo text:

>>> s2.encode('utf-8')
b'S\xcc\x83\xcc\x84\xcd\xa7\xcc\x92\xcd\x8c\xcd\x8c\xcd\x9b\xcc\x8c\xcd\x8f\xcc\xa8\xcd\xa1\xcd\x94\xcc\xae\xcd\x89\xcd\x9a\xcc\xb0\xcc\xa3o\xcc\xbf\xcc\x8a\xcd\xa8\xcd\xa7\xcc\xa1\xcc\xaf\xcc\xab\xcc\xa9\xcc\xb0\xcd\x85\xcc\x96m\xcc\x8b\xcc\x87\xcc\x8c\xcd\xaa\xcd\xa1\xcd\x8f\xcc\xa6\xcc\xb0\xcd\x99\xcc\xa0\xcc\xa9\xcc\xa6\xcd\x88e\xcc\x82\xcc\x85\xcd\x8b\xcd\xa9\xcd\x8b\xcd\x8c\xcc\xa5\xcd\x9a\xcc\xba\xcc\xac \xcc\x94\xcd\xaa\xcd\x9f\xcc\xb6\xcc\xb8\xcc\xaa\xcc\xae\xcc\xb9z\xcc\x83\xcd\xa6\xcd\xa9\xcc\xb7\xcd\x98\xcc\x9d\xcc\x98\xcc\xa9\xcd\x9a\xcd\x9a\xcc\xac\xcc\x99a\xcc\x80\xcc\x88\xcc\x88\xcd\xa2\xcc\xb4\xcc\x98\xcc\xbb\xcc\xa6\xcc\xb2\xcc\x99l\xcc\x87\xcd\x82\xcc\x89\xcc\x86\xcc\x88\xcc\x94\xcc\x8d\xcc\xb7\xcc\xb6\xcd\xa1\xcc\xb3\xcc\xa5\xcc\x96\xcc\x9c\xcc\xae\xcc\xba\xcd\x99\xcc\x9dg\xcc\x8f\xcd\xa3\xcd\xad\xcc\x8c\xcd\x8b\xcc\x91\xcd\x83\xcc\x8f\xcc\xb0\xcd\x88o\xcd\xab\xcc\x90\xcd\xa4\xcd\x90\xcd\x84\xcd\xa3\xcd\x90\xcd\x9e\xcc\xa9\xcd\x96\xcd\x8e\xcc\xb9\xcc\xab\xcc\x96\xcc\xb9 \xcc\x87\xcc\xbf\xcc\x9b\xcc\x98\xcc\x97\xcd\x96\xcc\xae\xcc\x97t\xcd\xa6\xcc\xa0\xcc\x9f\xcc\xae\xcc\xb1\xcc\xb9\xcc\x9d\xcc\x9c\xcc\xade\xcd\x97\xcd\x83\xcc\xbe\xcd\xae\xcd\x8c\xcd\x84\xcc\xa7\xcc\xaa\xcc\x9d\xcc\xa6\xcc\xaa\xcc\xb1x\xcc\x84\xcc\x81\xcc\x8d\xcd\xa5\xcd\xad\xcd\xa9\xcd\x98\xcc\xa8\xcc\x9e\xcd\x9a\xcd\x93t\xcd\xac\xcc\x8b\xcc\x82\xcc\x87\xcc\xb4\xcd\x87\xcc\xb2\xcc\xab\xcd\x8e\xcd\x8d\xcc\xb9\xcd\x88'

Those bytes are the UTF-8 encoding of s2, which renders as:

S̃̄ͧ̒͌͌͛̌͏̨͔̮͉͚̰̣͡o̡̯̫̩̰̖̿̊ͨͧͅm̋̇̌ͪ͡͏̦̰͙̠̩̦͈ê̥͚̺̬̅͋ͩ͋͌ ̶̸̪̮̹̔ͪ͟z̷̝̘̩͚͚̬̙̃ͦͩ͘à̴̘̻̦̲̙̈̈͢l̷̶̳̥̖̜̮̺͙̝̇͂̉̆̈̔̍͡g̰͈̏ͣͭ̌͋̑̓̏o̩͖͎̹̫̖̹ͫ̐ͤ͐̈́ͣ͐͞ ̛̘̗͖̮̗̇̿t̠̟̮̱̹̝̜̭ͦȩ̪̝̦̪̱͗̓̾ͮ͌̈́x̨̞͚͓̄́̍ͥͭͩ͘t̴͇̲̫͎͍̹͈ͬ̋̂̇


>>> ''.join(regex.findall(r'\p{ASCII}', s2))
'Some zalgo text'
>>> ''.join((c for c in unicodedata.normalize('NFD', s2) if unicodedata.category(c) != 'Mn'))
'Some zalgo text'
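
Putting it together for the word-frequency use case in the question, something along these lines might work. It is only a sketch: to_plain_ascii and word_frequencies are hypothetical names, and it assumes that lowercasing and splitting on whitespace is an acceptable definition of a word:

```python
import unicodedata
from collections import Counter

def to_plain_ascii(text):
    """Strip combining marks, then drop any remaining non-ASCII characters."""
    decomposed = unicodedata.normalize('NFD', text)
    no_marks = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
    return no_marks.encode('ascii', 'ignore').decode('ascii')

def word_frequencies(text):
    """Count lowercase, whitespace-separated words after ASCII normalization."""
    return Counter(to_plain_ascii(text).lower().split())

freqs = word_frequencies('España \u201cEspaña\u201d café cafe')
print(freqs)
```

A word consisting only of non-ASCII symbols becomes an empty string and disappears in split(), so it never reaches the counter.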
dawg