
Given a text file (or Unicode string), what is a good way to detect characters that fall outside of the ASCII encoding? I could easily just iterate over the text and pass each character to ord(), but I wonder if there's a more efficient, elegant, or idiomatic way to do it.

The ultimate goal here is to compile a list of characters in the data that cannot be encoded to ASCII.

In case it matters, my corpus is approx. 500 MB across 1,200 text files, and I'm running (pre-compiled vanilla) Python 3.3.1 on Win7 (64-bit).
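For concreteness, here's the sort of thing I mean by the ord() approach (the sample string is just an illustration):

>>> text = '£100 is worth more than €100'
>>> sorted(c for c in text if ord(c) > 127)  # keep only code points above 127
['£', '€']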

Sukotto

1 Answer


The ultimate goal here is to compile a list of characters in the data that cannot be encoded to ASCII.

The most efficient method I can think of would be to use re.sub() to strip out any valid ASCII characters, which should leave you with a string containing all the non-ASCII characters.

This will strip out just the printable ASCII characters...

>>> import re
>>> print(re.sub('[ -~]', '', '£100 is worth more than €100'))
£€

...or if you want to include the non-printable characters, use this...

>>> print(re.sub('[\x00-\x7f]', '', '£100 is worth more than €100'))
£€

To eliminate the duplicates, just create a set() from the returned string...

>>> print(set(re.sub('[\x00-\x7f]', '', '£€£€')))
{'£', '€'}
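Applied to your whole corpus, it might look something like this (a sketch, assuming the files are UTF-8-encoded and live under a corpus/ directory; adjust the glob pattern to match your layout):

import glob
import re

non_ascii = set()
for filename in glob.glob('corpus/*.txt'):  # hypothetical corpus layout
    with open(filename, encoding='utf-8') as f:  # assumes UTF-8-encoded files
        # strip all ASCII code points; whatever remains is non-ASCII
        non_ascii.update(re.sub('[\x00-\x7f]', '', f.read()))
print(sorted(non_ascii))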
Aya