
Given a text file (or Unicode string), what is a good way to detect characters that fall outside of the ASCII encoding? I could easily just iterate over the text and pass each character to ord(), but I wonder if there's a more efficient, elegant, or idiomatic way to do it.

The ultimate goal here is to compile a list of characters in the data that cannot be encoded to ASCII.

In case it matters, my corpus is approx. 500 MB across 1,200 text files, and I'm running (pre-compiled vanilla) Python 3.3.1 on Win7 (64-bit).
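For concreteness, here's the sort of thing I mean by the ord() approach (the sample string is just an illustration):

>>> text = '£100 is worth more than €100'
>>> sorted(c for c in text if ord(c) > 127)  # keep only code points above 127
['£', '€']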

Sukotto

1 Answer


The ultimate goal here is to compile a list of characters in the data that cannot be encoded to ASCII.

The most efficient method I can think of would be to use re.sub() to strip out any valid ASCII characters, which should leave you with a string containing all the non-ASCII characters.

This will strip out just the printable ASCII characters...

>>> import re
>>> print(re.sub('[ -~]', '', '£100 is worth more than €100'))
£€

...or if you want to include the non-printable characters, use this...

>>> print(re.sub('[\x00-\x7f]', '', '£100 is worth more than €100'))
£€

To eliminate the duplicates, just create a set() from the returned string...

>>> print(set(re.sub('[\x00-\x7f]', '', '£€£€')))
{'£', '€'}
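Applied to your whole corpus, it might look something like this (a sketch, assuming the files are UTF-8-encoded and live under a corpus/ directory; adjust the glob pattern to match your layout):

import glob
import re

non_ascii = set()
for filename in glob.glob('corpus/*.txt'):  # hypothetical corpus layout
    with open(filename, encoding='utf-8') as f:  # assumes UTF-8-encoded files
        # strip all ASCII code points; whatever remains is non-ASCII
        non_ascii.update(re.sub('[\x00-\x7f]', '', f.read()))
print(sorted(non_ascii))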
Aya