Given a text file (or Unicode string), what is a good way to detect characters that fall outside the ASCII encoding? I could easily iterate over the string and pass each character to ord(), but I wonder whether there is a more efficient, elegant, or idiomatic way to do it.
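
For concreteness, the naive version I have in mind is roughly this (a minimal sketch, nothing clever):

```python
def non_ascii_chars(text):
    """Return the set of characters in text outside the ASCII range (0-127)."""
    return {ch for ch in text if ord(ch) > 127}
```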
The ultimate goal here is to compile a list of the characters in the data that cannot be encoded as ASCII.
In case it matters: the corpus is approximately 500 MB across 1200 text files, and I am running (pre-compiled vanilla) Python 3.3.1 on Windows 7 (64-bit).
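
To show what I mean at corpus scale, here is a rough sketch of what I am doing now (the `corpus/*.txt` glob pattern and the UTF-8 source encoding are just assumptions for illustration; my actual files may differ):

```python
import glob

found = set()
for path in glob.glob('corpus/*.txt'):  # placeholder pattern, not my real layout
    # Assuming the files are UTF-8; substitute the real source encoding as needed.
    with open(path, encoding='utf-8') as f:
        found.update(ch for ch in f.read() if ord(ch) > 127)

# The desired end result: every distinct character that cannot encode to ASCII.
print(sorted(found))
```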