
I am parsing a large number (~90,000) of CSV files. Some of the files were converted to text from PDF, so they contain a lot of noise in the form of garbled characters, for example "Cachï¿". Some of these files were converted online and some through pdfminer. In my program I parse the files and remove the stop words:

cleanedRow = ' '.join([word for word in row[1].split() if word not in stopWrds])

But because of these encoding/decoding issues, my program fails. I cannot delete all such characters by searching through 90,000 files by hand. The program throws the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

Is there an elegant way to ignore these characters in Python? I would appreciate any help. Thanks.
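For reference, here is a minimal sketch of the kind of workaround I have in mind: open each file with an explicit encoding and errors="ignore" so undecodable bytes are silently dropped instead of raising UnicodeDecodeError. Assumes Python 3; STOP_WORDS and the file path are placeholders, not my real data.

```python
import csv

STOP_WORDS = {"the", "a", "and"}  # placeholder stop-word set


def clean_row_text(text, stop_words):
    """Drop stop words from a whitespace-split string."""
    return " ".join(w for w in text.split() if w not in stop_words)


def parse_file(path, stop_words):
    # errors="ignore" silently drops byte sequences that are not valid
    # UTF-8, so stray bytes like 0xc3 no longer raise UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="ignore", newline="") as fh:
        for row in csv.reader(fh):
            if len(row) > 1:
                yield clean_row_text(row[1], stop_words)
```

Using errors="replace" instead would substitute U+FFFD for bad bytes rather than dropping them, which keeps a visible marker of where the noise was.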

CodeSsscala

0 Answers