Skip non-decodeable characters in Python file reader

Question

I have a csv file, which I want to read with Python. When I use the following code snippet, I get an error.

with open(input_file, 'r') as file:
    self.md = file.read()

UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 58658: ordinal not in range(128)

or

with open(input_file, 'r', encoding='ascii') as file:
    # START INFINITE LOOP
    while (True):
        self.md = file.readline()
        print (self.md)
    # END INFINITE LOOP

UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 1314: ordinal not in range(128)

or

with open(input_file, 'r', encoding='utf8') as file:
    self.md = file.read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 58658: invalid start byte

or

with open(input_file, 'r') as file:
    # START INFINITE LOOP
    while (True):
        self.md = file.readline()
        print (self.md)
    # END INFINITE LOOP

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 1314: invalid start byte

When I open the file in TextWrangler or in Excel, I don't see any strange characters in it, even when I select Display Invisibles in TextWrangler. Another strange observation: it always goes wrong at line 1380, even when I delete lines 1370-1390 from the file. This makes me wonder whether that line even contains a bad character.

Is there a way to read the file and to simply skip non-decodeable characters?

EDIT

This is a hex dump around the problematic area. Position 58658 is position E522 in hexadecimal. The 89 in the second field in the second line seems to be the culprit.

000e510: 3436 3822 3b22 4152 454d 4920 2020 2020  468";"AREMI     
000e520: 6e6f 8922 3b3b 3b0a 2246 3130 3030 3134  no.";;;."F100014
000e530: 3639 223b 2230 3030 3134 3639 223b 2245  69";"0001469";"E
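
As a rough cross-check of the dump above, the following sketch decodes the sixteen bytes of the second dump line and reports where decoding fails. The variable names are made up for illustration; the bytes and the offset 0xe520 are taken from the dump.

```python
# Bytes copied from the second line of the hex dump (file offset 0xe520).
dump_line = bytes.fromhex("6e6f89223b3b3b0a2246313030303134")
line_offset = 0xE520

try:
    dump_line.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first undecodable byte within dump_line.
    bad_position = line_offset + exc.start
    bad_byte = dump_line[exc.start]
    print(bad_position, hex(bad_byte))  # prints: 58658 0x89
```

The reported position matches the one in the original error message, which suggests the single byte 0x89 is what the UTF-8 decoder rejects.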

EDIT 2

It turns out that using encoding windows-1250, I can read the file. The question remains: is it possible to read the file assuming UTF-8, and skipping byte sequences that cannot be read?
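
For reference, Python 3's built-in `open()` accepts an `errors` argument that controls what happens to undecodable bytes. The sketch below is only an illustration: it writes a small throwaway file (the sample bytes and the file name are assumptions modelled on the hex dump) and reads it back as UTF-8, once dropping bad bytes and once replacing them.

```python
import os
import tempfile

# Hypothetical sample data containing the stray 0x89 byte from the hex dump.
raw = b'"AREMI no\x89";;;\n"F10001469";"0001469"\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "wb") as handle:
        handle.write(raw)

    # errors="ignore" silently drops every byte that is not valid UTF-8.
    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
        skipped = handle.read()

    # errors="replace" substitutes U+FFFD so the damage stays visible.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        replaced = handle.read()
```

Note that `errors="ignore"` loses data without any trace, so `errors="replace"` is usually the safer choice when the result will be inspected later.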

There are probably ways to do that, but the proper solution is to ensure that your input is in a well-defined encoding. Are you sure it's really supposed to be in UTF-8? Can you provide a hex dump of the bytes around the problematic sequence at position 58658 (meaning, that many bytes into the file)? See also [the `character-encoding` tag wiki](http://stackoverflow.com/tags/character-encoding/info) for some troubleshooting tips. — tripleee, May 28 '15 at 07:00
I think we are missing some information here. Just reading from a file wouldn't trigger a UnicodeError. — knitti, May 28 '15 at 07:03
And yet it does. And no, there is no more Python code. The csv file itself, I received from someone else, so I don't know how they made it. — physicalattraction, May 28 '15 at 07:41

score 1 · Answer 1 · answered May 28 '15 at 07:08

None of your first two snippets could possibly raise a UnicodeDecodeError - only the third one (which is quite braindead FWIW - infinite loop indeed), when it hits the print(self.md) statement. The problem is not with reading the file but with your stdout not handling the encoding.

Also I don't think you really understand what Unicode is - there's no such thing as a "non-unicode character". I strongly suggest you read this article about unicode and encodings.

bruno desthuilliers


Thanks for the link, this makes the terminology clearer. So I have a byte sequence that cannot be decoded using UTF-8 to a correct Unicode character. – physicalattraction May 28 '15 at 07:40
Why would you use utf-8 if it's windows-1250 ??? Oh and by the way: are you using Python 2.x or Python 3.x ? – bruno desthuilliers May 28 '15 at 11:00
I just figured out that the encoding was Windows-1250. I want to make a generic file reader that can read files from all sources. I have done this now with a try: except UnicodeDecodeError: statement. I use Python 3.4. – physicalattraction May 28 '15 at 11:48
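
A minimal sketch of what such a try/except fallback might look like. The function name, the candidate encoding list, and the sample file are all hypothetical, not the asker's actual code; the last-resort branch uses `errors="replace"` so the reader never raises.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1250")):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as handle:
                return handle.read(), encoding
        except UnicodeDecodeError:
            # This candidate cannot decode the file; try the next one.
            continue
    # Last resort: fall back to the first candidate and mark undecodable
    # bytes with U+FFFD instead of raising.
    with open(path, "r", encoding=encodings[0], errors="replace") as handle:
        return handle.read(), encodings[0] + " (with replacements)"


# Usage on a throwaway file containing the 0x89 byte from the hex dump.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b'"AREMI no\x89";;;\n')
    text, used_encoding = read_text_with_fallback(sample_path)
```

A clean decode does not prove the guess was right: single-byte encodings such as cp1250 accept almost any byte sequence, so the order of the candidates matters and the result is still worth eyeballing.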
"I use Python 3.4" uh ok that's why you get these errors reading files I guess - Python 3 strings are Unicode only. You may want to read the doc about `open()`, text mode and encodings : https://docs.python.org/3/library/functions.html#open - perhaps using binary mode would be safer here, but it won't solve the base problem: if you don't know how your file has been encoded, you won't be able to decode it to unicode. The best answer you can find on this problem is probably here : http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – bruno desthuilliers May 28 '15 at 12:04
