I have a csv file, which I want to read with Python. When I use the following code snippet, I get an error.
    with open(input_file, 'r') as file:
        self.md = file.read()

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 58658: ordinal not in range(128)
or
    with open(input_file, 'r', encoding='ascii') as file:
        # START INFINITE LOOP
        while True:
            self.md = file.readline()
            print(self.md)
        # END INFINITE LOOP

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x89 in position 1314: ordinal not in range(128)
or
    with open(input_file, 'r', encoding='utf8') as file:
        self.md = file.read()

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 58658: invalid start byte
or
    with open(input_file, 'r') as file:
        # START INFINITE LOOP
        while True:
            self.md = file.readline()
            print(self.md)
        # END INFINITE LOOP

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 1314: invalid start byte
When I open the file in TextWrangler or in Excel, I don't see any strange characters in it, even when I select Display Invisibles in TextWrangler. Another strange observation: the error always occurs at line 1380, even when I delete lines 1370-1390 from the file. This makes me wonder whether there really is a bad character in that line.
Is there a way to read the file and to simply skip non-decodeable characters?
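To illustrate the behaviour I am after, here is a small self-contained sketch (the file content is a made-up fragment mimicking my data, using the 0x89 byte from the hex dump below) showing what I would like `open` to do with undecodable bytes:

```python
import os
import tempfile

# Hypothetical fragment containing the offending 0x89 byte
raw = b'"F1000468";"AREMI no\x89";;;\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)

# Desired behaviour: decode as UTF-8 but skip (or replace)
# byte sequences that are not valid UTF-8.
with open(tmp.name, 'r', encoding='utf-8', errors='ignore') as f:
    text = f.read()
print(text)  # the 0x89 byte is silently dropped

os.unlink(tmp.name)
```

I am not sure whether `errors='ignore'` (or `errors='replace'`) is the recommended way to do this, or whether it has pitfalls.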
EDIT
This is a hex dump around the problematic area. Position 58658 is position E522 in hexadecimal. The 89 in the second field in the second line seems to be the culprit.
    000e510: 3436 3822 3b22 4152 454d 4920 2020 2020 468";"AREMI
    000e520: 6e6f 8922 3b3b 3b0a 2246 3130 3030 3134 no.";;;."F100014
    000e530: 3639 223b 2230 3030 3134 3639 223b 2245 69";"0001469";"E
EDIT 2
It turns out that using encoding windows-1250, I can read the file. The question remains: is it possible to read the file assuming UTF-8, skipping byte sequences that cannot be decoded?