
Trying to read a tab-separated file into a pandas DataFrame:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False)

It errors out like so:

b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte

It seems the byte 0xc0 causes pain with both the utf-8 and ascii encodings.

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)

I ran into the same issue with the csv module's reader too.

If I import the file into OpenOffice Calc, it gets imported properly: the columns are recognized correctly and so on. Presumably the offending 0xc0 byte is simply ignored there. It's not a vital piece of the data; it's probably just a fluke write error by the system that generated this file. I'd be happy to zap the line where this occurs if it comes to that; I just want to read the file into the Python program. The error_bad_lines=False option of pandas ought to have taken care of this problem, but no dice. Also, the file does NOT have any content in non-English scripts that would make Unicode necessary; it's all standard English letters and numbers. I tried utf-16, utf-32, etc. too, but they only caused more errors of their own.
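For what it's worth, a rough sketch of the zap-the-line approach: filter at the byte level before anything tries to decode, with fn being the same path as above.

import io
import pandas as pd

# Drop any raw line that contains the rogue 0xc0 byte before decoding.
with open(fn, 'rb') as f:
    cleaned = b''.join(line for line in f if b'\xc0' not in line)

df = pd.read_table(io.BytesIO(cleaned), na_filter=False)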

How can I make Python (a pandas DataFrame in particular) read a file containing one or more rogue 0xc0 bytes?

  • Maybe read the file as binary, and then decode it using UTF-8 with a less strict codec. The `decode` method accepts an optional parameter where you can say replace invalid sequences with U+FFFD or simply discard them. – tripleee Apr 15 '18 at 18:57
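A minimal sketch of that suggestion, assuming fn is the same path as above: read the raw bytes, decode leniently, then hand the text to pandas through an in-memory buffer.

import io
import pandas as pd

with open(fn, 'rb') as f:
    raw = f.read()

# errors='replace' turns invalid sequences such as 0xc0 into U+FFFD;
# errors='ignore' would silently drop them instead.
text = raw.decode('utf-8', errors='replace')

df = pd.read_table(io.StringIO(text), na_filter=False)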

1 Answer


Moving this answer here from another place where it got a hostile reception.

Found one encoding standard that actually accepts (meaning, doesn't error out on) byte 0xc0:

encoding="ISO-8859-1"  

Note: This entails making sure the rest of the file doesn't contain genuine Unicode characters, since any real UTF-8 text would be silently mis-decoded. This may be helpful for folks like me who didn't have any Unicode characters in their file anyway and just wanted Python to load the damn thing while both the utf-8 and ascii encodings were erroring out.

More on ISO-8859-1: What is the difference between UTF-8 and ISO-8859-1?
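The reason this encoding never raises the error: ISO-8859-1 assigns a character to every one of the 256 possible byte values, so decoding cannot fail. For instance, 0xc0 simply becomes 'À':

>>> b'\xc0'.decode('iso-8859-1')
'À'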

New command that works:

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ISO-8859-1')

After reading it in, the dataframe is fine; the columns and data all work just as they did in OpenOffice Calc. I still have no idea where the offending 0xc0 byte went, but it doesn't matter as I've got the data I needed.
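For the curious: since ISO-8859-1 maps 0xc0 to 'À', searching the stringified dataframe for that character should locate where the byte ended up. A hypothetical check:

>>> mask = df.astype(str).apply(lambda col: col.str.contains('\u00c0')).any(axis=1)
>>> df[mask]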

  • ISO-8859-1 works for this particular byte, but is undefined for the range 0x80-0x9f. A popular (but also reviled) encoding which can handle any 8-bit byte is Windows code page 1252. – tripleee Apr 15 '18 at 18:54
  • I wouldn't call that "hostile". A bit negative perhaps. – tripleee Apr 15 '18 at 18:58
  • @tripleee thanks for the tip.. could you give this completeness by specifying what exactly to supply in the `encoding=` parameter? I'm guessing it's not going to be `encoding='Windows code page 1252'` – Nikhil VJ Apr 16 '18 at 18:37
  • Python uses `encoding='cp1252'` though there are some aliases in the code (I think `'windows-1252'` would work too, for example). – tripleee Apr 17 '18 at 04:14
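Per that comment, the cp1252 variant of the call would presumably look like this (same caveats as the ISO-8859-1 version):

>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='cp1252')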