I'm trying to read a tab-separated file into a pandas DataFrame:
>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False)
It errors out like so:
b'Skipping line 58: expected 11 fields, saw 12\n'
Traceback (most recent call last):
...(many lines)...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 115: invalid start byte
It seems the byte 0xc0 causes pain with both the utf-8 and ascii encodings:
>>> df = pd.read_table(fn, na_filter=False, error_bad_lines=False, encoding='ascii')
...(many lines)...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc0 in position 115: ordinal not in range(128)
I ran into the same issue with the csv module's reader too.
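This is roughly what I tried with the stdlib csv module (a sketch; fn is the same path as above), and it dies with the same UnicodeDecodeError:

import csv

with open(fn, newline='') as f:             # text mode; decodes as utf-8 on my system
    reader = csv.reader(f, delimiter='\t')
    rows = list(reader)                     # raises UnicodeDecodeError at the 0xc0 byte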
If I import the file into OpenOffice Calc, it gets imported properly and the columns are recognized correctly; presumably the offending 0xc0 byte is just ignored there. It is not some vital piece of the data, it's probably just a fluke write error by the system that generated this file. I'd be happy to even zap the line where this occurs if it comes to that (see the sketch below); I just want to read the file into the python program. The error_bad_lines=False option of pandas ought to have taken care of this problem, but no dice. Also, the file does NOT have any content in non-English scripts that would make unicode necessary; it's all standard English letters and numbers. I tried encoding='utf-16', encoding='utf-32', etc. too, but they only caused more errors of their own.
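In case I do end up zapping the bad line(s) myself, this is the kind of preprocessing I have in mind; a rough sketch, assuming the rogue 0xc0 byte(s) are the only decoding problem in the file:

import io
import pandas as pd

# Read raw bytes, drop any line containing the rogue 0xc0 byte,
# then hand the cleaned data to pandas as an in-memory buffer.
raw_lines = open(fn, 'rb').read().splitlines()
good = [ln for ln in raw_lines if b'\xc0' not in ln]
df = pd.read_table(io.BytesIO(b'\n'.join(good)),
                   na_filter=False, error_bad_lines=False)

Still, I'd prefer a way to make pandas handle this directly rather than pre-filtering the file.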
How do I make Python (a pandas DataFrame in particular) read a file that contains one or more rogue 0xc0 bytes?