Why is read_csv failing with invalid start byte error?

Question

Goal: To convert .xlsx file to .csv file, then read into pandas dataframe. (pd.read_excel() works too).

Breakdown: If I write xlsx file to csv, the csv will error with pd.read_csv(..., but if I manually create or change the file it works fine.

import pandas as pd; import csv; import xlrd

# open excel file
wb = xlrd.open_workbook(my_xlsx)
sh = wb.sheet_by_name('Sheet1')

# write each row from excel to csv
with open(my_csv, 'w', newline='') as csvFile:
    wr = csv.writer(csvFile, quoting=csv.QUOTE_MINIMAL)
    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))

# read into dataframe
df = pd.read_csv(my_csv)       # <-- this step throws the byte error

This throws the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 17: invalid start byte

Fixes: If I alter my_csv file at all after writing by copy all/paste/save, or just edit something and save, there is no longer a byte error. I haven't found a way to fix the problem without manual intervention.

Question: What is happening here? Does this error have something to do with windows text handling? Is there a way to fix this by forcing UTF-8 encoding in the write step?

I haven't had a chance to test it in Linux, but expect it to work if this is indeed a Windows plain text error. This is a learning point for me more than a hang-up, but help will be appreciated!

A few related questions: Invalid Start Byte, CSV Windows problems

`open` on windows doesn't default to `utf-8` encoding. Be explicit about the encoding when open for reading or writing and never be surprised. Use `locale.getpreferredencoding(False)` if you want to know the default used by [open](https://docs.python.org/3/library/functions.html#open). — Mark Tolonen, Dec 28 '19 at 01:42

score 1 · Accepted Answer · answered Dec 28 '19 at 00:49

Maybe you need to open csv with encoding='utf-8' option.

Here is code.

import pandas as pd; import csv; import xlrd

# open excel file
wb = xlrd.open_workbook(my_xlsx)
sh = wb.sheet_by_name('Sheet1')

# write each row from excel to csv
with open(my_csv, 'w', encoding='utf-8', newline='') as csvFile:
    wr = csv.writer(csvFile, quoting=csv.QUOTE_MINIMAL)
    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))

# read into dataframe
df = pd.read_csv(my_csv)

Why is read_csv failing with invalid start byte error?

1 Answers1