Goal: To convert .xlsx file to .csv file, then read into pandas dataframe. (pd.read_excel() works too).
Breakdown: If I write xlsx file to csv, the csv will error with pd.read_csv(..., but if I manually create or change the file it works fine.
import pandas as pd; import csv; import xlrd
# open excel file
wb = xlrd.open_workbook(my_xlsx)
sh = wb.sheet_by_name('Sheet1')
# write each row from excel to csv
with open(my_csv, 'w', newline='') as csvFile:
wr = csv.writer(csvFile, quoting=csv.QUOTE_MINIMAL)
for rownum in range(sh.nrows):
wr.writerow(sh.row_values(rownum))
# read into dataframe
df = pd.read_csv(my_csv) # <-- this step throws the byte error
This throws the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 17: invalid start byte
Fixes: If I alter my_csv file at all after writing by copy all/paste/save, or just edit something and save, there is no longer a byte error. I haven't found a way to fix the problem without manual intervention.
Question: What is happening here? Does this error have something to do with windows text handling? Is there a way to fix this by forcing UTF-8 encoding in the write step?
I haven't had a chance to test it in Linux, but expect it to work if this is indeed a Windows plain text error. This is a learning point for me more than a hang-up, but help will be appreciated!
A few related questions: Invalid Start Byte, CSV Windows problems