0

Goal: To convert .xlsx file to .csv file, then read into pandas dataframe. (pd.read_excel() works too).

Breakdown: If I write xlsx file to csv, the csv will error with pd.read_csv(..., but if I manually create or change the file it works fine.

import pandas as pd; import csv; import xlrd

# open excel file
wb = xlrd.open_workbook(my_xlsx)
sh = wb.sheet_by_name('Sheet1')

# write each row from excel to csv
with open(my_csv, 'w', newline='') as csvFile:
    wr = csv.writer(csvFile, quoting=csv.QUOTE_MINIMAL)
    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))

# read into dataframe
df = pd.read_csv(my_csv)       # <-- this step throws the byte error

This throws the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 17: invalid start byte

Fixes: If I alter my_csv file at all after writing by copy all/paste/save, or just edit something and save, there is no longer a byte error. I haven't found a way to fix the problem without manual intervention.

Question: What is happening here? Does this error have something to do with windows text handling? Is there a way to fix this by forcing UTF-8 encoding in the write step?

I haven't had a chance to test it in Linux, but expect it to work if this is indeed a Windows plain text error. This is a learning point for me more than a hang-up, but help will be appreciated!

A few related questions: Invalid Start Byte, CSV Windows problems

rocket_brain
  • 69
  • 1
  • 8
  • `open` on windows doesn't default to `utf-8` encoding. Be explicit about the encoding when open for reading or writing and never be surprised. Use `locale.getpreferredencoding(False)` if you want to know the default used by [open](https://docs.python.org/3/library/functions.html#open). – Mark Tolonen Dec 28 '19 at 01:42
  • Good explanation. My Windows defaults to 'cp1252' encoding. – rocket_brain Jan 03 '20 at 16:11

1 Answers1

1

Maybe you need to open csv with encoding='utf-8' option.

Here is code.

import pandas as pd; import csv; import xlrd

# open excel file
wb = xlrd.open_workbook(my_xlsx)
sh = wb.sheet_by_name('Sheet1')

# write each row from excel to csv
with open(my_csv, 'w', encoding='utf-8', newline='') as csvFile:
    wr = csv.writer(csvFile, quoting=csv.QUOTE_MINIMAL)
    for rownum in range(sh.nrows):
        wr.writerow(sh.row_values(rownum))

# read into dataframe
df = pd.read_csv(my_csv)
joonghyup cha
  • 624
  • 3
  • 12