21

I'm trying to do some data work in Python pandas and having trouble writing out my results. I read my data in as a CSV file and been exporting each script as it's own CSV file which works fine. Lately though I've tried exporting everything in 1 Excel file with worksheets and a few of the sheets give me an error

"'utf8' codec can't decode byte 0xe9 in position 1: invalid continuation byte"

I have no idea how to even start finding any characters that could be causing problems exporting to Excel. Not sure why it exports to CSV just fine though :(

relevant lines

from pandas import ExcelWriter
data = pd.read_csv(input)
writer = ExcelWriter(output) #output is just the filename
fundraisers.to_excel(writer, "fundraisers")
locations.to_excel(writer, "locations") #error
locations.to_csv(outputcsv) #works
writer.save()

printing head of offending dataframe

Event ID    Constituent ID  Email Address   First Name  \   Last Name
f       1       A       A       1
F       4       L       R       C
M       1       1       A       D
F       4       A       A       G
M       2       0       R       G
M       3       O       O       H
M       2       T       E       H
M       2       A       A       H
M       2       M       M       K
F       3       J       E       K
Location ID raised  raised con  raised email
a   0   0   0
a   8   0   0
o   0   0   0
o   0   0   0
o   0   0   0
t   5   0   0
o   1   0   0
o   6   a   0
o   6   0   0
d   0   0   0

looking at the excel sheet I do actually get a partial print out. Anything in the first name column and beyond are blank, but event, constituent and email all print.

edit: Trying to read the csv in as utf8 fails, but reading it in as latin1 works. Is there a way to specify the to_excel encoding? Or decode and encode my dataframe to utf8?

Wizuriel
  • 3,617
  • 4
  • 21
  • 26
  • Can you post some of your possibly offending unicode characters? Generating random Hebrew characters and then writing them works okay for me. – Phillip Cloud Sep 05 '13 at 20:38
  • 2
    I have no idea how to find which character it finds offensive. – Wizuriel Sep 05 '13 at 20:43
  • One reason that CSV is different is because the Excel exporting is not done by `pandas` it is done by third-party libraries (`xlwt` and/or `openpyxl`). – Phillip Cloud Sep 05 '13 at 20:43
  • Since the error message says "... in position 1", you can start by looking at every *second* character of every one of your unicode columns. Just show a few of them. – Phillip Cloud Sep 05 '13 at 20:45
  • 1
    related: http://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte – Phillip Cloud Sep 05 '13 at 20:46
  • 5
    I've had problems in the past where I was reading CSV files into Pandas that were utf8 encoded without specifying the encoding. e.g. try pd.read_csv(input, encoding='utf-8') and similar on write. Or maybe you have 'latin-1' input data - seems to pop up in google search for your error. – patricksurry Sep 06 '13 at 01:31

6 Answers6

23

Managed to solve this.

I made a function that goes through my columns that have strings and managed to decode/encode them into utf8 and it now works.

def changeencode(data, cols):
    for col in cols:
        data[col] = data[col].str.decode('iso-8859-1').str.encode('utf-8')
    return data   
Wizuriel
  • 3,617
  • 4
  • 21
  • 26
  • 8
    Thanks! This function was quite useful. It would be helpful for Pandas to have this type of capability built in, instead of just throwing a less than insightful error (or at least identifying the first row and column in which it encountered the problem). – murphsp1 Dec 11 '14 at 10:46
14

In my case, the problem was that I was initially reading the CSV file with the wrong encoding (ASCII instead of cp1252). Therefore, when pandas tried to write it to an Excel file, it found some characters it couldn't decode.

I solved it by specifying the correct encoding when reading the CSV file.

data = pd.read_csv(fname, encoding='cp1252')
Zenadix
  • 15,291
  • 4
  • 26
  • 41
5

Actually, there is a way to force utf8 encoding by passing a parameter to ExcelWriter:

 ew = pandas.ExcelWriter('test.xlsx',options={'encoding':'utf-8'})
 sampleList = ['Miño', '1', '2', 'señora']
 dataframe = pandas.DataFrame(sampleList)
 dataframe.to_excel(ew)
 ew.save()
user3570953
  • 59
  • 1
  • 1
2

The simplest thing is to load your dataframe in utf-8. Then it ExcelWriter will save it no problem.

data = pd.read_csv(path,encoding='utf-8')
billmanH
  • 1,298
  • 1
  • 14
  • 25
1

don't know when it's going to be released but you can try with my github repository:

https://github.com/jtornero/pandas

You can clone it and build pandas from source; the issue is almost solved and it works like

sampleList = ['Miño', '1', '2', 'señora']
dataframe = pandas.DataFrame(sampleList)
ew = pandas.ExcelWriter('./test.xls', encoding='utf-8')
dataframe.to_excel(ew)
ew.save()

Cheers

Jorge Tornero

0

Similar to what was said by @Zenadix, reading the csvs in as UTF-8 allowed the ExcelWriter to write without an error.

df = pd.read_csv('path', encoding='utf-8')

...

with pd.ExcelWriter('new_path') as writer:
    df.to_excel(writer, sheet_name='Foo')


Jhirschibar
  • 215
  • 1
  • 3
  • 16