0

I have this exact problem: https://www.en.adwords-community.com/t5/Basics-for-New-Advertisers/Character-Encoding-used-by-the-editor/td-p/100244 (tl;dr: trying to upload a file to google, contains foreign characters, they look funny when opened in excel and google is rejecting them for not being properly encoded)

I have the following code. Note that I've tried adding a byte order mark to the beginning of the http response object, as well as tried to encode all strings as utf-8.

<some code where workbook is created and populated via xlwt>
output = StringIO.StringIO()
workbook.save(output)
wb = open_workbook(file_contents=output.getvalue())
sheet = wb.sheet_by_name(spreadsheet)

response = HttpResponse(content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename='+(account.name+'-'+spreadsheet).replace(',', '')+'.csv'
response.write('\xEF\xBB\xBF')
writer = csv.writer(response)
for rownum in xrange(sheet.nrows):
    newRow = []
    for s in sheet.row_values(rownum):
        if isinstance(s,unicode):
            newRow.append(s.encode("utf-8"))
        elif isinstance(s, float):
            newRow.append(int(s))
        else:
            newRow.append(s.decode('utf-8'))
    writer.writerow(newRow)
return response

But they still don't look right when opened in Excel! Why?

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Colleen
  • 23,899
  • 12
  • 45
  • 75
  • Why do you `decode` what is already a `utf-8` string? You risk that the `csv` module might choose another encoding when it writes it. – Mark Ransom Jun 09 '14 at 16:01
  • Because I'm dumb and don't fully understand python encoding/decoding stuff and I read http://stackoverflow.com/questions/10406135/unicodedecodeerror-ascii-codec-cant-decode-byte-0xd1-in-position-2-ordinal and I don't know... – Colleen Jun 09 '14 at 16:09
  • Just replace that part with a `pass` then and see what happens. – Mark Ransom Jun 09 '14 at 16:13
  • Post this as an answer and I'll give you the bounty! It worked! – Colleen Jun 09 '14 at 16:34

2 Answers2

2

You want to write encoded data always, but for string values you are decoding to Unicode values:

else:
    newRow.append(s.decode('utf-8'))

Most likely your web framework is encoding that data to Latin-1 instead in that case.

Just append the value without decoding:

for s in sheet.row_values(rownum):
    if isinstance(s, unicode):
        s = s.encode("utf-8"))
    elif isinstance(s, float):
        s = int(s)
    newRow.append(s)

Further tips:

  • It's a good idea to communicate the character set in the response headers too:

    response = HttpResponse(content_type='text/csv; charset=utf-8')
    
  • Use codecs.BOM_UTF8 to write the BOM instead of hardcoding the value. Much less error prone.

    response.write(codecs.BOM_UTF8)
    
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
2

Whenever you write a Unicode string to a file or stream it must be encoded. You can do the encoding yourself, or you can let the various module and library functions attempt to do it for you. If you're not sure what encoding will be selected for you, and you know which encoding you want written, it's better to do the encoding yourself.

You've followed this advice already when you encounter a Unicode string in the input. However, when you encounter a string that's already encoded as UTF-8, you decode it back to Unicode! This results in the reverse conversion being done in writerow, and evidently it's not picking utf-8 as the default encoding. By leaving the string alone instead of decoding it the writerow will write it out exactly as you intended.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622