I have run into a character encoding problem as follows:

rating = 'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(

"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>""" % (values['rating_system'], rating))

The error I get is:

  File "./assetshare.py", line 314, in write_file
    </ratings>""" % (values['rating_system'], rating))

I know that the encoding error is related to Barntillåten, because if I replace that word with test, the function works fine.

Why is this encoding error happening and what do I need to do to fix it?

jfs
David542

3 Answers


rating must be a Unicode string in order to contain Unicode codepoints.

rating = u'Barntillåten'

Otherwise, in Python 2, the non-Unicode string 'Barntillåten' contains bytes (encoded with whatever your source encoding was), not codepoints.
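To make the bytes-versus-codepoints distinction concrete, here is a minimal sketch. It uses the escape `\xe5` for å so that it does not depend on the source file's encoding, and it assumes the original file was saved as UTF-8 (the question does not say). The variable names are mine, for illustration only.

```python
# \xe5 inside a unicode literal is the code point U+00E5 (å).
text = u'Barntill\xe5ten'

# Encoding to UTF-8 yields the bytes a UTF-8 source file would store
# for the non-Unicode literal 'Barntillåten' (assumed source encoding).
data = text.encode('utf-8')

n_codepoints = len(text)   # counts characters
n_bytes = len(data)        # counts bytes; å takes two bytes in UTF-8

print('%d code points, %d bytes' % (n_codepoints, n_bytes))
```

So the plain literal is one element longer than the word it spells, because what it stores is the encoded form rather than the characters.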

ephemient

In Python 2, codecs.open expects to read and write unicode objects. You're passing it a str.

The fix is to ensure that the data you pass it is unicode:

new_file.write((
"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>""" % (values['rating_system'], rating)
).decode('utf-8'))

If you use unicode literals (u"...") then Python will try to ensure that all data is unicode. Here it would be sufficient to have rating = u'Barntillåten':

rating = u'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(
"""<?xml version="1.0" encoding="UTF-8"?>
   <ratings>
        <rating system="%s">%s</rating>
   </ratings>""" % (values['rating_system'], rating))
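For anyone who wants to try this end to end, the following is a self-contained sketch of the same write. The temporary `folder`, the `'example-system'` value and the `template` name are placeholders I made up, since the question does not show them. The template is also a unicode literal here, which is a slightly stricter choice than the snippet above.

```python
import codecs
import os
import tempfile

# Hypothetical stand-ins for the question's `folder` and `values`.
folder = tempfile.mkdtemp()
values = {'rating_system': u'example-system'}
rating = u'Barntill\xe5ten'  # \xe5 is å; the escape avoids source-encoding issues

# Unicode template, so the formatted result is unicode as well.
template = u"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
    <rating system="%s">%s</rating>
</ratings>"""

path = os.path.join(folder, "metadata.xml")
new_file = codecs.open(path, 'w', 'utf-8')
try:
    # The codecs writer encodes the unicode text to UTF-8 on the way out.
    new_file.write(template % (values['rating_system'], rating))
finally:
    new_file.close()

# Read the raw bytes back to check what actually landed on disk.
with open(path, 'rb') as f:
    raw = f.read()
```

After this runs, `raw` should contain the two-byte UTF-8 sequence for å, matching the `encoding="UTF-8"` declaration in the XML header.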

You can write a str object to a codecs.open file, but only if the str can be decoded using the default encoding, so in practice this is only safe when the str is plain ASCII. The default encoding is ASCII and should be left that way; see Changing default encoding of Python?
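This is also why replacing the word with test made the error go away. As a rough sketch, the decode that Python 2 performs implicitly with the default encoding can be done explicitly; `decodes_as_ascii` is a helper name I introduced for illustration, and the accented bytes assume a UTF-8 source file.

```python
def decodes_as_ascii(data):
    """Return True if the byte string survives an ASCII decode."""
    try:
        data.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False

plain = b'test'
# The bytes a UTF-8 source file would hold for 'Barntillåten' (assumption).
accented = u'Barntill\xe5ten'.encode('utf-8')

plain_ok = decodes_as_ascii(plain)        # pure ASCII, decodes cleanly
accented_ok = decodes_as_ascii(accented)  # contains bytes above 0x7f

print('plain: %s, accented: %s' % (plain_ok, accented_ok))
```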

ecatmur
`.decode('utf-8')` assumes that `values['rating_system']` and `rating` are bytes representing UTF-8 encoded text, which in this case requires the source code character encoding to be UTF-8. It is better to use Unicode literals in the first place, without calling `.decode()` later. – jfs Aug 21 '12 at 22:49
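A small sketch of the point made in this comment: the same word gives different bytes depending on the encoding, and only the UTF-8 bytes survive `.decode('utf-8')`. Latin-1 is used here purely as an example of a different source encoding; the thread does not say which encoding the file actually used.

```python
text = u'Barntill\xe5ten'  # \xe5 is å

utf8_bytes = text.encode('utf-8')      # what a UTF-8 source file would hold
latin1_bytes = text.encode('latin-1')  # what a Latin-1 source file would hold

# Decoding with the matching codec recovers the original text.
roundtrip = utf8_bytes.decode('utf-8')

# Decoding Latin-1 bytes as UTF-8 fails, because the lone byte 0xe5
# starts a multi-byte sequence that is never completed.
try:
    latin1_bytes.decode('utf-8')
    latin1_failed = False
except UnicodeDecodeError:
    latin1_failed = True
```

With a unicode literal there are no bytes to guess about, which is why it is the more robust fix.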

You need to use unicode literals.

u'...'
u"..."
u'''......'''
u"""......"""
Ignacio Vazquez-Abrams