
I'm trying to iterate through a UTF-8 encoded XML file with lxml, but I get the following error on the character 丂:

UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this are printed out correctly. The code is:

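from lxml import etree

# Parse the file as UTF-8 and print the text of each element's first child.
parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    print(elem[0].text)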

Does the error mean that the file was parsed as Shift JIS instead of UTF-8?

Benedikt Waldvogel
blub

2 Answers


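The stack trace of the UnicodeEncodeError points to the location where the exception occurs. You didn't include it, but it is most likely the last line, where the unicode text is printed to stdout. Note that this is an encode error, not a decode error, which suggests the file itself was parsed correctly and the failure happens only when the text is converted for output. I assume that stdout uses the cp932 encoding on your system.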
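To check this assumption without lxml, one can try encoding the character from the error message directly. This is only a sketch: `can_encode` is a hypothetical helper written for this illustration, and the value reported for `sys.stdout.encoding` depends on your console.

```python
# -*- coding: utf-8 -*-
import sys

# The character from the error message, as an already-decoded unicode string.
text = u"\u4e02"

# The encoding print uses for stdout (may be None when output is piped).
print(repr(sys.stdout.encoding))


def can_encode(value, encoding):
    """Hypothetical helper: True if `value` can be encoded with `encoding`."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(text, "cp932"))  # False, matching the error in the question
print(can_encode(text, "utf-8"))  # True
```

If the first line printed is `'cp932'`, the failure comes from printing, not from parsing.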

If my assumptions are correct, consider changing your environment so that stdout uses an encoding that can represent all Unicode characters, such as UTF-8 (see, for example, Writing unicode strings via sys.stdout in Python).
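One possible way to do this from inside the script is to wrap the output stream in a UTF-8 writer from the standard `codecs` module. The sketch below uses an in-memory buffer so it runs anywhere; `utf8_writer` is a hypothetical helper name, not part of lxml or the standard library.

```python
# -*- coding: utf-8 -*-
import codecs
import io


def utf8_writer(binary_stream):
    """Wrap a byte stream so unicode text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)


# Demonstrated on an in-memory byte buffer instead of the real stdout.
buf = io.BytesIO()
utf8_writer(buf).write(u"\u4e02")
encoded = buf.getvalue()  # the three UTF-8 bytes for U+4E02
```

For real output, you would pass `sys.stdout` on Python 2 or `sys.stdout.buffer` on Python 3 instead of `buf`. Alternatively, setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python changes the stdout encoding without touching the code. Either way, the console itself still needs to be able to display the character.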

Benedikt Waldvogel
  • Oh, so it was just stdout's encoding, I didn't realize that! I was only using it for testing, so I didn't have a problem after all :D Thank you! – blub Dec 12 '12 at 07:13

I had a similar situation using lxml's objectify. Here's how I was able to fix it.

import unicodedata
my_name = root.name.text
if isinstance(my_name, unicode):
    # Normalize, then encode to an ASCII byte string.
    # Characters with no ASCII equivalent are silently dropped.
    my_name = unicodedata.normalize('NFKD', my_name).encode('ascii', 'ignore')
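Note that this approach is lossy. The sketch below applies the same normalize-and-encode step to two sample strings; `to_ascii` is a hypothetical wrapper name and the sample strings are chosen only for illustration.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(value):
    """Hypothetical wrapper around the NFKD + ASCII 'ignore' step above."""
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore")


accented = to_ascii(u"caf\u00e9")  # accent is stripped, base letters survive
kanji = to_ascii(u"\u4e02")        # no ASCII equivalent, so nothing is left
```

Here `accented` is the byte string `cafe`, while `kanji` is empty, so for CJK text like the 丂 in the question this avoids the error only by discarding the character.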
paragbaxi
  • 3,965
  • 8
  • 44
  • 58