lxml encoding error when parsing utf8 xml

Question

I'm trying to iterate through an XML file (UTF-8 encoded, starts with an XML declaration) with lxml, but I get the following error on the character 丂:

UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this are printed out correctly. The code is:

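from lxml import etree

# Parse the file, explicitly telling lxml that it is UTF-8.
parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    # Print the text of each element's first child.
    print elem[0].text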

Does the error mean that it didn't parse the file as UTF-8 but as Shift JIS instead?

score 2 · Accepted Answer · edited May 23 '17 at 12:22

The stack trace of the UnicodeEncodeError points to the location where the exception occurs. Unfortunately you didn't include it, but it's most likely the last line, where the Unicode text is printed to stdout. I assume that stdout uses the cp932 encoding on your system.
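To illustrate why this points at printing rather than parsing, here is a minimal sketch that tries to encode the character from the question with two codecs. The helper name `try_encode` is made up for this example. The error is an *encode* error, which is raised when text is turned into bytes for output, not when the file is read.

```python
# The character from the question, written as an escape so the
# source file's own encoding does not matter.
text = u"\u4e02"


def try_encode(s, encoding):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError:
        return None


# UTF-8 can represent every Unicode character.
utf8_bytes = try_encode(text, "utf-8")

# cp932 has no mapping for this character, which is the failure
# reported in the question's error message.
cp932_bytes = try_encode(text, "cp932")
```

If `cp932_bytes` is `None` while `utf8_bytes` is not, the text itself was decoded correctly and only the output encoding is too narrow.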

If my assumptions are correct, you should consider changing your environment so that stdout uses an encoding that can represent all Unicode characters, such as UTF-8 (see, for example, "Writing unicode strings via sys.stdout in Python").
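One way to act on this without changing the console settings is to encode the text explicitly and write the resulting bytes. The following is a sketch only; `write_utf8` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the real output stream so the example is self-contained.

```python
import io


def write_utf8(stream, s):
    """Encode s as UTF-8 and write it, plus a newline, to a binary stream."""
    stream.write(s.encode("utf-8"))
    stream.write(b"\n")


# Stand-in for a binary output stream (an assumption for this demo).
buf = io.BytesIO()
write_utf8(buf, u"\u4e02")
written = buf.getvalue()
```

On Python 3 the binary stream behind stdout is `sys.stdout.buffer`; on Python 2, `sys.stdout` accepts byte strings directly. Alternatively, setting the environment variable `PYTHONIOENCODING=utf-8` before starting the interpreter changes the encoding used by `print`. Whether the terminal then displays the character correctly still depends on the console's own code page and font.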

answered Dec 07 '12 at 15:28

Benedikt Waldvogel

Oh, so it was just stdout's encoding, I didn't realize that! I was using it just for testing, so I didn't have a problem after all :D Thank you! – blub Dec 12 '12 at 07:13

score 2 · Answer 2 · answered Oct 08 '13 at 21:33

I had a similar situation using lxml's objectify. Here's how I was able to fix it.

import unicodedata
my_name = root.name.text
if isinstance(my_name, unicode):
    # Normalize, then encode to an ASCII byte string; characters with no ASCII equivalent are dropped.
    my_name = unicodedata.normalize('NFKD', my_name).encode('ascii', 'ignore')
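Note that this approach is lossy. The sketch below wraps the same two calls in a hypothetical helper, `to_ascii`, to show what happens to different inputs: accented Latin letters lose their accents, while characters with no ASCII decomposition, such as the one from the question, disappear entirely.

```python
import unicodedata


def to_ascii(s):
    """Decompose s with NFKD, then drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore")


# "Café": NFKD splits the accented letter into "e" plus a combining
# accent, and the accent is then discarded by the ASCII encoder.
latin = to_ascii(u"Caf\u00e9")

# The character from the question has no decomposition, so it is
# discarded and nothing remains.
cjk = to_ascii(u"\u4e02")
```

This avoids the exception, but it is only suitable when losing the non-ASCII characters is acceptable.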

answered Oct 08 '13 at 21:33

paragbaxi

Worked perfectly for `r = requests.get(...)` that would not work in `objectify.XML(r.text)` – Juha Untinen Mar 09 '16 at 09:41
