I have an XML file containing Chinese characters and want to parse it with xml.etree.ElementTree so I can process it further. An equivalent file containing English text parses fine, but the Chinese one fails. For example:

>el.xml printf '%s\n' $'<?xml version=\'1.0\' encoding=\'utf8\'?><Color>Grey</Color>'
>cl.xml printf '%s\n' $'<?xml version=\'1.0\' encoding=\'utf8\'?><Color>灰色</Color>'

tryParse() {
  python -c 'import xml.etree.ElementTree as ET; import sys; ET.parse(sys.argv[1])' "$@"
}

tryParse el.xml && printf '%s\n\n' "English works"
tryParse cl.xml && printf '%s\n\n' "Chinese works"

...emits as output:

English works

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 44
Charles Duffy
bryan.blackbee

1 Answer

Use lxml instead; it parses the same file without error:

>>> import lxml.etree as ET
>>> doc = ET.parse('cl.xml')
>>> print doc.getroot().text
灰色
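
If installing lxml is not an option, one thing worth trying with the standard library is the encoding label in the XML declaration. This is an assumption, not a confirmed diagnosis of the Python 2.7 setup in the question: the standard spelling is `UTF-8` rather than `utf8`, and column 44 in the error appears to point at the first Chinese character, which is what you would expect if the parser did not treat the bytes as UTF-8. A minimal sketch (the helper name `parse_color` is made up for illustration) that writes the same document with a `UTF-8` label and parses it with xml.etree:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Same document as cl.xml, except the declaration says "UTF-8".
# Assumption: the nonstandard "utf8" label is what trips the parser.
# \u7070\u8272 is the escaped form of the two Chinese characters in cl.xml.
XML_TEXT = u"<?xml version='1.0' encoding='UTF-8'?><Color>\u7070\u8272</Color>\n"


def parse_color(xml_text):
    """Write xml_text to a temp file as UTF-8 bytes and return the root element's text."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(xml_text.encode("utf-8"))
        return ET.parse(path).getroot().text
    finally:
        os.remove(path)


color = parse_color(XML_TEXT)
```

If this parses on your interpreter while the `utf8` version does not, changing the declaration in cl.xml is the smaller fix; otherwise lxml remains the safer route.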
Charles Duffy