2

I am trying to parse Medline XML documents using iterparse in the xml.etree.ElementTree module. All is working well except that some of the text includes non-ASCII characters. I do not see a way of handling Unicode using findtext. Any suggestions?

seandavi

2 Answers

2

Have you tried opening the file with the UTF-8 encoding flag?

import xml.etree.ElementTree
fd = open('some.xml', mode='r', encoding='utf-8')  # the encoding argument requires Python 3
xml.etree.ElementTree.iterparse(fd)

Or use decode:

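import xml.etree.ElementTree
from io import StringIO
fd = open('some.xml', mode='rb')  # binary mode, so read() returns bytes
sio = StringIO(fd.read().decode("utf-8"))  # decode the bytes to text explicitly
xml.etree.ElementTree.iterparse(sio)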
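
A third option, sketched here under the assumption of Python 3, is to open the file in binary mode and let the parser pick up the encoding from the XML declaration; findtext then returns ordinary Unicode strings. The element names (`Article`, `ArticleTitle`), the helper `iter_titles`, and the sample data are hypothetical and only stand in for a real Medline record.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a Medline file, encoded as UTF-8 bytes.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Set><Article><ArticleTitle>Sj\u00f6gren syndrome</ArticleTitle></Article></Set>'
).encode("utf-8")


def iter_titles(source):
    """Yield the ArticleTitle text of each Article in a binary XML stream."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Article":
            yield elem.findtext("ArticleTitle")
            elem.clear()  # release the children of records already processed


titles = list(iter_titles(io.BytesIO(SAMPLE)))
```

For a file on disk, `open('some.xml', 'rb')` would take the place of the `io.BytesIO` object.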
chown
  • I think this should work, but I'm still getting errors. Next step is to validate that the encoding is, indeed, UTF-8 – seandavi Nov 03 '11 at 15:07
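
One way to run the check mentioned in the comment above is to decode the raw bytes and report where decoding fails. This is only a sketch; the helper name `find_invalid_utf8` is made up for illustration.

```python
def find_invalid_utf8(data):
    """Return None if data is valid UTF-8, else (byte offset, reason)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        return exc.start, exc.reason
    return None


# Latin-1 encoded text is a common source of such failures.
valid = find_invalid_utf8("caf\u00e9".encode("utf-8"))
invalid = find_invalid_utf8("caf\u00e9".encode("latin-1"))
```

For a real file, the bytes would come from `open('some.xml', 'rb').read()`; a non-None result suggests the file is not UTF-8 and the encoding in its XML declaration is worth checking.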
0

This was a very useful post in addition to the answer above.

Reading utf-8 characters from a gzip file in python
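
A minimal sketch of how gzip-compressed XML might be fed to iterparse, assuming Python 3. It is not taken from the linked post; the helper `texts_from_gzip` and the sample data are hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def texts_from_gzip(fileobj, tag):
    """Yield the text of every `tag` element in a gzip-compressed XML stream."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as fh:
        # The decompressed stream is bytes, so the parser reads the
        # encoding from the XML declaration itself.
        for event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == tag:
                yield elem.text


# Hypothetical in-memory data standing in for a .xml.gz file.
raw = '<?xml version="1.0" encoding="UTF-8"?><r><t>caf\u00e9</t></r>'.encode("utf-8")
result = list(texts_from_gzip(io.BytesIO(gzip.compress(raw)), "t"))
```

For a file on disk, `gzip.open(path, 'rb')` can be passed to iterparse directly in the same way.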

Community
seandavi