I am trying to parse Medline XML documents using iterparse in the xml.etree.ElementTree module. Everything works well except that some of the text includes non-ASCII characters, and I do not see a way of handling Unicode with findtext. Any suggestions?
2 Answers
2
Have you tried opening the file with the UTF-8 encoding flag?
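import xml.etree.ElementTree
# Python 3: the built-in open() takes an encoding argument (on Python 2, use io.open)
fd = open('some.xml', mode='r', encoding='utf-8')
xml.etree.ElementTree.iterparse(fd)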
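Or read the raw bytes and decode them yourself:
from io import StringIO
import xml.etree.ElementTree
fd = open('some.xml', mode='rb')  # binary mode, so read() returns bytes that can be decoded
sio = StringIO(fd.read().decode("utf-8"))
xml.etree.ElementTree.iterparse(sio)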
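A third option, sketched here assuming Python 3, is to hand iterparse a binary file object so the parser picks up the encoding from the XML declaration itself; findtext then returns ordinary Unicode strings. The sample document and the element names (Records, Record, Title) are made up for illustration and are not the real Medline schema.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a Medline file; element names are
# illustrative only.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<Records>'
    '<Record><Title>Señor Müller</Title></Record>'
    '<Record><Title>Plain ASCII</Title></Record>'
    '</Records>'
).encode("utf-8")


def iter_titles(source):
    """Yield the Title text of each Record once the record is fully parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            yield elem.findtext("Title")
            elem.clear()  # drop the record's children to keep memory flat


# io.BytesIO stands in for open('some.xml', 'rb')
titles = list(iter_titles(io.BytesIO(SAMPLE)))
```

With a real file, pass open('some.xml', 'rb') in place of the BytesIO object.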

chown
I think this should work, but I'm still getting errors. Next step is to validate that the encoding is, indeed, UTF-8 – seandavi Nov 03 '11 at 15:07
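One way to run that check, as a rough sketch assuming Python 3: decode the raw bytes strictly and report where the first failure occurs. The helper name first_invalid_utf8 is hypothetical, not part of any library.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


good = "naïve".encode("utf-8")
bad = "naïve".encode("latin-1")  # the lone 0xEF byte is not valid UTF-8

good_offset = first_invalid_utf8(good)
bad_offset = first_invalid_utf8(bad)
```

For a file on disk, read it in binary mode (open('some.xml', 'rb').read()) and pass the bytes in; a non-None result suggests the file is in some other encoding, such as Latin-1.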
0
This was a very useful post in addition to the answer above.