xml.etree.ElementTree and unicode findtext

Question

I am trying to parse Medline XML documents using iterparse in the xml.etree.ElementTree module. Everything works except that some of the text includes non-ASCII characters, and I do not see a way of handling Unicode with findtext. Any suggestions?

score 2 · Answer 1 · answered Nov 03 '11 at 13:59

Have you tried opening the file with the UTF-8 encoding flag?

import xml.etree.ElementTree
# Python 3: text mode with an explicit encoding
fd = open('some.xml', mode='r', encoding='utf-8')
xml.etree.ElementTree.iterparse(fd)

Or use decode:

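from io import StringIO
import xml.etree.ElementTree
fd = open('some.xml', mode='rb')  # binary mode, so read() returns bytes that can be decoded
sio = StringIO(fd.read().decode("utf-8"))
xml.etree.ElementTree.iterparse(sio)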
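
For what it's worth, here is a minimal sketch of the whole flow, assuming Python 3. The sample record and the element names in it are made up for illustration and are not a real Medline schema. It hands the parser raw bytes so the encoding in the XML declaration is honoured, and findtext then returns ordinary str values with the non-ASCII characters intact.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; real Medline records have many more fields.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<MedlineCitationSet>'
    '<MedlineCitation><PMID>1</PMID>'
    '<Article><ArticleTitle>Étude de la protéine β</ArticleTitle></Article>'
    '</MedlineCitation>'
    '</MedlineCitationSet>'
).encode("utf-8")


def iter_titles(source):
    """Yield (pmid, title) for each MedlineCitation element in source."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "MedlineCitation":
            yield elem.findtext("PMID"), elem.findtext("Article/ArticleTitle")
            elem.clear()  # drop the element's children to keep memory flat


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

For a file on disk, passing open('some.xml', 'rb') in place of the BytesIO object should behave the same way.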

answered Nov 03 '11 at 13:59

chown

I think this should work, but I'm still getting errors. Next step is to validate that the encoding is, indeed, UTF-8 – seandavi Nov 03 '11 at 15:07
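
One way to do that validation, sketched under the assumption that the file fits in memory: try to decode the raw bytes and report where decoding first fails. The function name find_utf8_error is just an illustrative helper, not part of any library.

```python
def find_utf8_error(data):
    """Return None if data is valid UTF-8, else (byte_offset, reason)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        return exc.start, exc.reason
    return None


good = "café".encode("utf-8")
bad = b"abc\xe9def"  # 0xE9 is 'é' in Latin-1, but not valid UTF-8 here

print(find_utf8_error(good))
print(find_utf8_error(bad))
```

For a real file, the bytes would come from open('some.xml', 'rb').read(). If an error is reported, the file is probably in another encoding (Latin-1 is a common culprit) or is not plain text at all, for example because it is still compressed.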

score 0 · Answer 2 · edited May 23 '17 at 12:11

In addition to the answer above, this post was very useful:

Reading utf-8 characters from a gzip file in python
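
A rough sketch of how that might look with iterparse, assuming Python 3 and a gzip-compressed input file. The helper name iter_findtext, the demo file, and its element names are all hypothetical. The key point is to open the gzip stream in binary mode and let the XML parser do the decoding.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_findtext(path, tag, child):
    """Yield findtext(child) for every <tag> element in a gzipped XML file."""
    with gzip.open(path, "rb") as fd:  # bytes in; the parser handles the encoding
        for _event, elem in ET.iterparse(fd, events=("end",)):
            if elem.tag == tag:
                yield elem.findtext(child)
                elem.clear()


# Demo with a throwaway gzip file containing a non-ASCII name.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<set><rec><name>Müller</name></rec></set>'
).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml.gz")
    with gzip.open(demo_path, "wb") as out:
        out.write(xml_bytes)
    names = list(iter_findtext(demo_path, "rec", "name"))

print(names)
```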

answered Nov 03 '11 at 15:08

seandavi
