Parsing large XML data using Python's ElementTree

Question

I'm currently learning how to parse XML data using ElementTree. I got an error that says: ParseError: not well-formed (invalid token): line 1, column 2.

My code is right below, and a bit of the xml data is after my code.

import xml.etree.ElementTree as ET

tree = ET.fromstring("C:\pbc.xml")
root = tree.getroot()


for article in root.findall('article'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):  # all venue tags with id attribute
        print 'journal'

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
<volume>RJ909</volume>
<month>August</month>
<year>1971</year>
<cdrom>ibmTR/rj909.pdf</cdrom>
<ee>db/labs/ibm/RJ909.html</ee>
</article>

<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
<month>November</month>
<year>1974</year>
</article>

unutbu · Answer 1 · 2013-05-20T14:50:31.503

1

with open(r"C:\pbc.xml", 'rb') as f:
    root = ET.fromstring(f.read().strip())

Unlike ET.parse, ET.fromstring expects a string with XML content, not the name of a file.

Also in contrast to ET.parse, ET.fromstring returns a root Element, not a Tree. So you should omit

root = tree.getroot()

Also, the XML snippet you posted needs a closing </dblp> to be parsable. I assume your real data has that closing tag...
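To make the difference concrete, here is a minimal sketch in Python 3 (the code in this thread is Python 2). The sample document and the temporary file are made up for illustration; only ET.fromstring, ET.parse and ET.ParseError come from the standard library.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical miniature of the DBLP data, used only for illustration.
XML_BYTES = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
<article key="persons/Codd71a">
<author>E. F. Codd</author>
<journal>IBM Research Report, San Jose, California</journal>
</article>
</dblp>
"""

# fromstring takes the XML content itself and returns the root Element.
root_from_string = ET.fromstring(XML_BYTES)

# parse takes a filename (or file object) and returns an ElementTree;
# getroot() is needed to reach the root Element.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(XML_BYTES)
    path = tmp.name
try:
    tree = ET.parse(path)
    root_from_file = tree.getroot()
finally:
    os.remove(path)

# Passing a filename to fromstring fails, because the path is not XML.
try:
    ET.fromstring(r"C:\pbc.xml")
    path_error = None
except ET.ParseError as exc:
    path_error = str(exc)

# A document cut off before the closing </dblp> tag fails as well.
truncated = XML_BYTES.replace(b"</dblp>", b"")
try:
    ET.fromstring(truncated)
    truncated_ok = True
except ET.ParseError:
    truncated_ok = False

print(root_from_string.tag, root_from_file.tag)
print("filename passed to fromstring:", path_error)
print("truncated document parsed:", truncated_ok)
```

Both roots are the same kind of object; the only difference is whether you start from the content or from the file name.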

The iterparse provided by xml.etree.ElementTree does not have a tag argument, although lxml.etree.iterparse does have a tag argument.

Try:

import xml.etree.ElementTree as ET
import htmlentitydefs

filename = "test.xml"
# http://stackoverflow.com/a/10792473/190597 (lambacck)
parser = ET.XMLParser()
parser.entity.update((x, unichr(i)) for x, i in htmlentitydefs.name2codepoint.iteritems())
context = ET.iterparse(filename, events = ('end', ), parser=parser)
for event, elem in context:
    if elem.tag == 'article':
        for author in elem.findall('author'):
            print 'Author name: {}'.format(author.text)
        for journal in elem.findall('journal'):  # print each journal name
            print(journal.text)
        elem.clear()
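The parser.entity approach above is written for Python 2. As a hedged alternative for Python 3, one could rewrite HTML named entities such as &uuml; into numeric character references before parsing, since XML accepts numeric references without any DTD. The helper name html_entities_to_numeric and the sample string are hypothetical and not part of the original answer. For a very large file the conversion would have to be applied line by line rather than to the whole text at once.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already understands; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text):
    """Replace HTML named entities (e.g. &uuml;) with numeric references.

    Hypothetical helper: unknown names and the XML predefined entities
    are returned unchanged.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return ENTITY_RE.sub(replace, text)


# Made-up sample line in the style of the DBLP data.
sample = "<journal>ETH Z&uuml;rich &amp; Co</journal>"
converted = html_entities_to_numeric(sample)
parsed = ET.fromstring(converted)

print(converted)
print(parsed.text)
```

Unlike a blanket unescape, this keeps &amp; and &lt; escaped, so the converted text is still well-formed XML.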

Note: To use iterparse your XML must be well-formed, which means, among other things, that there cannot be empty lines at the beginning of the file.
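The effect of a leading blank line can be reproduced with a small sketch in Python 3. The document below is made up; the point is that an XML declaration must be the very first thing in the input, and that stripping whitespace (as with the .strip() call suggested in this answer) is enough to recover.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a blank line before the XML declaration.
raw = b'\n<?xml version="1.0" encoding="ISO-8859-1"?>\n<dblp><article/></dblp>\n'

try:
    ET.fromstring(raw)
    leading_blank_ok = True
except ET.ParseError as exc:
    leading_blank_ok = False
    print("ParseError:", exc)

# Stripping surrounding whitespace puts the declaration back at offset 0.
root = ET.fromstring(raw.strip())
print(root.tag)
```

If the file has no XML declaration at all, whitespace before the root element is tolerated, so this only matters for files that start with <?xml ...?>.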

edited May 20 '13 at 14:50

answered May 18 '13 at 13:38

unutbu


Hi unutbu, I did exactly what you suggested and got the following error: ParseError: no element found: line 21, column 0. – user2274879 May 18 '13 at 13:52
Remove all empty lines from the beginning of the file, or else add `.strip()` to `f.read()` (see above.) – unutbu May 18 '13 at 13:54
@user2274879: Your XML document is cut off; there should be more data beyond line 21, but if your XML document matches what you posted here exactly, then *at least* the `</dblp>` closing tag is missing. – Martijn Pieters May 18 '13 at 13:54
@unutbu: It's line 21 that is the problem. The XML in the OP has no more than 21 lines, and it is missing data beyond that. – Martijn Pieters May 18 '13 at 13:55
@MartijnPieters: The error you point out is correct, but not the *immediate* error the OP is experiencing. Notice that the error occurs on column 0, not column 10. – unutbu May 18 '13 at 13:56
@unutbu: good point. I suspect that it has more to do with the `.fromstring()` and the `.read()` here. Why not just use `.parse()` instead? – Martijn Pieters May 18 '13 at 13:57
@MartijnPieters: Using `parse` is a great idea, I just didn't think of it :) – unutbu May 18 '13 at 13:59
Hi Unutbu, after including the closing tag, I got the following error: TypeError: iterparse() got an unexpected keyword argument 'tag' – user2274879 May 18 '13 at 14:24
@user2274879: `xml.etree.ElementTree.iterparse` does not have a `tag` argument -- though [lxml.etree.iterparse](http://lxml.de/) does have a `tag` argument. I highly recommend using lxml if you can install it, but if not, I've added some code above which should work with `xml.etree.ElementTree`. – unutbu May 18 '13 at 14:46
Author name: E. F. Codd journal Author name: Patrick A. V. Hall journal – user2274879 May 18 '13 at 16:33
Hi Unutbu, I used your code and it parsed, but not very well. I also got an error message at the end. I got the following result: Author name: E. F. Codd journal Author name: Patrick A. V. Hall journal. I am interested in the journal name, not just the word journal. I also got the following error: ParseError: no element found: line 22, column 0. I believe there is a small problem somewhere, either with the code or with the data. – user2274879 May 18 '13 at 16:35
Given the size of the XML data, it is not possible to carry out manual editing. Is there a way to modify the code to account for those unacceptable characters that may affect the XML parser? – user2274879 May 18 '13 at 20:55
Hi Unutbu, the updated code is working quite well, but the XML parsing stops whenever there is a character like & in a word (i.e. Z&uuml). Getting rid of those characters, or adjusting the code so that the parser ignores such unacceptable characters, may be the solution for now. – user2274879 May 18 '13 at 22:15
Why that is happening is a mystery to me. It might help if we could see a few lines of the XML around the point causing the problem. – unutbu May 19 '13 at 12:19
@unutbu, the character causing the problem is & (amp). I have spent hours translating words with &, and the code seems to work perfectly well on the translated words. The issue now is that the XML data is just too large for me to continue this way. How can I modify the code to ignore the & character? That would be the perfect solution. – user2274879 May 20 '13 at 10:01
The function [xml.sax.saxutils.unescape](http://docs.python.org/2/library/xml.sax.utils.html#xml.sax.saxutils.unescape) can convert `&amp;` to `&`. I've edited my post to show how to use it. – unutbu May 20 '13 at 11:32
@unutbu, I tried using the function (sax.unescape), as well as your code; it only works when the &amp characters are deleted from the XML data. I will quickly read through the sax.unescape function and see how I can modify the code. – user2274879 May 20 '13 at 12:17
If you can post a snippet of what the problematic XML looks like, we can give you more accurate help. Guessing is painful... – unutbu May 20 '13 at 12:20
@unutbu, how can I send the xml data over? I haven't done it before. – user2274879 May 20 '13 at 13:33
You don't have to post the whole thing, just a few lines that include the `&`. Make sure it shows opening and closing tags so we can see some context. – unutbu May 20 '13 at 13:35
Markus Tresch Principles of Distributed Object Database Languages. technical Report 248, ETH Zürich, Dept. of Computer Science July 1996
– user2274879 May 20 '13 at 14:40

Martijn Pieters · Answer 2 · 2013-05-18T14:05:14.933

1

You are using .fromstring() instead of .parse():

import xml.etree.ElementTree as ET

tree = ET.parse(r"C:\pbc.xml")
root = tree.getroot()

.fromstring() expects to be given the XML data in a bytestring, not a filename.

If the document is really large (many megabytes or more) then you should use the ET.iterparse() function instead and clear elements you have processed:

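# The standard library's iterparse has no tag argument (lxml's does),
# so filter on the element tag inside the loop instead.
for event, article in ET.iterparse('C:\\pbc.xml', events=('end',)):
    if article.tag != 'article':
        continue
    for title in article.findall('title'):
        print 'Title: {}'.format(title.text)
    for author in article.findall('author'):
        print 'Author name: {}'.format(author.text)
    for journal in article.findall('journal'):
        print 'Journal: {}'.format(journal.text)

    # Free the memory held by this article before moving on.
    article.clear()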
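The same streaming pattern can be sketched in Python 3 using only the standard library. The function name iter_articles and the in-memory sample are assumptions made for illustration; with a real file you would pass its path to ET.iterparse instead of a BytesIO object. Note that clear() empties each processed element, but the emptied elements stay attached to the root, so memory use still grows slowly on very large inputs.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question, with the closing tag added.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<dblp>
<article mdate="2002-01-03" key="persons/Codd71a">
<author>E. F. Codd</author>
<title>Further Normalization of the Data Base Relational Model.</title>
<journal>IBM Research Report, San Jose, California</journal>
</article>
<article mdate="2002-01-03" key="persons/Hall74">
<author>Patrick A. V. Hall</author>
<title>Common Subexpression Identification in General Algebraic Systems.</title>
<journal>Technical Rep. UKSC 0060, IBM United Kingdom Scientific Centre</journal>
</article>
</dblp>
"""


def iter_articles(source):
    """Yield (title, authors, journals) for each <article> element.

    Hypothetical helper: 'source' is a filename or a binary file object.
    """
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        title = elem.findtext("title")
        authors = [a.text for a in elem.findall("author")]
        journals = [j.text for j in elem.findall("journal")]
        yield title, authors, journals
        # The values are already extracted, so the subtree can be dropped.
        elem.clear()


records = list(iter_articles(io.BytesIO(SAMPLE)))
for title, authors, journals in records:
    print(title)
    print("  authors:", ", ".join(authors))
    print("  journals:", ", ".join(journals))
```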

edited May 18 '13 at 14:05

answered May 18 '13 at 13:56

Martijn Pieters


Hi Pieters, I used iterparse, as well as the code you put forward, however, I got the following error:ParseError: no element found: line 21, column 0. – user2274879 May 18 '13 at 14:19
@user2274879: Then there appears to be a problem with your input XML file. Use an XML validator to check for errors and fix them before trying to parse the file with Python. – Martijn Pieters May 18 '13 at 14:20
Thanks a lot. I'll use xml validator to check for errors, however, the main xml data is extremely large. – user2274879 May 18 '13 at 16:58
@user2274879: Take the first 100 lines or so, making sure you get a complete XML document (make sure it has complete `<article>` elements and a closing `</dblp>` tag at the end). – Martijn Pieters May 18 '13 at 17:02
Markus Tresch Principles of Distributed Object Database Languages. technical Report 248, ETH Zürich, Dept. of Computer Science July 1996
– user2274879 May 20 '13 at 14:10
Does that validate in an XML validator? Does it include line 21 at the very least? – Martijn Pieters May 20 '13 at 14:36
The XML validator I used identified certain characters such as &amp (as in Zürich) as problematic to the XML parser. I had to translate words like Zürich (meaning Zurich) before parsing. I intend to modify the code in such a way that &amps will be ignored during parsing. Any ideas will be appreciated given the size of the data. – user2274879 May 20 '13 at 14:51
The `&uuml;` is an HTML entity, yes, and stands for u-umlaut: `ü`; such entities are indeed not normally legal in XML documents. You could try `resolve_entities=False`, then resolve them after the fact using [`HTMLParser.unescape()`](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string). – Martijn Pieters May 20 '13 at 14:55

score 0 · Answer 3 · answered May 18 '13 at 13:36

0

It is better not to feed the meta-information of the XML file to the parser. The parser does well as long as the tags are properly closed, but the <?xml declaration may not be recognized by it. So omit the first two lines and try again. :-)

answered May 18 '13 at 13:36

lichenbo


Hi Lichenbo, I removed the first two lines, and I still got the same error. – user2274879 May 18 '13 at 13:46
