3

I'm trying to parse a large XML file which is being received from the network in Python.

To do that, I receive the data and pass it to lxml.etree.iterparse.

However, if the XML has yet to fully be sent, like so:

<MyXML>
    <MyNode foo="bar">
    <MyNode foo="ba

If I run etree.iterparse(f, tag='MyNode').next() I get an XMLSyntaxError at the point where the data was cut off.

Is there any way to receive the first tag (i.e. the first MyNode) and only get an exception when I actually reach the truncated part of the document? (That is, to make lxml really 'stream' the contents rather than read the whole thing up front.)

KimiNewt
  • 501
  • 3
  • 14

2 Answers

2

XMLPullParser and HTMLPullParser may better suit your needs. They get their data through repeated calls to parser.feed(data). You still have to wait until all of the data has come in before the complete tree is usable.
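
As a rough illustration of the feed-based approach, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree.XMLPullParser (Python 3.4+) so that it runs without extra dependencies; lxml's etree.XMLPullParser offers a similar feed() / read_events() interface, but that substitution is an assumption to check against your lxml version. The helper name iter_complete_nodes and the sample chunks are hypothetical, and the first MyNode is written as a self-closing element so that it produces an "end" event.

```python
import xml.etree.ElementTree as ET


def iter_complete_nodes(chunks, tag="MyNode"):
    """Yield every fully received `tag` element from an iterable of byte chunks.

    Hypothetical helper: `chunks` stands in for data arriving from the network.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        # feed() accepts partial documents; close() is deliberately never
        # called here, so a stream that stops mid-element is simply left pending.
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem


# Assumed sample data: the stream is cut off in the middle of the second node.
chunks = [
    b'<MyXML>\n    <MyNode foo="bar"/>\n',
    b'    <MyNode foo="ba',
]

foos = [node.get("foo") for node in iter_complete_nodes(chunks)]
print(foos)
```

In this sketch the first node is delivered as soon as its closing markup has arrived, and the truncated second node is never reported. An error should only surface if malformed data is fed or if parser.close() is called on the unfinished document.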

tdelaney
  • 73,364
  • 6
  • 83
  • 116
  • That's problematic though, as the data I'm receiving "may" never be fully received. – KimiNewt Dec 19 '14 at 19:15
  • That is a problem. As of Python 3.4, lxml has [Incremental event parsing](http://lxml.de/parsing.html#incremental-event-parsing). Maybe you could keep track of the element stack and feed closing elements to the parser when your data stream dies. It wouldn't work if you were mid-element though (` – tdelaney Dec 19 '14 at 19:42
  • I need to support Python 2.7, and I do need to support being mid-element, too. – KimiNewt Dec 20 '14 at 07:53
-1

Try to learn from the answers to two questions related to your problem, and look through other related answers as well. Your problem is very common; maybe you need to tweak it a bit so that it fits a proven solution. Prefer that approach if you want a stable solution.

Community
  • 1
  • 1
Sascha Gottfried
  • 3,303
  • 20
  • 30