I've been trying to parse some huge XML files that lxml won't grok, so I'm forced to parse them with xml.sax.
    from xml import sax

    class SpamExtractor(sax.ContentHandler):
        def startElement(self, name, attrs):
            if name == "spam":
                print("We found a spam!")
                # now what?
The problem is that I don't understand how to actually return, or better, yield, the things that this handler finds to the caller, without waiting for the entire file to be parsed. So far, I've been messing around with threading.Thread and Queue.Queue, but that leads to all kinds of issues with threads that are really distracting me from the actual problem I'm trying to solve.
I know I could run the SAX parser in a separate process, but I feel there must be a simpler way to get the data out. Is there?
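To make the goal concrete, here is a rough sketch of the kind of thing I'm after, with no threads or processes involved. It assumes Python 3 and relies on the default expat-based parser from sax.make_parser() supporting the incremental feed()/close() interface. The names iter_spam, found and chunk_size are made up for illustration, and I'm assuming the interesting data is the text inside each spam element.

```python
from xml import sax


class SpamExtractor(sax.ContentHandler):
    """Collects the text of each <spam> element in a list the caller drains."""

    def __init__(self):
        super().__init__()
        self.found = []      # completed items waiting to be yielded
        self._chunks = None  # text pieces of the <spam> being read, else None

    def startElement(self, name, attrs):
        if name == "spam":
            self._chunks = []

    def characters(self, content):
        # expat may deliver an element's text in several pieces.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "spam" and self._chunks is not None:
            self.found.append("".join(self._chunks))
            self._chunks = None


def iter_spam(stream, chunk_size=64 * 1024):
    """Yield the text of each <spam> element shortly after it is parsed."""
    handler = SpamExtractor()
    parser = sax.make_parser()  # assumed: default expat driver with feed()
    parser.setContentHandler(handler)
    while True:
        data = stream.read(chunk_size)
        if not data:
            break
        parser.feed(data)
        # Hand over whatever the handler completed during this chunk.
        yield from handler.found
        handler.found.clear()
    parser.close()
    yield from handler.found
```

Something like `for text in iter_spam(open("huge.xml", "rb")): ...` (huge.xml being a placeholder file name) would then see results one chunk at a time, so memory use is bounded by chunk_size plus whatever a single chunk produces, rather than by the size of the file.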
As to deleting nodes, I don't see where that is needed; could you explain? (Explained seconds later by larsmans.) – Gareth Latty Jan 15 '12 at 22:31