5

I am reading in hundreds of XML files and parsing them with xml.etree.ElementTree.

Quick background just fwiw: These XML files were at one point totally valid but somehow when processing them historically my process which copied/pasted them may have corrupted them. (Turns out it was a flushing issue / with statement not closing, if you care, see the good help I got on that investigation at... Python shutil copyfile - missing last few lines ).

Anyway back to the point of this question.
I would still like to read in the first 100,000 lines or so of these documents which are valid XML. The files are only missing the last 4 or 5KB of a 6MB file. As alluded to earlier, though, the file just 'cuts out'. it looks like this:

</Maintag>




<Maintag>
    <Change_type>NQ</Change_type>
    <Name>Atlas</Name>
    <Test>ATLS</Test>
    <Other>NYSE</Other>
    <Scheduled_E

where (perhaps obviously) Scheduled_E is the beginning of what should be another attribute, <.Scheduled_Event>, say. But the file gets cut short mid tag. Once again, before this point in the file, there are several thousand 'good' "Maintag" entries which I would like to read in, accepting the cutoff entry (and obviously anything that should have come after) as an unrecoverable fail.

A simple but incomplete method of dealing with this might be to simply - pre XML processing - look for the last instance of the string <./Maintag> in the file, and replace what follows (which will be broken, at some point) with the 'opening' tags. Again, this at least lets me process what is still there and valid.

If someone wants to help me out with that sort of string replacement, then fwiw the opening tags are:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
    <Source FileName="myfile">

I am hoping that even easier than that, there might be an elementtree or beautifulsoup or other way of handling this situation... I've done a decent amount of searching and nothing seems easy/obvious.

Thanks

Community
  • 1
  • 1
10mjg
  • 573
  • 1
  • 6
  • 18
  • No DOM parser is able to work on incomplete (hence incorrect) XML. A Sax parser will crash when it gets at the "cutoff point" but you can possibly use one catch the exception and implement the "homeless person's method". – bruno desthuilliers Jul 21 '15 at 15:22
  • OP's solution of "homeless person's method" seems more like a bandaid than actually fixing the problem, which IMO is how to read and parse the XML files without losing content. Otherwise, if you're just going to drop content after an arbitrary number of bytes, what's the point of parsing for meaning at all? – shoover Jul 21 '15 at 16:18
  • The easy way: `echo '>' >> myfile.xml`. – kirbyfan64sos Jul 21 '15 at 17:01
  • This page seems to have some ... competing but detailed views ... I may try one or the other of Purrell or John Machin's ideas ... http://stackoverflow.com/questions/2352840/parsing-broken-xml-with-lxml-etree-iterparse – 10mjg Jul 22 '15 at 05:59

1 Answers1

6

For dealing with unclosed elements -or token as in the title of this questioin-, I'd recommend to try lxml. lxml's XMLParser has recover option which documented as :

recover - try hard to parse through broken XML

For example, given a broken XML as follow :

from lxml import etree

xml = """
<root>
    <Maintag>
        <Change_type>NQ</Change_type>
        <Name>Atlas</Name>
        <Test>ATLS</Test>
        <Other>NYSE</Other>
        <Scheduled_E
"""
parser = etree.XMLParser(recover=True)
doc = etree.fromstring(xml, parser=parser)
print(etree.tostring(doc))

The recovered XML as printed by the above code is as follow :

<root>
    <Maintag>
        <Change_type>NQ</Change_type>
        <Name>Atlas</Name>
        <Test>ATLS</Test>
        <Other>NYSE</Other>
        <Scheduled_E/></Maintag></root>
har07
  • 88,338
  • 12
  • 84
  • 137
  • Thanks for the attention. In your example xml string, you include the 'closing' <./root> tag at the end. To be clear, I am missing all closing tags, as the file just cuts out a few KB before it should. (I had a flushing issue when creating the file =( ). (plz help my vocabulary if I could refer to those 'closing tags' as something more technical). – 10mjg Jul 22 '15 at 06:01
  • @10mjg I've just tried without `` and `lxml` still able to deal with it, answer updated accordingly – har07 Jul 22 '15 at 06:04
  • well in that case, this sounds like something that will Work. i will test this tomorrow. Thank you very much!! – 10mjg Jul 22 '15 at 06:06
  • My sense after initial testing is that this is working wonderfully for me. I just check later on via dict structures to make sure I have all the tags I need to avoid processing the broken entry. Thank you! – 10mjg Jul 22 '15 at 13:49