I am reading in hundreds of XML files and parsing them with xml.etree.ElementTree.
Quick background, fwiw: these XML files were totally valid at one point, but the process I used to copy them appears to have corrupted them. (It turned out to be a flushing issue, a with statement not closing the file, if you care; see the good help I got on that investigation at "Python shutil copyfile - missing last few lines".)
Anyway back to the point of this question.
I would still like to read in the first 100,000 lines or so of these documents, which are valid XML. Each file is only missing the last 4 or 5 KB out of roughly 6 MB. As alluded to earlier, though, the file just 'cuts out'. It looks like this:
</Maintag>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
where (perhaps obviously) Scheduled_E is the beginning of what should be another child element, <Scheduled_Event>, say. But the file gets cut short mid-tag. Once again, before this point in the file there are several thousand 'good' "Maintag" entries which I would like to read in, accepting the cut-off entry (and obviously anything that should have come after it) as an unrecoverable failure.
A simple but incomplete method of dealing with this might be, before any XML processing, to look for the last instance of the string </Maintag> in the file and replace what follows (which will be broken at some point) with the closing tags that match the file's opening tags. Again, this at least lets me process what is still there and valid.
If someone wants to help me out with that sort of string replacement, then fwiw the opening tags are:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
<Source FileName="myfile">
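For reference, a minimal sketch of that string replacement. It assumes that every Maintag entry sits directly inside Source, which in turn sits inside FirstTag, so that the closing tags to append are </Source></FirstTag>. The function name repair_truncated_xml and the sample text are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

def repair_truncated_xml(text, record_close="</Maintag>",
                         closers="</Source></FirstTag>"):
    """Cut everything after the last complete record and re-close the document.

    Assumes the nesting FirstTag > Source > Maintag (not confirmed by the files).
    """
    cut = text.rfind(record_close)
    if cut == -1:
        raise ValueError("no complete record found")
    return text[:cut + len(record_close)] + "\n" + closers

# Hypothetical truncated document, cut off mid-tag like the real files.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    '<FirstTag>\n<Source FileName="myfile">\n'
    '<Maintag>\n<Name>Atlas</Name>\n</Maintag>\n'
    '<Maintag>\n<Name>Broken</Name>\n<Scheduled_E'
)

repaired = repair_truncated_xml(sample)
# Parse as bytes so the declared ISO-8859-1 encoding is honoured.
root = ET.fromstring(repaired.encode("ISO-8859-1"))
names = [m.findtext("Name") for m in root.iter("Maintag")]
```

For a real file, the text would presumably be read with open(path, encoding="ISO-8859-1") before being passed to the function.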
I am hoping that, even easier than that, there might be an ElementTree or BeautifulSoup or other way of handling this situation... I've done a decent amount of searching and nothing seems easy/obvious.
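One ElementTree-only possibility, sketched here under assumptions rather than offered as a definitive answer: xml.etree.ElementTree.iterparse yields each element as soon as its closing tag has been read, so the complete Maintag entries can be collected before the parser hits the truncation and raises ParseError. The helper name read_complete_records and the sample bytes are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def read_complete_records(source, tag="Maintag"):
    """Collect every fully closed <tag> element; stop quietly at the truncation."""
    records = []
    try:
        for _event, elem in ET.iterparse(source, events=("end",)):
            if elem.tag == tag:
                # Flatten the record's direct children into a dict.
                records.append({child.tag: child.text for child in elem})
                elem.clear()  # release the children we no longer need
    except ET.ParseError:
        # The cut-off entry never produces an "end" event for its Maintag,
        # so it is simply dropped.
        pass
    return records

# Hypothetical truncated document, cut off mid-tag like the real files.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<FirstTag>\n<Source FileName="myfile">\n'
    b'<Maintag>\n<Name>Atlas</Name>\n<Test>ATLS</Test>\n</Maintag>\n'
    b'<Maintag>\n<Name>Cut</Name>\n<Scheduled_E'
)

records = read_complete_records(io.BytesIO(sample))
```

With a file on disk, a path or a file object opened in binary mode can be passed in place of the BytesIO. Note that this swallows any ParseError, not only the one caused by the truncated ending, so damage in the middle of a file would silently end the read early as well.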
Thanks