I have very large XML log files that auto-split at a fixed size (~200 MB). There can be many parts (usually fewer than 10). When a file splits, it doesn't do so neatly at the end of a record or even at the end of the current line; it just splits as soon as it hits the target size.
Basically I need to parse these files for record elements, then pull out the time child from each, among other things.
Since these log files split at a random location and don't necessarily have a root element, I have been using Python 3 and lxml's etree.iterparse with html=True. That handles the lack of a root node in the split files. However, I am not sure how to handle the records that end up split between the end of one file and the start of the next.
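As a sanity check, iterparse with html=True really does swallow a rootless fragment; the snippet below is just a toy example I used to convince myself of that:

from io import BytesIO
from lxml import etree

# A rootless fragment, like the tail end of a split file; html=True makes
# lxml fall back to its forgiving HTML parser instead of erroring out.
fragment = b"<record><data>5</data><time>1</time></record>"
for event, elem in etree.iterparse(BytesIO(fragment), events=("end",), tag="record", html=True):
    print(elem.findtext("time"))  # prints 1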
Here is a small sample of what the split files might look like.
FILE: test.001.txt
<records>
<record>
<data>5</data>
<time>1</time>
</record>
<record>
<data>5</data>
<time>2</time>
</record>
<record>
<data>5</data>
<ti
FILE: test.002.txt
me>3</time>
</record>
<record>
<data>6</data>
<time>4</time>
</record>
<record>
<data>6</data>
<time>5</time>
</record>
</records>
Here is what I have tried, which I know doesn't work correctly:
from lxml import etree

xmlFiles = ['test.001.txt', 'test.002.txt']

timeStamps = []
for xmlF in xmlFiles:
    # Parse each part on its own; html=True tolerates the missing root.
    for event, elem in etree.iterparse(xmlF, events=("end",), tag='record', html=True):
        tElem = elem.find('time')
        if tElem is not None:
            timeStamps.append(int(tElem.text))
Output:
In [20]: timeStamps
Out[20]: [1, 2, 4, 5]
So, is there an easy way to capture the 3rd record, which is split between files? I don't really want to merge the files ahead of time, since there can be lots of them and they are pretty large. Also, any speed or memory-management tips beyond those in Using Python Iterparse For Large XML Files would be appreciated ... I'll figure out how to apply those next. Appending to timeStamps also seems like it might be problematic, since there could be lots of entries ... but I can't really preallocate, since I have no idea how many there are ahead of time.
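One idea I've been toying with, in case it helps frame an answer, is to hand iterparse a single file-like object that reads straight through all the parts in order, so the parser never sees the file boundary and the split record comes through whole. ChainedFiles is just a name I made up for this sketch:

from array import array
from lxml import etree

class ChainedFiles:
    """Minimal file-like object that reads a list of files as one stream."""
    def __init__(self, paths):
        self._paths = iter(paths)
        self._fh = None

    def read(self, size=-1):
        # Serve bytes from the current part; roll over to the next part
        # when it runs dry, and return b'' only once every part is done.
        while True:
            if self._fh is None:
                try:
                    self._fh = open(next(self._paths), 'rb')
                except StopIteration:
                    return b''
            chunk = self._fh.read(size)
            if chunk:
                return chunk
            self._fh.close()
            self._fh = None

timeStamps = array('q')  # compact int storage instead of a plain list
stream = ChainedFiles(['test.001.txt', 'test.002.txt'])
for event, elem in etree.iterparse(stream, events=("end",), tag='record', html=True):
    tElem = elem.find('time')
    if tElem is not None:
        timeStamps.append(int(tElem.text))
    elem.clear()  # drop the element's contents to keep memory flat

The elem.clear() and array('q') bits are just my guesses at the memory side (the clear() idea comes from the question linked above); I haven't measured whether they matter at this scale.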