
I have very large XML log files that auto-split at a fixed size (~200MB). There can be many parts (usually fewer than 10). When a file splits, it doesn't do so neatly at the end of a record or even at the end of the current line; it simply splits as soon as it hits the target size.

Basically, I need to parse these files for 'record' elements and then pull out the time child from each, among other things.

Since these log files split at a random location and don't necessarily have a root, I was using Python 3 and lxml's etree.iterparse with html=True, which handles the lack of a root node in the split files. However, I am not sure how to handle the records that end up split between the end of one file and the start of the next.

Here is a small sample of what a split file might look like.

FILE: test.001.txt

<records>
<record>
    <data>5</data>
    <time>1</time>
</record>
<record>
    <data>5</data>
    <time>2</time>
</record>
<record>
    <data>5</data>
    <ti

FILE: test.002.txt

me>3</time>
</record>
<record>
    <data>6</data>
    <time>4</time>
</record>
<record>
    <data>6</data>
    <time>5</time>
</record>
</records>

Here is what I have tried, which I know doesn't work correctly:

from lxml import etree

xmlFiles = ['test.001.txt', 'test.002.txt']
timeStamps = []
for xmlF in xmlFiles:
    for event, elem in etree.iterparse(xmlF, events=("end",), tag='record', html=True):
        tElem = elem.find('time')
        if tElem is not None:
            timeStamps.append(int(tElem.text))

Output:

In[20] : timeStamps
Out[20]: [1, 2, 4, 5]

So is there an easy way to capture the 3rd record, which is split between files? I don't really want to merge the files ahead of time since there can be lots of them and they are pretty large. Also, any other speed/memory management tips besides this Using Python Iterparse For Large XML Files would be appreciated ... I'll figure out how to do that next. Appending to timeStamps seems like it might be problematic since there could be lots of them, but I can't really preallocate since I have no idea how many there are ahead of time.


1 Answer


Sure. Create a class that acts like a file (by providing a read method), but that actually takes input from multiple files while hiding this fact from the caller. Something like:

class Reader(object):
    def __init__(self):
        self.files = []

    def add(self, src):
        self.files.append(src)

    def read(self, nbytes=0):
        if not self.files:
            return bytes()

        data = bytes()
        while True:
            # Try to satisfy the request from the current file.
            data = data + self.files[0].read(nbytes - len(data))
            if len(data) == nbytes:
                break

            # The current file is exhausted; close it and move on
            # to the next one (if any).
            self.files[0].close()
            self.files.pop(0)
            if not self.files:
                break

        return data

This class maintains a list of open files. If a read request can't be satisfied by the "topmost" file, that file is closed and a read is attempted from the subsequent file. This continues until we read enough bytes or we run out of files.

Given the above, if we do this:

r = Reader()
for path in ['file1.txt', 'file2.txt']:
    r.add(open(path, 'rb'))

for event, elem in etree.iterparse(r):
    print(event, elem.tag)

We get (using your sample input):

end data
end time
end record
end data
end time
end record
end data
end time
end record
end data
end time
end record
end data
end time
end record
end records
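For completeness, here is a sketch that combines the Reader approach with the timestamp extraction from the question. The sample files are recreated inline so the snippet is self-contained, and `elem.clear()` is the usual iterparse trick for keeping memory use flat on large inputs. I drop `html=True` here because the concatenated sample stream is well-formed XML with a `<records>` root; with rootless real logs you may still want it.

```python
import os
import tempfile
from lxml import etree

# Condensed version of the Reader class from the answer.
class Reader(object):
    def __init__(self):
        self.files = []

    def add(self, src):
        self.files.append(src)

    def read(self, nbytes=0):
        data = bytes()
        while self.files:
            data += self.files[0].read(nbytes - len(data))
            if len(data) == nbytes:
                break
            # Current file exhausted; close it and fall through to the next.
            self.files[0].close()
            self.files.pop(0)
        return data

# Recreate the split sample files from the question in a temp directory.
parts = [
    b"<records>\n<record>\n    <data>5</data>\n    <time>1</time>\n</record>\n"
    b"<record>\n    <data>5</data>\n    <time>2</time>\n</record>\n"
    b"<record>\n    <data>5</data>\n    <ti",
    b"me>3</time>\n</record>\n<record>\n    <data>6</data>\n    <time>4</time>\n</record>\n"
    b"<record>\n    <data>6</data>\n    <time>5</time>\n</record>\n</records>\n",
]
tmpdir = tempfile.mkdtemp()
paths = []
for i, part in enumerate(parts, 1):
    p = os.path.join(tmpdir, "test.%03d.txt" % i)
    with open(p, "wb") as f:
        f.write(part)
    paths.append(p)

r = Reader()
for path in paths:
    r.add(open(path, "rb"))

timeStamps = []
for event, elem in etree.iterparse(r, events=("end",), tag="record"):
    tElem = elem.find("time")
    if tElem is not None:
        timeStamps.append(int(tElem.text))
    elem.clear()  # free each record as soon as we've read its time

print(timeStamps)  # → [1, 2, 3, 4, 5]
```

Because the parser sees one continuous byte stream, the record split across `test.001.txt` and `test.002.txt` is reassembled transparently and its timestamp (3) is captured along with the rest.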
  • I can't test it now but this looks exactly like what I need! – Aero Engy Jul 31 '15 at 02:57
  • Unfortunately running your example does not work for me. I get the following: `TypeError: reading file objects must return bytes objects` FYI, I am using Python3 if you didn't catch that in my original question ... not sure if that is part of the issue. – Aero Engy Jul 31 '15 at 17:22
  • Yeah, this was tested in python2. Let me take a look at python3. – larsks Jul 31 '15 at 17:23
  • The code now runs correctly under python3, by treating everything as a byte string (the `bytes` type). Note that the `open(...)` call needs the `rb` mode flag for this to work. – larsks Jul 31 '15 at 17:31
  • This is an excellent answer, however, in the intervening years, a new library has been written to accomplish exactly this: http://pypi.org/project/split-file-reader The `SplitFileReader` class can open a series of individual files and expose them, through a single object with a `read` method, as one seamless file. Disclaimer: I am the author of this module, and wrote it to solve this very problem. – Reivax Apr 30 '23 at 22:40