Efficient XML parsing for 25GB data

Question

My aim is to parse 25 GB of XML data. An example of such a data is given below:

<Document>
<Data Id='12' category='1'  Body="abc"/>
<Data Id='13' category='1'  Body="zwq"/>
.
.
<Data Id='82018030' category='2' CorrespondingCategory1Id='13' Body="pqr"/>

However..considering the data I have of "25 GB"...my approach is quite inefficient. Please suggest some way to improve my code or an alternate approach. Also kindly include a small example code to make things clearer.

possible duplicate of [Python sax to lxml for 80+GB XML](http://stackoverflow.com/questions/9809469/python-sax-to-lxml-for-80gb-xml) — Francis Avila, Apr 05 '12 at 20:03

kindall · Accepted Answer · 2012-04-05T20:02:12.230

You might find that a SAX parser works better for this task. Rather than building a DOM, a SAX parser turns the XML file into a stream of elements and calls functions you provide to let you handle each element.

The good thing is that SAX parsers can be very fast and memory-efficient compared to DOM parsers, and some don't even need to be given all the XML at once, which would be ideal when you have 25 GB of it.

Unfortunately, if you need any context information, like "I want tag <B> but only if it's inside tag <A>," you must maintain it yourself, since all the parser gives you is "start tag <A>, start tag <B>, end tag <B>, end tag <A>." It never explicitly tells you that tag <B> is inside tag <A>, you have to figure that out from what you saw. And once you have seen an element, it's gone unless you remembered it yourself.

This gets very hairy for complex parsing jobs, but yours is probably manageable.

It happens that Python's standard library has a SAX parser in xml.sax. You probably want something like xml.sax.xmlreader.IncrementalParser.

Thanks for the help. Can u please explain the concept with a small example code...that would be really kind of you...you may construct your own simple example also. — Jannat Arora, Apr 05 '12 at 20:01
If I get a chance I will. Not sure I'll have the time, though. — kindall, Apr 05 '12 at 20:08

score 0 · Answer 2 · answered Apr 05 '12 at 19:41

0

My first suggestion from looking at your question would be to use a relational database such as MySQL or sqlite. It wouldn't be hard to get your XML data into this form, and then querying on that data would be both more straightforward and faster.

answered Apr 05 '12 at 19:41

RotaJota

831
9
14

I don't want to take that approach...essentially the join of the tables also for 25GB of data takes a lot of time. Therefore..I am looking for an alternative with parsing itself – Jannat Arora Apr 05 '12 at 19:43
For the kind of query you are making here with this much data, you really need a database. If you define the indexes properly it will be *much* faster and easier than doing what you suggest. So use SAX or other incremental parser to load the data into a database, then query from the database to write out to a new XML file. – Francis Avila Apr 05 '12 at 20:07

score 0 · Answer 3 · answered Apr 05 '12 at 20:04

Your initial algorithm runs in O(n^2), which will be very slow for 25GB of data. Ideally you will get it down to O(n) or O(n log n). In the absence of any other information about the data (like whether category 1 or 2 is smaller, etc), you can do something like this (which is O(n)):

from lxml import objectify
f=open('myfile25GB', 'r')
text=f.read()
root=objectify.fromstring(text)

cat_one_bodies = {}
for e in root.attrib['Document'].row:
    category = int(e.attrib['category'])
    body = e.attrib['Body']
    if category == 1:
        e_id = int(e.attrib['Id'])
        cat_one_bodies[e_id] = body
    else: #Assuming there are only 2 categories
        cat_one_id = int(e.attrib['CorrespondingCategory1Id'])
        print "Cat1 Body: '%s' Cat2 Body: '%s'" % (body, cat_one_bodies[cat_one_id])

Although this doesn't parse your file, hopefully it shows you the idea. It potentially uses quite a bit of memory (since it's maintaining all the category1 bodies in the dictionary), so that may be a consideration.

score 0 · Answer 4 · answered Apr 05 '12 at 22:19

In XSLT 3.0 (draft) as currently implemented in Saxon-EE, you can write a streaming transformation that solves this problem as follows:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:map="http://www.w3.org/2005/xpath-functions/map">
<xsl:mode streamable="yes"/>
<xsl:template match="/">
  <xsl:iterate select="Document/Data">
    <xsl:param name="map" select="map{}"/>
    <xsl:choose>
      <xsl:when test="@category='1'">
        <xsl:next-iteration>
          <xsl:with-param name="map" select="map:put($map, string(@Id), string(@Body))"/>
        </xsl:next-iteration>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="'Cat1 Body: ', 
                              $map(@CorrespondingCategoryId), 'Cat2 Body', @Body"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:iterate>
</xsl:template>

I haven't tested this (it's late at night on the eve of a four-day holiday...) but if you are interested in pursuing this approach I'll be happy to help. XSLT 3.0 is still a draft spec and fairly fluid. Its focus is on solving problems like this one using a streaming approach that handles very large documents using bounded memory. Saxon-EE 9.4 implements a snapshot of the spec.

score 0 · Answer 5 · answered Apr 05 '12 at 22:32

If the IDs are in ascending order, then you can roll out your own function that reads an element at any position in a file. Then you can just scan the entire file and for every element you can find the corresponding one using binary search algorithm. The thing will run in O(n log n) while still using negligible amount of memory.

score 0 · Answer 6 · answered May 08 '12 at 11:53

0

Try and use iterparse from lxml. I think it will suit the problem which you wish to handle.

answered May 08 '12 at 11:53

Jannat Arora

2,759
8
44
70

Here is a similar question solved by .iterparse() http://stackoverflow.com/a/9814580/1346705 – pepr May 08 '12 at 12:15

Efficient XML parsing for 25GB data

6 Answers6