
I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing record by record with constant memory usage. Below is my main loop:

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pubFactory = PubFactory.new

      i = 0
      while dblp.read
        case dblp.name
        when 'article', 'inproceedings', 'book'
          # expand materializes the whole record subtree as a node
          pub = pubFactory.create(dblp.expand)
          i += 1
          puts pub
          pub = nil
          $stderr.puts i if i % 10000 == 0
          dblp.next   # skip past the subtree just processed
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          # ignore for now
          dblp.next
        else
          # nothing
        end
      end
    end

The key here is that dblp.expand reads an entire subtree (like an <article> record), which I then pass to a factory for further processing. Is this the right approach?

Within the factory method I then use high-level XPath-like expressions to extract the content of elements, as shown below. Again, is this viable?

    def first(root, node)
      # content of the first child matching the given XPath, or nil
      x = root.find(node).first
      x ? x.content : nil
    end

    pub.pages = first(node, 'pages')   # node contains the expanded node from dblp.expand
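
For context, here is a stripped-down sketch of what such a factory can look like (Pub and the attribute names are simplified placeholders, not my actual classes):

    class PubFactory
      # Build a publication object from one expanded record subtree.
      def create(node)
        pub = Pub.new              # placeholder for the real record class
        pub.title = first(node, 'title')
        pub.year  = first(node, 'year')
        pub.pages = first(node, 'pages')
        pub
      end
    end
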
Christian Lindig
  • Just a small follow-up: after some more testing with Ruby 1.8.7 on OS X 10.6 on x86 and Debian Linux on x86, I ran into seg faults on both machines while reading the XML file. I would guess the bug stems from libxml-ruby, but I have not tracked it down so far. Quite disappointing. – Christian Lindig Jan 12 '10 at 20:10
  • https://github.com/amolpujari/reading-huge-xml – Amol Pujari Jul 14 '12 at 06:57

3 Answers


When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:

  • Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up, go down, etc.

I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction, and are often implemented with big case ... when ... constructs.

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
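
For illustration, a minimal pull loop with REXML might look like the sketch below; it just prints the text of every <title> element, and a real extractor would also track record boundaries:

    require 'rexml/parsers/pullparser'

    # Pull style: advance an explicit cursor over parse events and keep
    # only the state you need (here: whether we are inside a <title>).
    parser = REXML::Parsers::PullParser.new(File.new('dblp.xml'))
    in_title = false
    while parser.has_next?
      event = parser.pull
      if event.start_element? && event[0] == 'title'
        in_title = true
      elsif event.text? && in_title
        puts event[0]
      elsif event.end_element? && event[0] == 'title'
        in_title = false
      end
    end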

paradigmatic
  • Thanks for the pointer. `XML::Reader` is indeed a pull parser based on a cursor that is advanced using `next` and that can read an entire sub-tree using `expand`. My code is working, except that it leaks memory, and I suspect this is caused by some basic misunderstanding about how to use it on big files. Any XML::Reader experts want to comment? – Christian Lindig Jan 04 '10 at 21:24

When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads the entire XML document into memory and can consume a large amount of it. The event-based approach uses very little additional memory, but doesn't do anything unless you write your own handler logic.

The event-based model is employed by SAX-style parsers and derivative implementations.

Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html

REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
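
For instance, a minimal event-based sketch with REXML's stream API (ArticleCounter is a made-up handler that only counts <article> records; real handler logic would accumulate per-record state):

    require 'rexml/document'
    require 'rexml/streamlistener'

    # Event style: the parser calls back into the listener for each tag;
    # nothing is retained in memory beyond the counter itself.
    class ArticleCounter
      include REXML::StreamListener
      attr_reader :count

      def initialize
        @count = 0
      end

      def tag_start(name, attrs)
        @count += 1 if name == 'article'
      end
    end

    listener = ArticleCounter.new
    File.open('dblp.xml') { |io| REXML::Document.parse_stream(io, listener) }
    puts listener.count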

tadman
  • I am aware of tree-based vs. stream-based parsing. According to the API documentation, XML::Reader parses the stream and models a cursor. The latter is advanced by `next` and `expand`. However, the documentation lacks a good example of how to use it for big files. – Christian Lindig Jan 04 '10 at 21:12
  • Examples are always a problem, yeah. I prefer tree-based parsers; they're usually much easier to use, but for instances like this you're stuck using something more SAXy. The good news is that a lot of Java code examples, which are built around the SAX method, are fairly portable to Ruby. Looks like paradigmatic has a better solution, though. – tadman Jan 05 '10 at 18:35

I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like:

    my_node = dblp.expand
    # ... do what you have to do with my_node ...
    dblp.next
    my_node.remove!

Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and you have to call Node#remove! to free it.

Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.

  • Thanks. It still does not work for me, as I run out of memory. However, reading the file in a loop calling `next` without using `expand` does work. I suspect a memory leak in the `expand` method. – Christian Lindig Mar 26 '10 at 20:36