
We've been using libxml-ruby for a couple of years. It is fantastic on files of 30 MB or less, but it is PLAGUED by segfaults. Nobody at the project really seems to care to fix them; they only blame them on third-party software. That's their prerogative, of course; it's free.

Yet I am still unable to read these large files. I suppose I could write some miserable hack to split them into smaller files, but I would like to avoid that. Does anyone else have experience reading very large XML files in Ruby?

– RubyRedGrapefruit

4 Answers


When loading big files, whether they are XML or not, you should consider taking pieces of them at a time (in this case called streaming) rather than loading the entire file into memory.

I would highly suggest reading this article about pull parsers. That technique will let you work through the file piece by piece instead of holding all of it in memory at once.
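
With libxml-ruby's pull parser, the loop looks something like this rough sketch (the filename and the record element name here are made up):

    require 'libxml'
    include LibXML

    # Placeholder names: 'huge.xml' and 'record' stand in for your data.
    reader = XML::Reader.file('huge.xml')

    while reader.read
      # Only the current node is held in memory, never the whole document.
      if reader.node_type == XML::Reader::TYPE_ELEMENT && reader.name == 'record'
        puts reader.read_string   # the text content of the current element
      end
    end

    reader.close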

– Mike Lewis

Thanks, everyone, for your excellent input. I was able to solve my problem by looking at "Processing large XML file with libxml-ruby chunk by chunk".

The answer was to avoid the use of:

reader.expand

and to instead use:

reader.read

or:

reader.next

in conjunction with:

reader.node

As long as you aren't trying to store the node as is, it works great. You want to operate on that node immediately, because reader.next will blow it away.
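
Here's a rough sketch of that pattern; the filename, element name, and attribute are placeholders:

    require 'libxml'
    include LibXML

    reader = XML::Reader.file('huge.xml')

    while reader.read
      next unless reader.node_type == XML::Reader::TYPE_ELEMENT
      next unless reader.name == 'record'
      node = reader.node
      # Use the node right away; the next read/next call frees it.
      puts node['id']   # 'id' is a made-up attribute for this sketch
    end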

To respond to an earlier answer: from what I can understand, libxml-ruby IS a streaming parser. The segfault problems arose from garbage-collection issues that were causing memory leaks galore. Once I learned not to use reader.expand, everything came up roses.

UPDATE:

I was NOT able to solve my problem after all. There appears to be NO WAY to get to the subtree without using reader.expand.

And so I guess there is no way to read and parse a large XML file with libxml-ruby? The reader.expand memory-leak bug has been open without even a response since 2009? FAIL FAIL FAIL.

– RubyRedGrapefruit

I'd recommend looking into a SAX XML parser. They're designed to handle huge files. I haven't needed one in a while, but they're pretty easy to use: as it reads the XML file, the parser passes your code various events, which you catch and handle with callbacks.

The Nokogiri site has a link to SAX Machine, which is based on Nokogiri, so that would be another option. Either way, Nokogiri is very well supported, and used by a lot of people, including me, for all the HTML and XML I parse. It supports both DOM and SAX parsing, allows use of CSS and XPath accessors, and uses libxml2 for its parsing, so it's fast and based on a standard parsing library.
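
A bare-bones Nokogiri SAX handler looks something like this sketch (the 'record' element is a made-up name):

    require 'nokogiri'

    class RecordHandler < Nokogiri::XML::SAX::Document
      # Called for every opening tag as the file streams past.
      def start_element(name, attrs = [])
        @in_record = true if name == 'record'
      end

      # Receives chunks of character data between tags.
      def characters(text)
        puts text if @in_record
      end

      def end_element(name)
        @in_record = false if name == 'record'
      end
    end

    Nokogiri::XML::SAX::Parser.new(RecordHandler.new).parse_file('huge.xml')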

– the Tin Man

libxml-ruby indeed has plenty of bugs: not just crashes, but version incompatibilities, memory leaks, etc.

I highly recommend Nokogiri. The Ruby community has rallied around it as the new hotness for fast XML parsing. It has a Reader pull parser, a SAX parser, and your standard in-memory DOM-ish parser.

For really large XML files, I'd recommend Reader, because it's as fast as SAX, but is easier to program for, because you don't have to keep track of so much state manually.
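
A minimal sketch of the Reader approach, with placeholder names for the file and element:

    require 'nokogiri'

    reader = Nokogiri::XML::Reader(File.open('huge.xml'))

    reader.each do |node|
      # Each iteration yields a lightweight cursor, not a full DOM node,
      # which is what keeps memory use flat on big files.
      if node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT && node.name == 'record'
        puts node.inner_xml   # the element's contents as a string
      end
    end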

– John Douthat
  • What is this Reader you speak of John? – RubyRedGrapefruit Mar 16 '11 at 06:33
  • sorry, I meant Nokogiri::XML::Reader, Nokogiri's pull parser http://nokogiri.org/Nokogiri/XML/Reader.html – John Douthat Mar 16 '11 at 06:39
  • I do use Nokogiri on small files, but these files are too large for Nokogiri to handle. – RubyRedGrapefruit Mar 16 '11 at 13:36
  • Nokogiri is written in C, and its XML::Reader and XML::SAX parsers stream the data, so they are appropriate for very large documents: they use very little memory and are very fast. The XML::Document parser is, of course, inappropriate for large documents. Please give XML::Reader a chance; Nokogiri is so much more fun to work with than libxml's frustration. – John Douthat Mar 16 '11 at 17:05
  • John, I see what you mean. I have implemented Nokogiri::XML::Reader, and I'm "each"ing through the nodes. My only problem is that I don't know how to coerce the Reader object to a Node object, so that I can get node.content, etc. When you run reader.each, each "node" it serves up to the block is not a Node element, but an instance of Reader! Help! – RubyRedGrapefruit Mar 16 '11 at 17:18
  • to get the content of a node, use the #inner_xml method, which returns a string – John Douthat Mar 16 '11 at 21:04
  • I'm doing that now, and I'm doing Hash.from_xml to get a hash representation of the node. Wow, is that ever SLOW. – RubyRedGrapefruit Mar 17 '11 at 15:13