
I have a very large XML file (300 MB) in the following format:

<data>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
</data>

Now I need to read it and iterate through the point nodes doing something for each. Currently I'm doing it with Nokogiri like this:

require 'nokogiri'
xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
  save_id(item.xpath("./id").text)
end

However, that's not very efficient: it parses the whole file in one go, creating a huge memory footprint (several GB).

Is there a way to do this in chunks instead? Might be called streaming if I'm not mistaken?

EDIT

The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of returning a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.

Niels Kristian
  • +1 for presenting it in a well-structured manner. – Arup Rakshit Jan 16 '14 at 14:06
  • http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ I also found Ox to be 5 times faster than Nokogiri when reading a large XML file. I also have a wrapper written which lets you search through large XML files using Ox, iterating over a specified element. https://gist.github.com/amolpujari/5966431 – Amol Pujari Mar 11 '14 at 10:32

3 Answers


Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:

Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'point' do
    for_element 'id' do puts "ID: #{inner_xml}" end
    for_element 'time' do puts "Time: #{inner_xml}" end
  end
end

Someone should make this a gem, perhaps me ;)

Mark Thomas
  • Very interesting! A gem inspired by https://github.com/pauldix/sax-machine but scoped to large-file parsing could be cool – Niels Kristian Jan 16 '14 at 15:11

Use Nokogiri::XML::SAX::Parser (event-driven parser) and Nokogiri::XML::SAX::Document:

require 'nokogiri'

class IDCollector < Nokogiri::XML::SAX::Document
  attr_reader :ids

  def initialize
    @ids = []
    @inside_id = false
  end

  def start_element(name, attrs = [])
    # NOTE: This is simplified. You need some kind of stack manipulation
    # (push in start_element / pop in end_element) to correctly pick
    # `.//data/point/id` elements.
    @inside_id = true if name == 'id'
  end

  def end_element(name)
    @inside_id = false
  end

  def cdata_block(string)
    @ids << string if @inside_id
  end
end

collector = IDCollector.new
parser = Nokogiri::XML::SAX::Parser.new(collector)
parser.parse(File.open('large_file.xml'))
p collector.ids # => ["1371308", "1371308", "1371308"]

According to the documentation,

Nokogiri::XML::SAX::Parser is a SAX style parser that reads its input as it deems necessary.

You can also use Nokogiri::XML::SAX::PushParser if you need more control over the file input.

falsetru
  • Hmm okay, that's one way - but is it really the simplest? I would have expected something a little less "complicated"... – Niels Kristian Jan 16 '14 at 14:27
  • @NielsKristian, I don't know a simpler way. I hope others come up with better solutions. – falsetru Jan 16 '14 at 14:29
  • @NielsKristian, `end_element` could be dropped if you replace the `start_element` definition with `def start_element(name, attrs); @inside_id = name == 'id' end`. – falsetru Jan 16 '14 at 14:32
  • Makes sense. I'm wondering if there are any nice gems out there that wrap this kind of functionality up in a more convenient DSL – Niels Kristian Jan 16 '14 at 14:37
  • @NielsKristian, If you don't mind using Python, see [this answer](http://stackoverflow.com/questions/7171140/using-python-iterparse-for-large-xml-files) that uses lxml. – falsetru Jan 16 '14 at 14:43

If you use JRuby, you can take advantage of vtd-xml, which has a highly efficient in-memory model, 3~5x more memory-efficient than DOM.

http://vtd-xml.sf.net

vtd-xml-author