
I have a 50+ GB XML file which I initially tried to (man)handle with Nokogiri :)

Got killed: 9 - obviously :)

Now I'm into muddy Ruby threaded waters with this stab (at it):

#!/usr/bin/env ruby

def add_vehicle index, str
  IO.write "ess_#{index}.xml", str
  #file_name = "ess_#{index}.xml"
  #fd = File.new file_name, "w"
  #fd.write str
  #fd.close
  #puts file_name
end

begin

  record = []
  threads = []
  counter = 1
  file = File.new("../ess2.xml", "r")
  while (line = file.gets)
    case line
    when /<ns:Statistik/                           
      record = []
      record << line
    when /<\/ns:Statistik/                         
      record << line
      puts "file - %s" % counter
      threads << Thread.new { add_vehicle counter, record.join }
      counter += 1
    else
      record << line
    end
  end
  file.close
  threads.each { |thr| thr.join }
rescue => err
  puts "Exception: #{err}"
  err
end

Somehow this code 'skips' one or two files when writing the result files - hmmm!?

the Tin Man
walt_die
  • Just curious. What is this file? I looked for the node and found a Danish list of motor parts. Or something. – Eric Duminil Dec 06 '16 at 21:03
  • Trying to slurp a big file will kill any language. Instead, when parsing XML you need to use a [SAX parser, which Nokogiri does implement](http://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri/XML/SAX). I'd suggest reading up on how to use that. – the Tin Man Dec 06 '16 at 23:39
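
As the last comment suggests, a SAX parser keeps memory use flat no matter how big the file is. Purely as an illustration (this sketch is not from the original post; the handler class name is made up and it only counts ns:Statistik elements rather than extracting them), Nokogiri's SAX interface can be used like this:

require 'nokogiri'

# Callbacks fire as the parser streams through the document,
# so the 50+ GB file is never loaded into memory at once.
class StatistikCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
  end

  def start_element name, attrs = []
    @count += 1 if name == 'ns:Statistik'
  end
end

handler = StatistikCounter.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('../ess2.xml'))
puts handler.count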

2 Answers


Okay, you have a problem because your file is huge, and you want to use multithreading.

Now you have two problems.

On a more serious note, I've had very good experience with this code.

It parsed 20 GB XML files with almost no memory use.

Download the mentioned code, save it as xml_parser.rb, and this script should work:

require 'nokogiri'               # the script references Nokogiri::XML::Reader directly
require_relative 'xml_parser.rb'

file = "../ess2.xml"

def add_vehicle index, str
  filename = "ess_#{index}.xml"
  File.open(filename, 'w+') { |out| out.puts str }
  puts format("%s has been written with %d lines", filename, str.each_line.count)
end

i = 0
# Nokogiri::XML::Reader streams the document node by node instead of loading it whole.
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  for_element 'ns:Statistik' do
    i += 1
    add_vehicle(i, @node.outer_xml)
  end
end

#=> ess_1.xml has been written with 102 lines
#=> ess_2.xml has been written with 102 lines
#=> ...

It will take time, but it should work without error and without using much memory.

By the way, here is the reason why your code missed some files:

threads = []
counter = 1
threads << Thread.new { puts counter }
counter += 1
threads.each { |thr| thr.join }
#=> 2 (the main thread usually increments counter before the new thread reads it)

threads = []
counter = 1
threads << Thread.new { puts counter }
sleep(1)
counter += 1
threads.each { |thr| thr.join }
#=> 1 (the sleep gives the new thread time to read counter before the increment)

counter += 1 often ran before the thread got around to calling add_vehicle, so add_vehicle was frequently called with the wrong counter. With many millions of nodes, some calls see an offset of 0 and some an offset of 1. When two add_vehicle calls end up with the same id, they overwrite each other and a file goes missing.

You have the same problem with record, with lines getting written in the wrong file.
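
For what it's worth (this is not part of the original answer), one common way to avoid that kind of race is to pass the values to Thread.new, so each thread receives its own copies as block arguments instead of reading the shared variables later:

# counter and record.join are evaluated immediately, in the main thread,
# so the later counter += 1 cannot change what this thread sees.
threads << Thread.new(counter, record.join) do |index, str|
  add_vehicle index, str
end
counter += 1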

Eric Duminil
  • hi @eric-duminil - so good of you to chime in :) do you mean to say that the counter += 1 'messes' with the counter variable before Thread.new is set up and calls add_vehicle? – walt_die Dec 06 '16 at 19:59
  • @walt_die: exactly. It's called a race condition: http://stackoverflow.com/questions/34510/what-is-a-race-condition and well, it sucks. – Eric Duminil Dec 06 '16 at 20:44
  • your code had me moving on - which is great :) I've seen race conditions _en masse_ in my career; I never actually planned to be the culprit myself :) Here's to _by reference_ perhaps your DSL could marry https://github.com/ohler55/ox and have a phenomenally fast and monumentally elegant _baby_ :D – walt_die Dec 07 '16 at 10:12

Perhaps you should try to synchronize counter += 1 with a Mutex. For example:

@lock = Mutex.new
@counter = 0

def add_vehicle str
  @lock.synchronize do
    @counter += 1
    IO.write "ess_#{@counter}.xml", str
  end
end

Mutex implements a simple semaphore that can be used to coordinate access to shared data from multiple concurrent threads.

Or you can go another way from the start and use Ox. It is way faster than Nokogiri; take a look at a comparison. For huge files, Ox::Sax is a good fit (a rough sketch follows below).
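
As an illustration only (not from the original answer; the handler name and its counting-only behaviour are assumptions), an Ox::Sax-style handler can be as small as this:

require 'ox'

class StatistikHandler
  attr_reader :count

  def initialize
    @count = 0
  end

  # Ox only calls the callbacks the handler responds to;
  # element names arrive as symbols, e.g. :'ns:Statistik'.
  def start_element name
    @count += 1 if name == :'ns:Statistik'
  end
end

handler = StatistikHandler.new
File.open('../ess2.xml') { |io| Ox.sax_parse(handler, io) }
puts handler.count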

sig