
I have a 5GB+ XML file that I want to parse into a MySQL database. I currently have a Ruby script that uses a Nokogiri SAX parser to insert every new book into the database, but this method is very slow since it inserts records one at a time. I need to figure out a way to parse the large file with multiple concurrent threads.

I was thinking I could split the file into multiple subfiles and have a separate script work on each one. Or the script could send each item to a background job for insertion into the database, perhaps using delayed_job, resque, or sidekiq.

```xml
<?xml version="1.0"?>
<library>
  <NAME>cool name</NAME>
  <book ISBN="11342343">
    <title>To Kill A Mockingbird</title>
    <description>book desc</description>
    <author>Harper Lee</author>
  </book>
  <book ISBN="989894781234">
    <title>Catcher in the Rye</title>
    <description>another description</description>
    <author>J. D. Salinger</author>
  </book>
</library>
```
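
For reference, the current approach is roughly this shape (a condensed sketch; the `Book` ActiveRecord model and the field handling are assumptions based on the XML above):

```ruby
require 'nokogiri'

# SAX handler that builds one attribute hash per <book> and inserts it
# immediately -- the per-row INSERT is where all the time goes.
class BookHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[title description author]

  def start_element(name, attrs = [])
    @current = { isbn: Hash[attrs]['ISBN'] } if name == 'book'
    @field = name
  end

  def characters(text)
    return unless @current && FIELDS.include?(@field)
    (@current[@field.to_sym] ||= '') << text
  end

  def end_element(name)
    @field = nil
    return unless name == 'book'
    Book.create!(@current) # one INSERT per book
    @current = nil
  end
end

Nokogiri::XML::SAX::Parser.new(BookHandler.new).parse(File.open('library.xml'))
```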

Does anyone have experience with this? With the current script, it'll take a year to load the database.

Jeff Locke
  • Sorry, I thought it was clear. The DB insert is the bottleneck. The SAX parser recognizes an object, then waits for the DB insert to finish before moving on to the next. Reading the XML file alone with no DB inserts does not take long. Maybe 30 min? Reading the file quickly is definitely not an issue – Jeff Locke Jan 10 '15 at 12:54
  • Try to remove index(es) before inserting. – Yevgeniy Anfilofyev Jan 10 '15 at 12:58
  • Have you tried just grouping the inserts into larger batches? `insert into books values ('title1', 'desc1'), ('title2', 'desc2'), ...;`? Doing multi-inserts should be much faster than one at a time. – Casper Jan 10 '15 at 13:41
  • Sidekiq can be run asynchronously, but if you have 25 workers you'll want to turn your connection pool up to 25 too. – Anthony Jan 10 '15 at 13:43
  • @Casper I'm using Rails' ActiveRecord so I can't do multi-inserts (afaik). Also, I am aggregating more than one data source so I will need a method that is a little more flexible. – Jeff Locke Jan 10 '15 at 14:05
  • @Anthony thanks for the tip. I haven't used sidekiq before. Is the default limit of 25 usually good enough? – Jeff Locke Jan 10 '15 at 14:06
  • You will never solve the speed issue if you don't convert the code to insert multiple records per request: http://www.techfounder.net/2009/05/14/multiple-row-operations-in-mysql-php/ . See here how to do it with ActiveRecord: http://stackoverflow.com/questions/15812377/inserting-multiple-records-at-a-time-into-model (a sketch follows these comments) – Casper Jan 10 '15 at 14:10
  • More discussion on this issue: http://kvz.io/blog/2009/03/31/improve-mysql-insert-performance/ – Casper Jan 10 '15 at 14:14
  • The Nokogiri tag is a red herring. The problem isn't Nokogiri related, so it's incidental that it's part of the code and, as a result, it probably shouldn't be mentioned in the title or the tags. – the Tin Man Jan 11 '15 at 20:42
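
To make Casper's multi-row suggestion concrete, here is a minimal sketch that batches parsed rows into single INSERT statements through ActiveRecord's raw connection (the table and column names are assumptions; the activerecord-import gem is another way to get the same effect):

```ruby
conn = ActiveRecord::Base.connection

# books is an array of hashes collected by the SAX handler instead of
# calling Book.create! per row; 500 rows per statement is a starting point.
books.each_slice(500) do |rows|
  values = rows.map do |r|
    "(#{conn.quote(r[:isbn])}, #{conn.quote(r[:title])}, " \
    "#{conn.quote(r[:description])}, #{conn.quote(r[:author])})"
  end.join(', ')

  conn.execute(
    "INSERT INTO books (isbn, title, description, author) VALUES #{values}"
  )
end
```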

1 Answer


This sounds like the perfect job for a producer/consumer queue. You only want one thread parsing the XML - but as it parses items (presumably converting them into some object type ready for insertion) it can put the converted objects onto a queue that multiple threads are reading from. Each consumer thread would just block on the queue until either the queue is "done" (i.e. the producer says there won't be any more data) or there's an item in the queue - in which case it processes it (adding the item to the database) and then goes back to waiting for data.

You'll want to experiment with how many consumer threads gives you the maximum throughput - it will depend on various considerations, mostly around how your database is configured and what your connection to it is like.

I don't know anything about threading in Ruby so I can't give you sample code, but I'm sure there must be a good standard producer/consumer queue available, and the rest should be reasonably straightforward.
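
Ruby's standard library does ship a thread-safe Queue. A minimal sketch of the pattern described above, assuming the `Book` model from the question and a hypothetical `each_book` helper that wraps the SAX parser and yields one attribute hash per book:

```ruby
require 'thread' # Queue / SizedQueue (explicit require needed on older Rubies)

queue = SizedQueue.new(1_000) # bounded, so the parser can't outrun the writers

# Consumers: block on the queue, insert, repeat. A nil is the shutdown signal.
# With N workers the ActiveRecord connection pool must allow at least N
# connections (see Anthony's comment above).
workers = Array.new(4) do
  Thread.new do
    while (book = queue.pop)
      Book.create!(book) # or buffer rows here and issue one multi-row INSERT
    end
  end
end

# Producer: the single SAX-parsing thread pushes each parsed book hash.
# each_book is a stand-in for the Nokogiri handler shown in the question.
each_book('library.xml') { |book| queue << book }

workers.size.times { queue << nil } # one sentinel per consumer
workers.each(&:join)
```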

Jon Skeet
  • Thanks Jon. That's one of the approaches I was thinking about. The ruby versions of consumer queues are typically delayed_job, resque and/or sidekiq. Just don't have much experience with them so that's why I asked here. I plan on running the job on AWS. Any chance you have general experience with running more threads than cores? i.e. Is it normal to run 10 threads on a 4 core instance? – Jeff Locke Jan 10 '15 at 12:57
  • @Jeff: If the threads are IO bound (and you can't avoid that with asynchronous calls, for example) then yes, it makes perfect sense to have more threads than cores. – Jon Skeet Jan 10 '15 at 13:10
  • Ruby's [Queue](http://www.ruby-doc.org/core-2.2.0/Queue.html) documentation has a simple example of using a producer with consumers. Using that, plus the Nokogiri code and an ORM like [Sequel](http://sequel.jeremyevans.net), would make it very easy to put together a solution. – the Tin Man Jan 11 '15 at 20:38