I have a 5GB+ XML file that I want to load into a MySQL database. I currently have a Ruby script that uses a Nokogiri SAX parser to insert each new book into the database, but it is very slow because it issues one INSERT per book. I need to find a way to process the large file with multiple concurrent threads.
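For reference, a stripped-down version of my current approach looks roughly like this (the table name, columns, and connection settings are placeholders, not my real schema):

require 'nokogiri'
require 'mysql2'

# SAX handler that inserts each <book> as soon as its closing tag is seen.
class BookHandler < Nokogiri::XML::SAX::Document
  FIELDS = %w[title description author].freeze

  def initialize(client)
    @client = client
    # Prepared statement reused for every row.
    @insert = client.prepare(
      'INSERT INTO books (isbn, title, description, author) VALUES (?, ?, ?, ?)'
    )
  end

  def start_element(name, attrs = [])
    @book = { 'isbn' => Hash[attrs]['ISBN'] } if name == 'book'
    @current = name
  end

  def characters(text)
    @book[@current] = (@book[@current] || '') + text if @book && FIELDS.include?(@current)
  end

  def end_element(name)
    if name == 'book' && @book
      # One round trip to MySQL per book -- this is the bottleneck.
      @insert.execute(@book['isbn'], @book['title'], @book['description'], @book['author'])
      @book = nil
    end
    @current = nil
  end
end

client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'books_db')
Nokogiri::XML::SAX::Parser.new(BookHandler.new(client)).parse(File.open('books.xml'))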
I was thinking I could split the file into multiple subfiles and run a separate script against each one. Or I could have the script send each item to a background job that performs the insert, maybe using delayed_job, resque, or sidekiq; a rough sketch of that idea follows the sample below. The file looks like this:
<?xml version="1.0"?>
<library>
  <NAME>cool name</NAME>
  <book ISBN="11342343">
    <title>To Kill A Mockingbird</title>
    <description>book desc</description>
    <author>Harper Lee</author>
  </book>
  <book ISBN="989894781234">
    <title>Catcher in the Rye</title>
    <description>another description</description>
    <author>J. D. Salinger</author>
  </book>
</library>
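Here's roughly what I had in mind for the sidekiq route: the SAX handler accumulates parsed books and enqueues them in batches, and a worker turns each batch into a single multi-row INSERT. The worker name, batch size, and schema below are all placeholders I'm imagining, not working code from my app:

require 'sidekiq'
require 'mysql2'

# Hypothetical worker: takes a batch of parsed books (an array of hashes with
# string keys, so it survives JSON serialization) and inserts them with one
# multi-row INSERT instead of one round trip per book.
class BookInsertWorker
  include Sidekiq::Worker

  def perform(books)
    client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'books_db')
    rows = books.map do |b|
      cols = [b['isbn'], b['title'], b['description'], b['author']]
      '(' + cols.map { |v| "'#{client.escape(v.to_s)}'" }.join(', ') + ')'
    end
    client.query("INSERT INTO books (isbn, title, description, author) VALUES #{rows.join(', ')}")
  ensure
    client&.close
  end
end

# In the SAX handler's end_element, instead of @insert.execute(...):
#   @batch << @book
#   if @batch.size >= 1000           # batch size is a guess
#     BookInsertWorker.perform_async(@batch)
#     @batch = []
#   end

My hope is that batching alone would eliminate most of the per-row round trips, and running several sidekiq workers would add parallelism on top of that.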
Does anyone have experience with this? With the current script, it'll take a year to load the database.