
Should an import of 80GB of XML data into MySQL take more than 5 days to complete?

I'm currently importing an XML file that is roughly 80GB in size. The code I'm using is in this gist, and while everything is working properly, it has been running for almost 5 straight days and it's not even close to being done...

The average table size is roughly:

  • Data size: 4.5GB
  • Index size: 3.2GB
  • Avg. row length: 245
  • Number of rows: 20,000,000

Let me know if more info is needed!

Server Specs:

Note: this is a Linode VPS.

Intel Xeon L5520 - quad core - 2.27GHz, 4GB total RAM

XML Sample

https://gist.github.com/2510267

Thanks!


After researching this matter further, this seems to be about average; I found this answer, which describes ways to improve the import rate.

Nick
  • Have you tried profiling your code to see where the time is being spent? – eggyal Apr 27 '12 at 14:21
  • You might try altering the transaction log so it doesn't bog things down: http://stackoverflow.com/questions/996403/disable-transaction-log – Steve Wellens Apr 27 '12 at 14:23
  • Did you try your code on a small test to make sure it works fine? – Tamer Shlash Apr 27 '12 at 14:24
  • No I haven't, but with that said, once I get the import to run through the entire file without problems, this is something I will be doing, since I am in no rush to stop the current import from running. I'm just more interested in knowing if this is normal. – Nick Apr 27 '12 at 14:25
  • I tested this like crazy on smaller imports which were roughly 50MB in size, and it took less than 5 seconds to import. I also know it's working because I can go into MySQL and see the data continuously being imported, and watch the import in top. – Nick Apr 27 '12 at 14:30
  • If the script is working correctly and just takes too long to finish, you should profile it with inputs of different size to see where the time is spent. – Krab Apr 27 '12 at 15:20
  • Could you try posting your code and some example XML? It might help in better diagnosing your exact issue. Perhaps even machine specs. On the surface 5+ days doesn't sound suspicious, but performance could likely be improved. **EDIT** sorry, missed the gist. – matchew Apr 27 '12 at 15:25
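
For the profiling the commenters suggest, here is a minimal sketch using Python's built-in cProfile, assuming the import logic can be wrapped in a hypothetical run_import() function:

```python
import cProfile
import pstats

def run_import():
    # Hypothetical wrapper around the parse/insert loop from the gist.
    pass

# Profile the whole import and dump stats to a file for inspection.
cProfile.run("run_import()", "import.prof")
pstats.Stats("import.prof").sort_stats("cumulative").print_stats(20)
```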

3 Answers


One thing which will help a great deal is to commit less frequently, rather than once per row. I would suggest starting with one commit per several hundred rows and tuning from there.
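
A minimal sketch of the batching idea, assuming the MySQLdb (mysqlclient) driver; the connection details, the items table, and its columns are hypothetical placeholders:

```python
import MySQLdb  # assumes the MySQLdb/mysqlclient driver

BATCH_SIZE = 500  # starting point; tune from here

conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="imports")
cur = conn.cursor()

def parsed_rows():
    # Placeholder for the iterparse loop: yield one tuple per XML record.
    yield ("1", "example", "value")

pending = 0
for row in parsed_rows():
    cur.execute(
        "INSERT INTO items (id, name, value) VALUES (%s, %s, %s)",  # hypothetical table
        row,
    )
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()  # one commit per batch instead of per row
        pending = 0

conn.commit()  # flush the final partial batch
conn.close()
```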

Also, dump the existence check you're doing right now; it's greatly increasing the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) so that a duplicate INSERT automatically does the right thing.
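
For example, reusing the cursor and hypothetical items table from the sketch above (with id as the primary key), the SELECT-then-INSERT/UPDATE pair collapses into a single statement:

```python
# One round trip: insert, or update in place if the primary key already exists.
cur.execute(
    """INSERT INTO items (id, name, value)
       VALUES (%s, %s, %s)
       ON DUPLICATE KEY UPDATE name = VALUES(name), value = VALUES(value)""",
    ("1", "example", "value"),
)
```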

Finally, consider building your tool to convert from XML into a textual form suitable for use with the mysqlimport tool, and using that bulk loader instead. This will cleanly separate the time needed for XML parsing from the time needed for database ingestion, and also speed the database import itself by using tools designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses a specialized LOAD DATA INFILE extension).
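
A rough sketch of that two-stage split; the file and table names are hypothetical, and mysqlimport derives the target table from the file's base name:

```python
import csv

# Stage 1: XML -> tab-delimited text, doing all the parsing up front.
with open("items.txt", "w") as out:
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in parsed_rows():  # the same placeholder iterator as above
        writer.writerow(row)

# Stage 2, from the shell: bulk-load via LOAD DATA INFILE.
#   mysqlimport --local --fields-terminated-by='\t' --user=user -p imports items.txt
```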

Charles Duffy
  • Ah, a bug, thanks for that! The existence check should be excluded based on the import type, which was mistakenly pruned from the latest update to the tool. I will also look into the mysqlimport tool. – Nick Apr 27 '12 at 16:10
  • Marking this as accepted since it found a bug and provided a helpful hint to improve performance :) – Nick May 01 '12 at 15:22

This is (probably) unrelated to your speed problem, but I would suggest double-checking whether the behaviour of iterparse fits your logic. At the point the start event fires, the node may or may not have its text value loaded yet (depending on whether it happened to fall within the chunk of data the parser has read so far), so you can get some rather random behaviour.
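
For reference, the common way to avoid this (and to keep memory flat on an 80GB file) is to act only on 'end' events, where the element's text is guaranteed to be complete. A sketch assuming a hypothetical <item> record tag:

```python
from lxml import etree

def process(elem):
    # Hypothetical per-record handler (e.g. the INSERT logic).
    pass

for event, elem in etree.iterparse("dump.xml", events=("end",), tag="item"):
    # The element and its text are fully parsed once 'end' fires.
    process(elem)
    elem.clear()  # free this element's children
    # lxml-specific: also drop completed siblings still held by the root
    while elem.getprevious() is not None:
        del elem.getparent()[0]
```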

George

I have 3 quick suggestions to make, without seeing your code, after attempting something similar:

  1. Optimize your code for high performance; High-performance XML parsing in Python with lxml is a great article to look at.
  2. Look into PyPy.
  3. Rewrite your code to take advantage of multiple CPUs, which Python will not do natively (see the sketch below).

Doing these things greatly improved the speed of a similar project I worked on. Perhaps if you had posted some code and example XML, I could offer a more in-depth solution. (Edit: sorry, missed the gist...)
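
On point 3, a minimal sketch of one way to use multiple cores: a single sequential reader feeding a process pool. The <item> tag and column names are hypothetical:

```python
from multiprocessing import Pool
from lxml import etree

def handle_batch(batch):
    # Hypothetical worker: load one batch of (id, name, value) tuples.
    # Each process should open its own MySQL connection here; connections
    # cannot be shared across processes.
    pass

def read_batches(path, size=500):
    # Single sequential reader: iterparse itself can't be parallelised,
    # so only the per-batch work is fanned out to the pool.
    batch = []
    for _, elem in etree.iterparse(path, events=("end",), tag="item"):
        batch.append((elem.get("id"), elem.findtext("name"), elem.findtext("value")))
        elem.clear()
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

if __name__ == "__main__":
    pool = Pool(processes=4)  # matches the quad-core L5520
    pool.map(handle_batch, read_batches("dump.xml"))
    pool.close()
    pool.join()
```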

matchew
  • The code is posted in a gist, which I linked in the second sentence of the question. I have already read article #1, and the system is running on a quad core which is distributing the mysql/python load fairly evenly already, so would rewriting to take advantage of multiple cores really help? – Nick Apr 27 '12 at 15:27