
I have a problem with the processing time of large XML files. By large, I mean 600 MB on average. Currently, it takes about 50-60 minutes to parse one file and insert the data into a database. I would like to ask for suggestions on how I can improve the processing time, ideally getting it down to around 20 minutes.

With the current speed it will take me about 2.5 months to populate the database with the content from the XML, because I have 3000+ XML files averaging 600 MB each. My PHP script runs from the command line through a cron job.

I have also read other questions, like the one below, but I have not found a solution yet: What is the fastest XML parser in PHP?

I see that some people have parsed files of up to 2 GB. I wonder how long their processing times are.

I hope you guys could lend your help. It would be much appreciated. Thanks.

I have this code:

$handler = $this;
$parser = xml_parser_create('UTF-8');
xml_set_object($parser, $handler);
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "cdata");

$fp = fopen($xmlfile, 'r');

while (!feof($fp)) {
    $data = fread($fp, 71680);
    // feed each chunk to the parser; the last call flags the end of the data
    xml_parse($parser, $data, feof($fp))
        or die(sprintf('XML ERROR: %s at line %d',
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)));
}
fclose($fp);
xml_parser_free($parser);

I first put the parsed data into a temporary array. My MySQL INSERT commands are inside the endElement function; a specific closing tag triggers the insert into the database.
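
To illustrate, the handlers are roughly shaped like this (a simplified sketch; the real tag, table and column names are different):

// These are methods on the class that owns the parser ($handler = $this above).
function startElement($parser, $name, $attrs) {
    $this->currentTag = $name;
}

function cdata($parser, $data) {
    // accumulate character data for the element that is currently open
    if (!isset($this->row[$this->currentTag])) {
        $this->row[$this->currentTag] = '';
    }
    $this->row[$this->currentTag] .= $data;
}

function endElement($parser, $name) {
    // a specific closing tag marks the end of one record
    if ($name == 'record') {              // 'record' is a placeholder tag name
        $this->insertRow($this->row);     // insertRow() stands in for my MySQL INSERT code
        $this->row = array();
    }
}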

Thanks for the response....

Jan Mark
  • What is your current code?! Without knowing that it's hard to suggest improvements. – deceze Aug 16 '12 at 09:07
  • Ok, here is some advice from an inexperienced guy: "have you tried regular expressions?" I would like to know if regex is optimal for this task, _easiness of coding_ aside. Any experts? – Prasanth Aug 16 '12 at 09:20
  • @goldenparrot http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html :D – Ezequiel Muns Aug 16 '12 at 09:21
  • @EzequielMuns Heyyy, you can always regex clean, strictly formatted xml! That's what the 2nd answer said there. Right? 800+ upvotes? Btw, nice read :) – Prasanth Aug 16 '12 at 09:33
  • I think regex will not work since it is a large file. I have to chunk the file or stream the content, since it will generate a memory error if it is loaded into memory all at once. – Jan Mark Aug 16 '12 at 09:59
  • @goldenparrot RegExp could be useful here if Jan only needs a certain part of the XML. Without more info on the nature of the job it's hard to say. – cleong Aug 16 '12 at 10:39

3 Answers


Without seeing any code, the very first thing I have to suggest is NOT to use either DOM or SimpleXMLElement as these load the whole thing into memory.

You need to use a stream parser like XMLReader.
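
For illustration, a minimal XMLReader version could look something like this (a sketch only; it assumes the records you care about sit in <item> elements, so adjust the element name to your schema):

$reader = new XMLReader();
$reader->open($xmlfile);                 // streams from disk instead of loading the whole file

$doc = new DOMDocument();
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
        // expand only this one record into a small SimpleXML fragment
        $record = simplexml_import_dom($doc->importNode($reader->expand(), true));
        // ... build and run your INSERT from $record here ...
    }
}

$reader->close();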


EDIT:

Since you are already using a stream parser, you aren't going to get huge gains from changing parsers. (I honestly don't know the difference in speed between XML Parser and XMLReader; since the latter uses libxml it may be faster, but probably not by enough to be worth it.)

Next thing to look at is whether you're doing anything silly in your code; for that we'd need to see a more substantial overview of how you've implemented this.

You say you are putting data in a temporary array and calling MySQL insert once you reach a closing tag. Are you using prepared statements? Are you using transactions to do multiple inserts in bulk?
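
If not, the pattern with PDO is roughly this (a sketch; the table, columns and batch size are made up):

$pdo = new PDO('mysql:host=localhost;dbname=mydb', $user, $pass);
$stmt = $pdo->prepare('INSERT INTO items (name, value) VALUES (?, ?)');   // prepare once, reuse for every row

$pdo->beginTransaction();
$count = 0;
foreach ($rows as $row) {                 // $rows = the data buffered by your endElement handler
    $stmt->execute(array($row['name'], $row['value']));
    if (++$count % 1000 == 0) {           // commit in batches rather than once per row
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();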

The right way to get at your bottleneck though is to run a profiler over your code. My favourite tool for the job is xhProf with XHGui. This will tell you what functions are running, how many times, for how long and how much memory they consume (and can display it all in a nice call-graph, very useful).

Use the instructions in that GitHub repository's README. Here's a tutorial and another useful tutorial (bear in mind this last one is for the profiler without the XHGui extensions that I linked to).
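
Getting a basic profile is only a few lines once the extension is installed (a sketch; saving and viewing the run is covered in the README/tutorials above):

xhprof_enable(XHPROF_FLAGS_CPU + XHPROF_FLAGS_MEMORY);

// ... run your parse + insert job here ...

$profilerData = xhprof_disable();
// hand $profilerData to the xhprof_lib helpers (or XHGui) to store and inspect the run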

Ezequiel Muns

You only seem to need to parse and read the data, not edit the XML. With this in mind, I would say using a SAX parser is the easiest and fastest way to do this.

SAX is an approach to parsing XML documents, not to validating them. The good thing is that you can use it with both PHP 4 and PHP 5 with no changes. In PHP 4, SAX parsing is already available on all platforms, so no separate installation is necessary.

You basically define a function to be run when a start element is found and another to be run when an end element is found (and you can also set one for character data). Then you do whatever you want with the parsed data.

Parsing XML with SAX

<?php
function start_element($parser, $element_name, $element_attrs) {
  switch ($element_name) {
    case 'KEYWORDS':
       echo '<h1>Keywords</h1><ul>';
       break;
    case 'KEYWORD':
       echo '<li>';
       break;
  }
}

function end_element($parser, $element_name) {
  switch ($element_name) {
    case 'KEYWORDS':
       echo '</ul>';
       break;
    case 'KEYWORD':
       echo '</li>';
       break;
  }
}


function character_data($parser, $data) {
  echo htmlentities($data);
}

$parser = xml_parser_create();
xml_set_element_handler($parser, 'start_element', 'end_element');
xml_set_character_data_handler($parser, 'character_data');

$fp = fopen('keyword-data.xml', 'r')
    or die ("Cannot open keyword-data.xml!");


while ($data = fread($fp, 4096)) {
  xml_parse($parser, $data, feof($fp)) or
   die(sprintf('XML ERROR: %s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)));
}


fclose($fp);
xml_parser_free($parser);
?>

Source: I have worked on parsing and processing large amounts of XML data. EDIT: Added a better example.

EDIT: Well, apparently you are already using a SAX parser. As long as you are actually processing the file in an event-driven way (not adding any extra overhead), you should be at top performance in this department. I would say there is nothing you can do to increase the parsing performance. If you are having performance issues, I would suggest looking at what you are doing in your code to find performance bottlenecks (try using a PHP profiler like this one). If you post your code here, we could give it a look! Cheers!

Raul Martins
  • I think my code is almost the same as yours. Using your suggested code, how long does it take to process a 600MB file? By the way, I thought I was using XMLReader. How do I know whether I'm using SAX or XMLReader? Thanks. – Jan Mark Aug 16 '12 at 11:35
  • In PHP, `XML Parser` (http://www.php.net/manual/en/book.xml.php) is an implementation of SAX, an event-driven parser with a procedural interface. `XML Reader` (http://www.php.net/manual/en/book.xmlreader.php) is a pull parser with an Object Oriented-ish interface. You are using the former. – Ezequiel Muns Aug 17 '12 at 01:08

I have spent the last day or so tackling the same problem. I noticed that limiting the number of INSERT queries reduced the processing time quite significantly. You might have already done this, but try collecting a batch of parsed data into a suitable data structure (I am using a simple array, but maybe a more suitable data structure could further reduce the cost?). Once you have collected X sets, insert the data in one go (INSERT INTO table_name (field_name) VALUES (set_1), (set_2), ..., (set_n)).
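
A rough sketch of that batching with mysqli (placeholder table and column names; the batch size of 500 is arbitrary):

// inside your endElement handler, instead of inserting immediately:
$batch[] = "('" . $mysqli->real_escape_string($keyword) . "')";

if (count($batch) >= 500) {               // flush every 500 records
    $mysqli->query('INSERT INTO keywords (keyword) VALUES ' . implode(',', $batch));
    $batch = array();
}

// after parsing has finished, flush whatever is left over
if (count($batch) > 0) {
    $mysqli->query('INSERT INTO keywords (keyword) VALUES ' . implode(',', $batch));
}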

Hope this helps anyone who might stumble upon this page. I am still working out other bottlenecks; if I find something new I will post it here.

captainspi
  • Thanks. I have a concern about batch inserts in terms of duplicates: I need to check for them before the insert queries. Any idea on this? I still experience the long processing time. – Jan Mark Oct 05 '12 at 03:52
  • Comparisons are very expensive; they will slow your script down further. I would use UNIQUE indexes and handle duplicates in the back end by applying those constraints to the right columns. Suppress errors for each insert query by preceding the call with an @ sign, for example @mysqli_query(). This way your insertion works seamlessly and all duplicates are left out, with the errors suppressed so they don't break your script. – captainspi Oct 05 '12 at 17:43