
I have a large XML file (400 MB) that I need to update daily. For every main element I run a SELECT plus an INSERT/UPDATE query against the database. When I run the script it processes 26 main elements per minute at first, but it keeps slowing down; after 500 main elements it is much slower (10 elements per minute).

    $xml_reader = new XMLReader;
    $xml_reader->open("feed.xml");

    // move the pointer to the first product
    while ($xml_reader->read() && $xml_reader->name != 'SHOPITEM');

    // loop through the products
    while ($xml_reader->name == 'SHOPITEM') {
        // load the current XML element into SimpleXML
        $feed = simplexml_load_string($xml_reader->readOuterXML());

        // now you can use the SimpleXML object,
        // e.g. $feed->PRODUCTNO

        // SELECT, UPDATE/INSERT HERE

        // move the pointer to the next product
        $xml_reader->next('SHOPITEM');
    }

    // don't forget to close the file
    $xml_reader->close();

This is the XML:

<?xml version="1.0" encoding="utf-8"?>
<SHOP>
    <SHOPITEM> 
        <ITEM_ID>2600000394161</ITEM_ID> 
        (+ 15 more elements like this) 
        <PARAM>
            <PARAM_NAME><![CDATA[some data here]]></PARAM_NAME> 
            <VAL><![CDATA[some data here]]></VAL> 
        </PARAM> 
        (+ 10 more elements like this) 
    </SHOPITEM> 
    (lot of shopitems here) 
</SHOP>

I can't load the whole file with SimpleXML because of my RAM. Is there a faster PHP XML parser, or how do big sites (e.g. price comparison sites) handle this? Better hardware? My CPU is at 10% and RAM at 80% during the XML processing.

michi
JackDavis
  • If your XML has a fixed structure, it can only be faster to parse it manually. XML parsers are made to be multipurpose; they'll use more CPU and RAM than you actually need. Can you post an example of your XML structure? – Mouradif Jul 29 '15 at 13:57
  • sounds like you should use another data store than a single XML file. at least split it into many files, or more likely, just use a DB. – hoijui Jul 29 '15 at 13:58
  • @KiJéy Hello, thank you for your comment. Structure is simple: `<SHOP><SHOPITEM><ITEM_ID>2600000394161</ITEM_ID> (+ 15 more elements like this) <PARAM><PARAM_NAME><![CDATA[some data here]]></PARAM_NAME><VAL><![CDATA[some data here]]></VAL></PARAM> (+ 10 more elements like this)</SHOPITEM> (lot of shopitems here)</SHOP>` – JackDavis Jul 29 '15 at 14:07
  • 1
    if you rewrite the XML to a SQLite db, it will go MUCH faster! food for thought... – hanshenrik Jul 29 '15 at 17:08
  • 1
    i had a problem with 2GB+ XML file that took wayy too much time to edit.. which i had to do daily. when it was re-written as a SQLite file instead, it went from several minutes, to mere seconds! (and the save file went from 2GB+, to 10MB or so, amazing) – hanshenrik Jul 29 '15 at 17:08
  • 1
    also, i doubt your CPU is on 10% for XML processing.. its probably "100% (minus harddrive IO) of 1 core", and you have 8+ (logical) CPU cores ;) – hanshenrik Jul 29 '15 at 17:17
  • @hanshenrik, thank you for your idea! Converting into SQL query is the way I will use. – JackDavis Jul 29 '15 at 18:15
  • @JackDavis ok, tips: use PDO, and use ->prepare() and ->bindParam() and beginTransaction(), SQLite is MUCH faster under beginTransaction than without! also you might wanna take a look at this http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite - also for max performance in PHP, use bindParam instead of bindValue/execute(array()) etc – hanshenrik Jul 29 '15 at 20:29
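The tips in the last comment (PDO, `prepare()`, `bindParam()`, `beginTransaction()`) can be sketched roughly as below. The `items` table, its columns, and the `$rows` array are made-up examples; in the real script the rows would come from the XMLReader loop, and a file path such as `sqlite:feed.sqlite` would replace the in-memory database.

```php
<?php
// Sketch of hanshenrik's tips: prepared statement + bindParam, all inserts
// wrapped in a single transaction (the transaction is the big win in SQLite).
$rows = [
    ['2600000394161', 'Example product A'],   // made-up sample data
    ['2600000394162', 'Example product B'],
];

$db = new PDO('sqlite::memory:');             // use sqlite:feed.sqlite to persist
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS items (item_id TEXT PRIMARY KEY, name TEXT)');

// bindParam binds by reference, so reassigning $id/$name before each
// execute() is enough - no re-binding needed inside the loop
$stmt = $db->prepare('INSERT OR REPLACE INTO items (item_id, name) VALUES (:id, :name)');
$stmt->bindParam(':id', $id);
$stmt->bindParam(':name', $name);

$db->beginTransaction();
foreach ($rows as [$id, $name]) {
    $stmt->execute();
}
$db->commit();
```
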

3 Answers

1

You can speed up the XML parsing a little by expanding the node from XMLReader into a DOM element node; the element node can then be converted into a SimpleXMLElement.

$xml_reader = new XMLReader;
$xml_reader->open("feed.xml");

$dom = new DOMDocument();
$xpath = new DOMXPath($dom);

// move the pointer to the first product
while ($xml_reader->read() && $xml_reader->name != 'SHOPITEM');

// loop through the products
while ($xml_reader->name == 'SHOPITEM') {
  // expand to DOM element
  $element = $xml_reader->expand($dom);

  // use Xpath with the $element as context node
  $itemId = $xpath->evaluate('string(ITEM_ID)', $element);

  // or import it into a SimpleXMLElement
  $item = simplexml_import_dom($element);

  //SELECT, UPDATE/INSERT HERE

  // move the pointer to the next product
  $xml_reader->next('SHOPITEM');
}
// don’t forget to close the file
$xml_reader->close();

To speed up the process you should collect the data and do mass inserts. Most DBMSs support this in one way or another. Fewer SQL queries mean less work for the database, but you need memory to collect the data; you will have to find a balance.
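A minimal sketch of that batching idea, assuming a hypothetical `items` table and a `$products` array standing in for the XMLReader loop:

```php
<?php
// Collect rows and send one multi-row statement per batch instead of one
// query per SHOPITEM. Table/column names here are assumptions.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE items (item_id TEXT PRIMARY KEY, name TEXT)');

$products = [                                   // stand-in for parsed SHOPITEMs
    ['id' => '1001', 'name' => 'Widget'],
    ['id' => '1002', 'name' => 'Gadget'],
    ['id' => '1003', 'name' => 'Gizmo'],
];

function flushBatch(PDO $pdo, array $batch): void {
    if ($batch === []) {
        return;
    }
    // one "(?, ?)" group per row, e.g. "(?, ?),(?, ?)"
    $placeholders = implode(',', array_fill(0, count($batch), '(?, ?)'));
    $stmt = $pdo->prepare("REPLACE INTO items (item_id, name) VALUES $placeholders");
    $stmt->execute(array_merge(...$batch));     // flatten rows into one param list
}

$batchSize = 2;        // tune: bigger batches = fewer queries, more memory
$batch = [];
foreach ($products as $p) {                     // the real XMLReader loop goes here
    $batch[] = [$p['id'], $p['name']];
    if (count($batch) >= $batchSize) {
        flushBatch($pdo, $batch);
        $batch = [];
    }
}
flushBatch($pdo, $batch);                       // don't forget the last partial batch
```
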

Another possibility is to generate a local file containing all the SQL statements and execute it with a console client. This can be really fast, but it is a potential security risk.
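A rough sketch of that file-based approach, again with a hypothetical `items` table and sample `$products` data; note that the doubled-quote trick below covers only quote characters, so real data needs proper escaping:

```php
<?php
// Write all statements into one SQL file wrapped in a transaction,
// then run the file with the database's console client.
$products = [
    ['id' => '1001', 'name' => "O'Brien's widget"],   // made-up sample row
];

$fh = fopen('import.sql', 'w');
fwrite($fh, "BEGIN;\n");
foreach ($products as $p) {                 // the real XMLReader loop goes here
    $id   = str_replace("'", "''", $p['id']);    // minimal SQL quoting only
    $name = str_replace("'", "''", $p['name']);
    fwrite($fh, "REPLACE INTO items (item_id, name) VALUES ('$id', '$name');\n");
}
fwrite($fh, "COMMIT;\n");
fclose($fh);

// then run it outside PHP, e.g.:
//   mysql shopdb < import.sql
//   sqlite3 feed.sqlite < import.sql
```
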

ThW
1

Consider using an XML database (e.g. eXist or BaseX). At this sort of size, it will be much more efficient.

Michael Kay
0

I cannot say whether the DOMDocument methods would be quicker than SimpleXML, but to use DOMDocument you would do something like the following (untested):

$xmlfile='/path/to/feed.xml';

if( file_exists( $xmlfile ) ){
    libxml_use_internal_errors( TRUE );
    $dom = new DOMDocument('1.0','utf-8');
    $dom->validateOnParse=false;
    $dom->standalone=true;
    $dom->preserveWhiteSpace=true;
    $dom->strictErrorChecking=false;
    $dom->substituteEntities=false;
    $dom->recover=true;
    $dom->formatOutput=false;
    $dom->load( $xmlfile ); /* load() takes a file path; loadXML() expects an XML string */
    $parse_errs=serialize( libxml_get_last_error() );
    libxml_clear_errors();

    $xpath=new DOMXPath( $dom );

    $items = $dom->getElementsByTagName('SHOPITEM');

    foreach( $items as $node ){

        echo $node->nodeValue;

        /* Each node is likely to have children */
        $children=$xpath->query('PRODUCTNO',$node);
        foreach( $children as $child ){
            echo $child->nodeValue;
        }

        /*
            mysql cmds
            ----------
            select , update, insert
        */
    }

    $dom=null;
}
Professor Abronsius