
I currently have a bit of PHP code that pulls data in from an XML file and creates a SimpleXML object using $products = new SimpleXMLElement($xmlString);. I then loop over this with a for loop, inside which I set the product details for each product in the XML document. The products are then saved to a MySQL database.

While running this script, the rate at which products are added drops until the inserts eventually stop before reaching the maximum. I have tried running garbage collection at intervals, to no avail, as well as unsetting various variables, which doesn't seem to work either.

Part of the code is shown below:

<?php
$servername = "localhost";
$username = "database.database";
$password = "demwke";
$database = "databasename";
$conn = new mysqli($servername, $username, $password, $database);

$file = "large.xml";
$xmlString = file_get_contents($file);
$products = new SimpleXMLElement($xmlString);
unset($xmlString, $file);
$total = count($products->datafeed[0]);

echo 'Starting<br><br>';

for($i=0;$i<$total;$i++){
    $id = $products->datafeed->prod[$i]['id'];
    // ... the other product fields are assigned in the same way ...
    $sql = "INSERT INTO products (id, name, uid, cat, prodName, brand, desc, link, imgurl, price, subcat) VALUES ('$id', '$store', '$storeuid', '$category', '$prodName', '$brand', '$prodDesc', '$link', '$image', '$price', '$subCategory')";
}
echo '<br>Finished';
?>

The remaining PHP variables are all defined with lines similar to the $id one, but have been removed to make this easier to read.

Any ideas on what I can do or read to get this to complete? The time taken doesn't really matter to me, as long as it eventually completes.

LSerni
Adam Moseley
  • Could you explain more what you mean by "reduce in frequency until they eventually stop?" Maybe add a segment of your XML structure to illustrate? – Chris Brendel Apr 20 '15 at 18:55
  • I have another page that I use to check the total rows in the database. After the first 5 seconds it is around 4000, then after another 5 seconds about 2000 more have been added. This then reduces until it is just around 10 per second. – Adam Moseley Apr 20 '15 at 18:57
  • Probable dupe: http://stackoverflow.com/questions/18518602/stream-parse-4-gb-xml-file-in-php – Marc B Apr 20 '15 at 18:58
  • Your database is slow with inserts. You can try to pack inserts together into transactions (e.g. 500 at once), check your keyspace (e.g. see http://stackoverflow.com/q/16977898/367456), optimize through prepared statements and, more importantly, measure first whether it is the MySQL operation or the XML processing that slows down. I suspect it's the MySQL insert operation. Also study the MySQL manual on the INSERT INTO page and do some research. – hakre Apr 20 '15 at 19:52
  • @AdamMoseley check additional sentence in my answer. – hakki Apr 20 '15 at 19:58
  • @AdamMoseley Measure before you change your code in the wrong places just because of some guessing answers. The better you formulate the question, the better answers you'll get. – hakre Apr 20 '15 at 20:48

5 Answers


You could try increasing the memory limit. If that is not an option and you just need to get it done once, I would personally just chunk it up and process like 5k values at a time.

<?php
$servername = "localhost";
$username = "database.database";
$password = "demwke";
$database = "databasename";
$conn = new mysqli($servername, $username, $password, $database);

$file = "large.xml";
$xmlString = file_get_contents($file);
$products = new SimpleXMLElement($xmlString);
unset($xmlString, $file);

$total = count($products->datafeed[0]);

//get your starting value for this iteration
$start = isset($_GET['start'])?(int)$_GET['start']:0;

//determine when to stop
//process no more than 5k at a time
$step = 5000;
//where to stop, either after our step (max) or the end
$limit = min($start+$step, $total);

echo 'Starting<br><br>';

//modified loop so $i starts at our start value and stops at our $limit for this load.
for($i=$start;$i<$limit;$i++){
    $id = $products->datafeed->prod[$i]['id'];
    // ... the other product fields are assigned in the same way ...
    $sql = "INSERT INTO products (id, name, uid, cat, prodName, brand, desc, link, imgurl, price, subcat) VALUES ('$id', '$store', '$storeuid', '$category', '$prodName', '$brand', '$prodDesc', '$link', '$image', '$price', '$subCategory')";
}

if($limit >= $total){
    echo '<br>Finished';
} else {
    echo <<<HTML
<html><head>
<meta http-equiv="refresh" content="2;URL=?start={$limit}">
</head><body>
Done processing {$start} through {$limit}. Moving on to next set in 2 seconds.
</body></html>
HTML;
}
?>

As long as this isn't something a normal user (a standard visitor to your site) will be loading, there shouldn't be an issue.

Another option: have you tried properly preparing/binding your queries?
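
For example, here is a rough sketch using mysqli prepared statements, reusing the column list from the question (note that desc is a reserved word in MySQL, so it is backtick-quoted here; the bind types are a guess, with price assumed to be the only numeric column):

$stmt = $conn->prepare(
    "INSERT INTO products (id, name, uid, cat, prodName, brand, `desc`, link, imgurl, price, subcat)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
);
// bind_param() binds by reference, so binding once outside the loop is enough
$stmt->bind_param('sssssssssds',
    $id, $store, $storeuid, $category, $prodName, $brand,
    $prodDesc, $link, $image, $price, $subCategory);

foreach ($products->datafeed->prod as $prod) {
    $id = (string)$prod['id'];
    // ... assign the other bound variables from $prod as in your existing code ...
    $stmt->execute();
}
$stmt->close();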

Jonathan Kuhn

Update: never use indexes with SimpleXML unless you have very few objects. Use foreach instead:

// Before, with [index]:
for ($i=0;$i<$total;$i++) {
    $id = $products->datafeed->prod[$i]['id'];
    ...

// After, with foreach():
$i = 0;
foreach ($products->datafeed->prod as $prod) {
    $i++; // Remove if you don't actually need $i
    $id = $prod['id'];
    ...

In general, ...->node[$i] will access the array node[] and read it all up to the desired index, so that iterating the node array is not O(N), but O(N²). There is no workaround, because there is no guarantee that when you access item K, you've just accessed item K-1 (and so on recursively). foreach saves the pointer and thus works in O(N).

For the same reason, it might be advantageous to iterate over the whole array with foreach even if you only need a few known items (unless they're few and very near the beginning of the array):

    $a[0] = $products->datafeed->prod[15]['id'];
    ...
    $a[35] = $products->datafeed->prod[1293]['id'];

// After, with foreach():
$want = [ 15, ... 1293 ];
$i = 0;
foreach ($products->datafeed->prod as $prod) {
    if (!in_array($i++, $want)) {
        continue;
    }
    $a[] = $prod['id'];
}

You should first verify whether the increasing delay is caused by MySQLi or by the XML processing. Remove (comment out) the SQL query execution, and nothing else, from the loop, to verify whether the speed (granted, it will now be much higher... :-) ) remains constant or shows the same decrease.

I suspect that the XML processing is the culprit, in here:

for($i=0;$i<$total;$i++){
    $id = $products->datafeed->prod[$i]['id'];

...where you access an index which is farther and farther into a SimpleXMLObject. This might suffer from the problem of Schlemiel the Painter.

The straight answer to your question, "how do I get the loop to complete, no matter the time", is "increase memory limit and max execution time".
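
For example, something like this at the top of the script (the exact values here are only an illustration; adjust them to what your host allows):

ini_set('memory_limit', '1024M'); // or -1 for no limit, if the environment allows it
set_time_limit(0);                // 0 = no execution time limit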

To improve performance, you can use a different interface to iterate the feed object:

$i = -1;
foreach ($products->datafeed->prod as $prod) {
    $i++;
    $id = $prod['id'];
    ...
}

Experimenting

I use this small program to read a large XML and iterate its content:

// Stage 1. Create a large XML.
$xmlString = '<?xml version="1.0" encoding="UTF-8" ?>';
$xmlString .= '<content><package>';
for ($i = 0; $i < 100000; $i++) {
    $xmlString .=  "<entry><id>{$i}</id><text>The quick brown fox did what you would expect</text></entry>";
}
$xmlString .= '</package></content>';

// Stage 2. Load the XML.
$xml    = new SimpleXMLElement($xmlString);

$tick   = microtime(true);
for ($i = 0; $i < 100000; $i++) {
    $id = $xml->package->entry[$i]->id;
    if (0 === ($id % 5000)) {
        $t = microtime(true) - $tick;
        print date("H:i:s") . " id = {$id} at {$t}\n";
        $tick = microtime(true);
    }
}

After generating the XML, a loop parses it and prints how long it takes to iterate 5000 elements. To verify it is indeed the time delta, the wall-clock time is also printed; the delta should be approximately the difference between the timestamps.

21:22:35 id = 0 at 2.7894973754883E-5
21:22:35 id = 5000 at 0.38135695457458
21:22:38 id = 10000 at 2.9452259540558
21:22:44 id = 15000 at 5.7002019882202
21:22:52 id = 20000 at 8.0867099761963
21:23:02 id = 25000 at 10.477082967758
21:23:15 id = 30000 at 12.81209897995
21:23:30 id = 35000 at 15.120756149292

So that's what happens: processing the XML array goes slower and slower.

This is mostly the same program using foreach:

// Stage 1. Create a large XML.
$xmlString = '<?xml version="1.0" encoding="UTF-8" ?>';
$xmlString .= '<content><package>';
for ($i = 0; $i < 100000; $i++) {
    $xmlString .=  "<entry><id>{$i}</id><text>The quick brown fox did ENTRY {$i}.</text></entry>";
}
$xmlString .= '</package></content>';

// Stage 2. Load the XML.
$xml    = new SimpleXMLElement($xmlString);

$i      = 0;
$tick   = microtime(true);
foreach ($xml->package->entry as $data) {
    // $id = $xml->package->entry[$i]->id;
    $id = $data->id;
    $i++;
    if (0 === ($id % 5000)) {
        $t = microtime(true) - $tick;
        print date("H:i:s") . " id = {$id} at {$t} ({$data->text})\n";
        $tick = microtime(true);
    }
}

The times seem to be constant now... I say "seem" because they appear to have decreased by a factor of about ten thousand, and I have some difficulties in getting reliable measurements.

(And no, I had no idea. I probably never used indexes with large XML arrays).

21:33:42 id = 0 at 3.0994415283203E-5 (The quick brown fox did ENTRY 0.)
21:33:42 id = 5000 at 0.0065329074859619 (The quick brown fox did ENTRY 5000.)
...
21:33:42 id = 95000 at 0.0065121650695801 (The quick brown fox did ENTRY 95000.)
LSerni

Here are two issues to address:

Memory

At the moment you're reading the full file into memory with file_get_contents() and parsing it into an object structure with SimpleXML. Both actions load the complete file into memory.

A better solution is to use XMLReader:

$reader = new XMLReader;
$reader->open($file);
$dom = new DOMDocument;
$xpath = new DOMXpath($dom);

// look for the first product element
while ($reader->read() && $reader->localName !== 'product') {
  continue;
}

// while you have a product element
while ($reader->localName === 'product') {
  // expand product element to a DOM node
  $node = $reader->expand($dom);
  // use XPath to fetch values from the node
  var_dump(
    $xpath->evaluate('string(@category)', $node),
    $xpath->evaluate('string(name)', $node),
    $xpath->evaluate('number(price)', $node)
  );
  // move to the next product sibling
  $reader->next('product');
}

Performance

Working with a lot of data takes time; doing it serially takes even more.

Moving the script to the command line can take care of timeouts. It might be possible to increase the limit with set_time_limit(), too.

Another option is to optimize the inserts, collect some records and combine them to a single insert. This reduces the roundtrips/work on the database server but consumes more memory. You will have to find a balance.

INSERT INTO table 
   (field1, field2) 
VALUES 
   (value1_1, value1_2), 
   (value2_1, value2_2), ...
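
A rough PHP sketch of that batching, using the question's $products SimpleXML object and only two columns for brevity (the batch size, the field name read from $prod, and the real_escape_string() escaping are assumptions to adapt to your real schema):

$batchSize = 500;   // assumed batch size; balances memory use against round trips
$rows = [];

foreach ($products->datafeed->prod as $prod) {
    $id   = $conn->real_escape_string((string)$prod['id']);
    $name = $conn->real_escape_string((string)$prod->name);   // hypothetical field name
    $rows[] = "('{$id}', '{$name}')";

    if (count($rows) >= $batchSize) {
        $conn->query("INSERT INTO products (id, name) VALUES " . implode(',', $rows));
        $rows = [];
    }
}

if ($rows) {   // flush the last partial batch
    $conn->query("INSERT INTO products (id, name) VALUES " . implode(',', $rows));
}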

You could even write the SQL into a file and use the mysql command line tool to insert the records. This is really fast, but it has security implications because you need to use exec().

ThW

Please check the two steps below; they may help you.

1) Increase the default PHP execution time from 30 seconds to a larger value:
   ini_set('max_execution_time', 300000);

2) If that fails, try executing your code through a cron job or from the back end (command line).

I had the same issue before.

Split your large XML file into smaller files like file1, file2, file3, then process them.

You can split your XML with a text editor that can open large files. Don't waste your time splitting the file with PHP.

Edit: I found an answer for huge XML files. I think it is the best answer for this purpose: Parsing Huge XML Files in PHP

hakki
  • Depending on how complicated the XML structure is, this might not be such a simple fix, which is why I suggested just using PHP to track your position and continue where you left off on later page loads. – Jonathan Kuhn Apr 20 '15 at 19:52
  • Are you sure it's the XML file? And there's no need to cut it with a text editor; you can take a pull parser like XMLReader and process one main element after the other, in case the XML file really is too large (from the information given in the question, the XML likely is not the problem here). – hakre Apr 20 '15 at 20:47