I need to read XML files about 1 GB in size. My XML:

<products>
<product>
<categoryName>Kable i konwertery AV</categoryName>
<brandName>Belkin</brandName>
<productCode>AV10176bt1M-BLK</productCode>
<productId>5616488</productId>
<productFullName>Kabel Belkin Kabel HDMI Ultra HD High Speed 1m-AV10176bt1M-BLK</productFullName>
<productEan>0745883767465</productEan>
<productEuroPriceNetto>59.71</productEuroPriceNetto>
<productFrontendPriceNetto>258.54</productFrontendPriceNetto>
<productFastestSupplierQuantity>23</productFastestSupplierQuantity>
<deliveryEstimatedDays>2</deliveryEstimatedDays>
</product>
<product>
<categoryName>Telewizory</categoryName>
<brandName>Sony</brandName>
<productCode>KDL32WD757SAEP</productCode>
<productId>1005662</productId>
<productFullName>Telewizor Sony KDL-32WD757 SAEP</productFullName>
<productEan></productEan>
<productEuroPriceNetto>412.33</productEuroPriceNetto>
<productFrontendPriceNetto>1785.38</productFrontendPriceNetto>
<productFastestSupplierQuantity>11</productFastestSupplierQuantity>
<deliveryEstimatedDays>6</deliveryEstimatedDays>
</product>
<product>
<categoryName>Kuchnie i akcesoria</categoryName>
<brandName>Brimarex</brandName>
<productCode>1566287</productCode>
<productId>885156</productId>
<productFullName>Brimarex Drewniane owoce, Kiwi - 1566287</productFullName>
<productEan></productEan>
<productEuroPriceNetto>0.7</productEuroPriceNetto>
<productFrontendPriceNetto>3.05</productFrontendPriceNetto>
<productFastestSupplierQuantity>7</productFastestSupplierQuantity>
<deliveryEstimatedDays>3</deliveryEstimatedDays>
</product>
</products>

I am using XMLReader:

$reader = new XMLReader();
$reader->open($url);
$count = 0;

while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT)
        $nodeName = $reader->name;

    if(($reader->nodeType == XMLReader::TEXT || $reader->nodeType == XMLReader::CDATA)) {

        if ($nodeName == 'categoryName') $categoryName = $reader->value;
        if ($nodeName == 'brandName') $brandName = $reader->value;
        if ($nodeName == 'productCode') $productCode = $reader->value;
        if ($nodeName == 'productId') $productId = $reader->value;
        if ($nodeName == 'productFullName') $productFullName = $reader->value;
        if ($nodeName == 'productEan') $productEan = $reader->value;
        if ($nodeName == 'productEuroPriceNetto') $productEuroPriceNetto = $reader->value;
        if ($nodeName == 'productFastestSupplierQuantity') $productFastestSupplierQuantity = $reader->value;
        if ($nodeName == 'deliveryEstimatedDays') $deliveryEstimatedDays = $reader->value;
    }

    if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'product') {
        $count++;
    }
}
$reader->close();

Everything works fine except for one problem: when a value is missing, for example <productEan></productEan>, the output contains the value from the previous non-empty tag, and it stays that way until the next non-empty tag is read.

For example, if one product has <productEan>0745883767465</productEan> and the next two have an empty <productEan></productEan>, all three end up with the same value, 0745883767465, in the output array.

What is the right way to solve this problem? Or maybe someone has a working solution?

Peter Mortensen
K. B.
  • It may also be worth having a look at https://stackoverflow.com/questions/1835177/how-to-use-xmlreader-in-php which shows how to read in an entire product item which you can then process as a SimpleXML record ( so `$node->productEan`) – Nigel Ren Feb 21 '19 at 21:52
  • The code suggested by @Nick works fine with a small XML file, but with the large one I get an out-of-memory error, so there is still an issue. – K. B. Feb 21 '19 at 22:55
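
The approach mentioned in the first comment (reading one whole product at a time and handing it to SimpleXML) might look roughly like the sketch below. It is only an illustration under a few assumptions: the sample XML is a shortened version of the one in the question, `$eans` exists only to show the result for this small sample, and for the real 1 GB file you would call `$reader->open($url)` and process each product inside the loop rather than collecting anything.

```php
<?php
// Shortened sample in the shape of the XML from the question.
$xml = <<<XML
<products>
<product>
<brandName>Belkin</brandName>
<productEan>0745883767465</productEan>
</product>
<product>
<brandName>Sony</brandName>
<productEan></productEan>
</product>
<product>
<brandName>Brimarex</brandName>
<productEan></productEan>
</product>
</products>
XML;

$reader = new XMLReader();
$reader->XML($xml); // for the real file: $reader->open($url);

// Move forward to the first <product> element.
while ($reader->read() && $reader->name !== 'product') {
    // nothing to do
}

$doc = new DOMDocument();
$eans = []; // collected only to show the result for this small sample

while ($reader->name === 'product') {
    // Expand just this one <product> into a DOM node and wrap it in SimpleXML.
    $product = simplexml_import_dom($doc->importNode($reader->expand(), true));

    // An empty <productEan></productEan> simply casts to an empty string.
    $eans[] = (string) $product->productEan;

    // Jump to the next <product> sibling without walking its children.
    $reader->next('product');
}
$reader->close();

print_r($eans);
```

Because every product is parsed into its own small object, a value from an earlier product cannot leak into a later one, and only one product is held in memory at a time.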

3 Answers

1

Here's some code that will do what you want. It saves the value for each element when it encounters a TEXT or CDATA node, then stores it when it gets to END_ELEMENT. At that point the saved value is reset to '', so that if no value is found for an element, it gets an empty string (this could be changed to null if you prefer). It also deals with self-closing tags, for example <brandName />, with an isEmptyElement check when an ELEMENT node is found. It takes advantage of PHP's variable variables to avoid the long sequence of if ($nodeName == ...) that you have in your code, but also uses an array to store the values for each product, which I think is a better long-term solution for your problem.

$reader = new XMLReader();
$reader->xml($xml);
$count = 0;
$this_value = '';
$products = array();
while($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            // deal with self-closing tags e.g. <productEan />
            if ($reader->isEmptyElement) {
                ${$reader->name} = '';
                $products[$count][$reader->name] = '';
            }
            break;
        case XMLReader::TEXT:
        case XMLReader::CDATA:
            // save the value for storage when we get to the end of the element
            $this_value = $reader->value;
            break;
        case XMLReader::END_ELEMENT:
            if ($reader->name == 'product') {
                $count++;
                print_r(array($categoryName, $brandName, $productCode, $productId, $productFullName, $productEan, $productEuroPriceNetto, $productFrontendPriceNetto, $productFastestSupplierQuantity, $deliveryEstimatedDays));
            }
            elseif ($reader->name != 'products') {
                ${$reader->name} = $this_value;
                $products[$count][$reader->name] = $this_value;
                // set this_value to a blank string to allow for empty tags
                $this_value = '';
            }
            break;
        case XMLReader::WHITESPACE:
        case XMLReader::SIGNIFICANT_WHITESPACE:
        default:
            // nothing to do
            break;
    }
}
$reader->close();
print_r($products);

I've omitted the output as it's quite long but you can see the code in operation in this demo on 3v4l.org.

Nick
  • It worked fine for some time and then suddenly I got an error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 20480 bytes) in ... line `${$reader->name} = $this_value;`. I increased the memory limit in php.ini up to 2048M and also tried setting it in my PHP file with `ini_set('memory_limit','2048M');`, but nothing helps. Where is the problem? – K. B. Feb 21 '19 at 21:25
  • @K.B. it sounds like your input data is too large, so you will need to process the data inside the loop instead of storing it. So in the `if ($reader->name == 'product')` block you should do all the processing of the data and then (if you are using an array) throw it away by setting `$products = array();` – Nick Feb 21 '19 at 21:37
  • Yes, this XML is up to 1 GB. On my first try I used `foreach($products as $product)` outside of your script. Then I tried moving everything into the `if ($reader->name == 'product')` block as you suggested, but it does not help, or maybe I am missing something. The script works on my local server but not on the remote server. I can give you a link to this XML; maybe you can suggest a solution for this issue... – K. B. Feb 21 '19 at 22:27
  • When you used the `foreach`, did it run out of memory in the `while ($reader->read())` loop or the `foreach` loop? – Nick Feb 21 '19 at 22:29
  • I get the out-of-memory error on this line, `${$reader->name} = $this_value;`, when I use foreach. – K. B. Feb 21 '19 at 22:47
  • @K.B. so that is still within the reading loop. When you changed to process the data in the loop, did you also get rid of the `$products` array? – Nick Feb 21 '19 at 23:06
  • Thanks, I got it working. Memory usage results: `Number of items=852229 memory_get_usage() =858.8671875kb memory_get_usage(true) =2048kb memory_get_peak_usage() =859.265625kb memory_get_peak_usage(true) =2048kb`. Memory use is reduced about 100 times, and there seem to be no issues now. – K. B. Feb 21 '19 at 23:27
  • @K.B. glad to hear it. Processing that volume of data can definitely be tricky. – Nick Feb 21 '19 at 23:32
  • Storing the data to process is actually the last thing you want to do when reading a large file. The whole point of processing it one at a time is that there may be so much data you run out of memory - that's why I suggested to process the data when you get to the end of the `<product>` element. – Nigel Ren Feb 22 '19 at 07:25
  • @NigelRen you are absolutely right. I should have paid more attention to that up front but I am used to working with much larger datasets and machines. – Nick Feb 23 '19 at 01:33
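
The fix the comments above converge on (process each product as soon as its closing tag is reached, and never keep the whole list) could be sketched with a generator. This is a minimal sketch, not the code the asker ended up with: `readProducts()` is a hypothetical helper name, the sample XML is a shortened stand-in for the real feed, and the two counters only stand in for whatever per-product processing is actually needed.

```php
<?php
/**
 * Hypothetical helper: yields one product at a time as an associative
 * array, so only the current product is ever held in memory.
 */
function readProducts(XMLReader $reader): Generator
{
    $product = [];
    $value = '';
    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                $value = ''; // a new element starts with no text yet
                $isField = !in_array($reader->name, ['products', 'product'], true);
                if ($reader->isEmptyElement && $isField) {
                    // self-closing tag such as <productEan/> has no END_ELEMENT
                    $product[$reader->name] = '';
                }
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
                $value .= $reader->value;
                break;
            case XMLReader::END_ELEMENT:
                if ($reader->name === 'product') {
                    yield $product;
                    $product = []; // throw the finished product away
                } elseif ($reader->name !== 'products') {
                    $product[$reader->name] = $value;
                    $value = '';
                }
                break;
        }
    }
}

// Shortened sample; the real script would use $reader->open($url) instead.
$sample = <<<XML
<products>
<product><brandName>Belkin</brandName><productEan>0745883767465</productEan></product>
<product><brandName>Sony</brandName><productEan></productEan></product>
<product><brandName>Brimarex</brandName><productEan/></product>
</products>
XML;

$reader = new XMLReader();
$reader->XML($sample);

$count = 0;
$emptyEans = 0;
foreach (readProducts($reader) as $product) {
    // Do the real work here (insert into a database, write a CSV row, ...).
    $count++;
    if ($product['productEan'] === '') {
        $emptyEans++;
    }
}
$reader->close();

echo "products: $count, empty EANs: $emptyEans", PHP_EOL;
```

The `foreach` body sees exactly one product per iteration, so memory use should stay roughly flat regardless of how many products the file contains.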
1

If instead of using individual values, you store the values in an array of details, you can blank the array out once you have processed each element...

$reader = new XMLReader();
$reader->open($url);
$count = 0;

$data = [];
while($reader->read()) {
    // Remember the name of the element we are currently inside
    if($reader->nodeType == XMLReader::ELEMENT) {
        $nodeName = $reader->name;
    }

    // Store the text under the current element's name
    if($reader->nodeType == XMLReader::TEXT || $reader->nodeType == XMLReader::CDATA) {
        $data[$nodeName] = $reader->value;
    }

    if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'product') {
        // Process data
        echo ($data['productEan']??"Empty").PHP_EOL;
        // Reset
        $data = [];
        $count++;
    }
}
$reader->close();

which with your test data gives...

0745883767465
Empty
Empty
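
As a possible extension of this idea, the missing keys can be filled in from a list of defaults instead of checking each one with `??`. The snippet below is only a sketch: the field list is copied from the sample XML in the question, and `$data` is a hand-written stand-in for what the loop would have collected for a product whose `<productEan>` is empty.

```php
<?php
// Field names taken from the sample XML in the question; each defaults to ''.
$defaults = array_fill_keys([
    'categoryName',
    'brandName',
    'productCode',
    'productId',
    'productFullName',
    'productEan',
    'productEuroPriceNetto',
    'productFrontendPriceNetto',
    'productFastestSupplierQuantity',
    'deliveryEstimatedDays',
], '');

// Stand-in for $data as collected by the loop: productEan was empty in the
// XML, so no TEXT node was seen and the key is simply absent.
$data = [
    'brandName' => 'Sony',
    'productId' => '1005662',
];

// Values from $data override the defaults; absent fields stay ''.
$row = array_merge($defaults, $data);

var_dump($row['productEan']); // string(0) ""
```

This keeps every row in the same shape and key order, which can be convenient when writing to a CSV file or a database.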
Nigel Ren
0

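Reset all variables once each product has been processed. A variable that is not assigned for the current product keeps the value it was given for the previous one. Note that the reset has to happen at the end of a product, not on every call to read(), otherwise the values would be wiped before they can be used.

<?php
while($reader->read()) {
    //... code that fills the variables

    if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'product') {
        //... process the current product here, then reset for the next one
        $categoryName =
        $brandName =
        $productCode =
        $productId =
        $productFullName =
        $productEan =
        $productEuroPriceNetto =
        $productFastestSupplierQuantity =
        $deliveryEstimatedDays = '';
    }
}
?>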
caiovisk