I've been trying to parse a very large XML file with PHP and XMLReader, but can't seem to get the results I'm looking for. Basically, I'm searching through a ton of information, and if a <headend> contains a certain zip code, I'd like to return that bit of XML; otherwise, keep searching until one with that zip code is found. Essentially, I'll be breaking this big file down into only a few small chunks, so instead of having to look at thousands or millions of groups of information, it would be maybe tens or twenties.

Here's a bit of the XML, annotated with what I'd like to do:

//search through xml
<lineups country="USA">
//cache TX02217 as a variable
 <headend headendId="TX02217">
//cache Grande Gables at The Terrace as a variable
  <name>Grande Gables at The Terrace</name>
//cache Grande Communications as a variable
  <mso msoId="17541">Grande Communications</mso>
  <marketIds>
   <marketId type="DMA">635</marketId>
  </marketIds>
//check to see if any of the postal codes are equal to $pc variable that will be set in the php
  <postalCodes>
   <postalCode>11111</postalCode>
   <postalCode>22222</postalCode>
   <postalCode>33333</postalCode>
   <postalCode>78746</postalCode>
  </postalCodes>
//cache Austin to a variable
  <location>Austin</location>
  <lineup>
//cache all prgSvcID's to an array i.e. 20014, 10722
   <station prgSvcId="20014">
//cache all channels to an array i.e. 002, 003  
    <chan effDate="2006-01-16" tier="1">002</chan>
   </station>
   <station prgSvcId="10722">
    <chan effDate="2006-01-16" tier="1">003</chan>
   </station>
  </lineup>
  <areasServed>
   <area>
//cache community to a variable $community   
    <community>Thorndale</community>
    <county code="45331" size="D">Milam</county>
//cache state to a variable i.e. TX
    <state>TX</state>
   </area>
   <area>
    <community>Thrall</community>
    <county code="45491" size="B">Williamson</county>
    <state>TX</state>
   </area>
  </areasServed>
 </headend>

//if any of the postal codes matched $pc 
//echo back the xml from <headend> to </headend>

//if none of the postal codes matched $pc
//clear variables and move to next <headend>

 <headend>
 etc
 etc
 etc
 </headend>
 <headend>
 etc
 etc
 etc
 </headend>
 <headend>
 etc
 etc
 etc
 </headend> 
</lineups>

PHP:

<?php
$pc = "78746";
$xmlfile="myFile.xml";
$reader = new XMLReader();
$reader->open($xmlfile); 

while ($reader->read()) { 
//search to see if groups contain $pc and echo info
}

I know I'm making this harder than it should be but am a little overwhelmed trying to manipulate such a large file. Any help is appreciated.

user1129107
  • What are you actually looking for in that chunk of XML? XPath is your friend. Do you just want to see if any <postalCode> contains a predetermined value? – Matt Mar 11 '13 at 18:15
  • Sort of. If I search through this big file, and a chunk contains a predetermined zipcode, then I want to basically return that chunk. It will cut down the size of this huge file to like 2%. I will still be returning XML, but the amount I will have to reference will be drastically smaller. – user1129107 Mar 11 '13 at 18:21

2 Answers

To gain more flexibility with XMLReader, I usually write iterators that work on the XMLReader object and provide the steps I need.

These range from a simple iteration over all nodes to an iteration over elements only, optionally restricted to a specific name. Let's call the last one XMLElementIterator; it takes the reader and the element name as parameters.
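To make the idea concrete, here is a minimal sketch of what such an element iterator could look like. It is not the XMLElementIterator class from the gist; the class name NamedElementIterator and its internals are assumptions for illustration, and it assumes PHP 7.1 or later.

```php
<?php
// Hypothetical sketch, NOT the XMLElementIterator from the gist used in this answer.
// Iterates over every element with a given name; yields the reader itself.
class NamedElementIterator implements Iterator
{
    private $reader;
    private $name;
    private $index = 0;
    private $valid = false;

    public function __construct(XMLReader $reader, $name)
    {
        $this->reader = $reader;
        $this->name   = $name;
    }

    // XMLReader is forward-only, so "rewind" only seeks the first match.
    public function rewind(): void
    {
        $this->index = 0;
        $this->scan($this->reader->read());
    }

    public function valid(): bool
    {
        return $this->valid;
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        return $this->reader; // positioned on the matching start element
    }

    #[\ReturnTypeWillChange]
    public function key()
    {
        return $this->index;
    }

    // Skip the subtree of the current element, then look for the next match.
    public function next(): void
    {
        $this->index++;
        $this->scan($this->reader->next());
    }

    // $moved is the result of the cursor move that was just made.
    private function scan($moved)
    {
        while ($moved && !($this->reader->nodeType === XMLReader::ELEMENT
                && $this->reader->name === $this->name)) {
            $moved = $this->reader->read();
        }
        $this->valid = $moved;
    }
}

// Usage on a tiny inline document (sample data invented for this sketch).
$reader = new XMLReader();
$reader->XML('<lineups><headend headendId="A"/><headend headendId="B"><name>x</name></headend></lineups>');
foreach (new NamedElementIterator($reader, 'headend') as $i => $r) {
    echo $i, ': ', $r->getAttribute('headendId'), "\n";
}
```

Because the iterator hands back the reader rather than a copy of the data, memory use stays flat no matter how many elements the file contains.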

In your scenario, I would then create an iterator that returns a SimpleXMLElement for the current element and visits only the <headend> elements:

require('xmlreader-iterators.php'); // https://gist.github.com/hakre/5147685

class HeadendIterator extends XMLElementIterator {
    const ELEMENT_NAME = 'headend';

    public function __construct(XMLReader $reader) {
        parent::__construct($reader, self::ELEMENT_NAME);
    }

    /**
     * @return SimpleXMLElement
     */
    public function current() {
        return simplexml_load_string($this->reader->readOuterXml());
    }
}

Equipped with this iterator, the rest of your job is a piece of cake. First open the 10 gigabyte file; XMLReader streams it, so it is never loaded into memory as a whole:

$pc      = "78746";

$xmlfile = '../data/lineups.xml';
$reader  = new XMLReader();
$reader->open($xmlfile);

And then check if the <headend> element contains the information and if so, display the data / XML:

foreach (new HeadendIterator($reader) as $headend) {
    /* @var $headend SimpleXMLElement */
    if (!$headend->xpath("/*/postalCodes/postalCode[. = '$pc']")) {
        continue;
    }

    echo 'Found, name: ', $headend->name, "\n";
    echo "==========================================\n";
    $headend->asXML('php://stdout');
}

This does literally what you're trying to achieve: it iterates over the large document (which is memory-friendly) until it finds the element(s) you're interested in. You then work on that concrete element and its XML only; XMLReader::readOuterXml() is a fine tool here.

Example output:

Found, name: Grande Gables at The Terrace
==========================================
<?xml version="1.0"?>
<headend headendId="TX02217">
        <name>Grande Gables at The Terrace</name>
        <mso msoId="17541">Grande Communications</mso>
        <marketIds>
            <marketId type="DMA">635</marketId>
        </marketIds>
        <postalCodes>
            <postalCode>11111</postalCode>
            <postalCode>22222</postalCode>
            <postalCode>33333</postalCode>
            <postalCode>78746</postalCode>
        </postalCodes>
        <location>Austin</location>
        <lineup>
            <station prgSvcId="20014">
                <chan effDate="2006-01-16" tier="1">002</chan>
            </station>
            <station prgSvcId="10722">
                <chan effDate="2006-01-16" tier="1">003</chan>
            </station>
        </lineup>
        <areasServed>
            <area>
                <community>Thorndale</community>
                <county code="45331" size="D">Milam</county>
                <state>TX</state>
            </area>
            <area>
                <community>Thrall</community>
                <county code="45491" size="B">Williamson</county>
                <state>TX</state>
            </area>
        </areasServed>
    </headend>
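
If pulling in the iterator library is not an option, the same approach can be sketched with plain XMLReader. This is a simplified illustration, not the code from the gist: the function name findHeadendsByPostalCode and the inline sample data are assumptions. It compares the postal codes in PHP instead of interpolating $pc into an XPath expression, so quotes in $pc cannot break the query.

```php
<?php
// Hypothetical helper: return the XML of every <headend> that lists $pc.
function findHeadendsByPostalCode(XMLReader $reader, $pc)
{
    $matches = array();

    // Advance to the first <headend> element.
    while ($reader->read() && $reader->name !== 'headend') {
        // keep reading
    }

    // Visit each <headend>; next('headend') skips the current subtree.
    while ($reader->name === 'headend') {
        $xml     = $reader->readOuterXml();     // only this one chunk is in memory
        $headend = simplexml_load_string($xml);

        if ($headend !== false) {
            $codes = $headend->xpath('postalCodes/postalCode');
            foreach ($codes ? $codes : array() as $code) {
                if (trim((string) $code) === (string) $pc) {
                    $matches[] = $xml;
                    break;
                }
            }
        }

        $reader->next('headend');
    }

    return $matches;
}

// Usage with a small inline document; for a real file use $reader->open($xmlfile).
$sample = '<lineups country="USA">'
    . '<headend headendId="X1"><postalCodes><postalCode>11111</postalCode></postalCodes></headend>'
    . '<headend headendId="TX02217"><postalCodes><postalCode>78746</postalCode></postalCodes></headend>'
    . '</lineups>';

$reader = new XMLReader();
$reader->XML($sample);
foreach (findHeadendsByPostalCode($reader, '78746') as $xml) {
    echo $xml, "\n";
}
$reader->close();
```

The function collects the matching chunks in an array, which is fine when only a small share of the file matches; to keep memory flat regardless, echo each chunk inside the loop instead.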
hakre
  • I think you nailed it. This is exactly what I'm trying to do. I'm not all that familiar with PHP and having trouble following your example, however. Can you simplify it just a bit more? I'll continue to try to understand it as is if you don't have the time. Thanks for the reply! – user1129107 Mar 12 '13 at 03:08
  • I copied your example. In the main PHP file I have include('iterator.php'); However, I am getting the following error: Fatal error: Class 'XMLElementIterator' not found in iterator.php – user1129107 Mar 12 '13 at 18:44
  • How can I use just the parent `XMLElementIterator` class without creating a new class? – secondman Aug 26 '13 at 17:03
  • @VinceKronlein: An improved variant of that class is available in the github repository. Converting the inner XML to SimpleXML is already available (even with a fallback for not so compatible, older PHP/libxml versions), and you can just use the `new` keyword and pass the name of the element next to the XMLReader object. - https://github.com/hakre/XMLReaderIterator – hakre Dec 11 '13 at 18:49
Edit: Oh you want to return the parent chunk? One moment.

Here's an example that selects the postalCode elements matching a given value.

http://codepad.org/kHss4MdV

<?php

$string='<lineups country="USA">
 <headend headendId="TX02217">
  <name>Grande Gables at The Terrace</name>
  <mso msoId="17541">Grande Communications</mso>
  <marketIds>
   <marketId type="DMA">635</marketId>
  </marketIds>
  <postalCodes>
   <postalCode>11111</postalCode>
   <postalCode>22222</postalCode>
   <postalCode>33333</postalCode>
   <postalCode>78746</postalCode>
  </postalCodes>
  <location>Austin</location>
  <lineup>
   <station prgSvcId="20014">
    <chan effDate="2006-01-16" tier="1">002</chan>
   </station>
   <station prgSvcId="10722">
    <chan effDate="2006-01-16" tier="1">003</chan>
   </station>
  </lineup>
  <areasServed>
   <area>
    <community>Thorndale</community>
    <county code="45331" size="D">Milam</county>
    <state>TX</state>
   </area>
   <area>
    <community>Thrall</community>
    <county code="45491" size="B">Williamson</county>
    <state>TX</state>
   </area>
  </areasServed>
 </headend></lineups>';

$dom = new DOMDocument();
$dom->loadXML($string);

$xpath = new DOMXPath($dom);
$elements= $xpath->query('//lineups/headend/postalCodes/*[text()=78746]');

if ($elements !== false) { // DOMXPath::query() returns false on failure, not null
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }
  }
}

Outputs:

<br/>[postalCode]78746
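
To return the parent chunk rather than the matching postalCode, the XPath expression can select the <headend> itself and filter it with a predicate. The following is a sketch under stated assumptions: the function name headendsForPostalCode is made up for illustration, postal codes are assumed to consist of digits only, and the whole document is loaded with DOMDocument, so it only suits input that fits in memory.

```php
<?php
// Hypothetical helper: return the XML of each <headend> whose postal codes include $pc.
function headendsForPostalCode($xmlString, $pc)
{
    // Assumption: postal codes are digits only. Rejecting anything else
    // keeps quotes out of the XPath expression built below.
    if (!ctype_digit((string) $pc)) {
        return array();
    }

    $dom = new DOMDocument();
    if (!$dom->loadXML($xmlString)) {
        return array();
    }

    $xpath = new DOMXPath($dom);

    // Select the <headend> element, filtered by its <postalCode> children.
    $nodes = $xpath->query("/lineups/headend[postalCodes/postalCode = '$pc']");

    $result = array();
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $result[] = $dom->saveXML($node); // serializes just this subtree
        }
    }

    return $result;
}

// Usage with a shortened version of the sample data.
$string = '<lineups country="USA">'
    . '<headend headendId="TX02217"><name>Grande Gables at The Terrace</name>'
    . '<postalCodes><postalCode>11111</postalCode><postalCode>78746</postalCode></postalCodes>'
    . '</headend></lineups>';

foreach (headendsForPostalCode($string, '78746') as $chunk) {
    echo $chunk, "\n";
}
```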
Matt
  • Would it be as simple as `if(count($nodes)){ echo $string; }` instead of the foreach or is there more to it? – Matt Mar 11 '13 at 18:37
  • Because the file is so big (possibly a gig or more), I think the best way to tackle it would be node by node with XMLReader. I can't preload the file because it's so big. I don't want to print out the zip codes so much as the other information contained in the <headend>. I want to see if a chunk contains a certain zip code, and if it does, I want to echo out the entire chunk. – user1129107 Mar 11 '13 at 18:46