I am using a combination of XMLReader and SimpleXML to parse the posts in a WordPress export file. I realize this is a little out of the norm, but it's more of a backup project, so we can easily pull up one of these articles if we need it in the future. The WP site that they were on needs to come down.

The issue I am having is that some of the nodes in the XML file are empty or contain useless values (i.e. not full posts). I need to add some string length conditions, but I'm not sure how to check each one.

<?php

$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';

$reader = new XMLReader();
$reader->open($path_to_xml_file);
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        //echo $xml->title; //or whatever

        // Take care of the articles
        $newcontent = $xml->children('http://purl.org/rss/1.0/modules/content/');
        $contentString = $newcontent->encoded;
        $titleString = $xml->title;

        echo '
    <div class="article-container" id="article-' .  $xml->title . '">
    <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $xml->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
    </div>';
    }
}

?>

I was able to check this successfully with SimpleXML alone, but it was too much of a memory hog by itself. This was my SimpleXML code:

<?php

$url = 'wordpress.2011.xml.gz';
$xml = new SimpleXMLElement("compress.zlib://$url", NULL, TRUE);

foreach ($xml->item as $item) {
    $newcontent = $item->children('http://purl.org/rss/1.0/modules/content/');
    $contentString = $newcontent->encoded;
    $titleString = $item->title;

    if ((strlen($contentString) < 13) || (strlen($titleString) < 5)) {
        echo '';
    } else {
        echo '
    <div class="article-container" id="article-' .  $item->title . '">
    <a href="#top" class="top-link">Back to the Top</a>
        <h2>' .  $item->title . '</h2>
        <div class="articles">' . $newcontent->encoded . '</div>
    </div>';
    }
}

?>

UPDATE

With Francis' help, it is working now. Here is the code:

<?php 

$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';

$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $doc = new DOMDocument('1.0','UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4)  {
            // Be careful with your output escaping!
            // This below looks like it might be wrong:
            // - $titleString for an ID (use slug)
            // - $titleString not escaped
            // - $contentString should be escaped? not sure here.
            // Have you considered using XMLWriter()?
            echo '
<div class="article-container" id="article-' .  $titleString . '">
    <a href="#top" class="top-link">Back to the Top</a>
    <h2>' .  $titleString . '</h2>
    <div class="articles">' . $contentString . '</div>
</div>';
        } else {

        echo'';

        }

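        $reader->next(); //skip the subtrees, go to next item sibling
        // we already expand()ed this so we don't need to walk it.
        // Caveat: next() already lands on whatever follows </item>, and the
        // loop's read() then advances again, so an <item> that directly follows
        // another with no whitespace in between would be skipped.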
    }
}

?>
Batfan

1 Answer

When you write $contentString = $newcontent->encoded, $contentString is not a string but a SimpleXMLElement, so strlen() is not guaranteed to measure the text you expect.

You need to explicitly cast SimpleXMLElements to string to get the text value of the element:

$contentString = (string) $newcontent->encoded;
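
To make the cast concrete, here is a minimal sketch run against a small inline feed. The two-item XML, the $kept array and the $raw variable are made up for illustration; only the namespace URI and the length thresholds come from your code.

```php
<?php
// Hypothetical two-item feed standing in for the WordPress export.
$feed = <<<XML
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>A real post title</title>
      <content:encoded>Body text long enough to keep.</content:encoded>
    </item>
    <item>
      <title>Hi</title>
      <content:encoded></content:encoded>
    </item>
  </channel>
</rss>
XML;

$contentNS = 'http://purl.org/rss/1.0/modules/content/';
$kept = array();

foreach (simplexml_load_string($feed)->channel->item as $item) {
    // Without a cast this is still a SimpleXMLElement, not a string.
    $raw = $item->children($contentNS)->encoded;

    // Cast once, then work with plain strings from here on.
    $titleString   = (string) $item->title;
    $contentString = (string) $raw;

    if (strlen($contentString) > 12 && strlen($titleString) > 4) {
        $kept[] = $titleString;
    }
}

print_r($kept);
```

Only the first item passes both length checks; the second is dropped because its title and content are too short.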

As an aside, you can simplify your DOM expansion and conversion to SimpleXMLElement by using the optional argument to XMLReader::expand():

$sxe = simplexml_import_dom($reader->expand(new DOMDocument('1.0','UTF-8')));
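
If you are not sure whether your PHP version honours that argument, a small wrapper can fall back to importNode(). This is only a sketch: expand_to_simplexml() is a hypothetical helper name, and the fallback assumes that when the argument is ignored the expanded node comes back with no owner document.

```php
<?php
// Hypothetical helper: convert the element the reader is positioned on
// into a SimpleXMLElement, with or without support for expand($doc).
function expand_to_simplexml(XMLReader $reader)
{
    $doc  = new DOMDocument('1.0', 'UTF-8');
    $node = $reader->expand($doc);
    if ($node === false) {
        return null; // expand() failed, nothing to convert
    }
    if ($node->ownerDocument === null) {
        // Assumption: the argument was ignored, so attach the node to
        // $doc ourselves, exactly as the importNode() version does.
        $node = $doc->importNode($node, true);
    }
    return simplexml_import_dom($node);
}

// Usage on a tiny inline document (made up for illustration).
$reader = new XMLReader();
$reader->XML('<channel><item><title>First post</title></item>'
           . '<item><title>Second post</title></item></channel>');

$titles = array();
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
        $titles[] = (string) expand_to_simplexml($reader)->title;
    }
}

print_r($titles);
```

The usage loop deliberately does not call next(), so read() simply walks through each expanded subtree and both adjacent items are visited.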

EDIT: here is a complete example of your first code block, written to do what you want (I think?). As you can see, all I did was take the checks from the loop in your second code example and put them in the inner loop of your first code example.

$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while($reader->read()) {
    if($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $xml = simplexml_import_dom($reader->expand(new DOMDocument('1.0', 'UTF-8')));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4)  {
            // Be careful with your output escaping!
            // This below looks like it might be wrong:
            // - $titleString for an ID (use slug)
            // - $titleString not escaped
            // - $contentString should be escaped? not sure here.
            // Have you considered using XMLWriter()?
            echo '
<div class="article-container" id="article-' .  $titleString . '">
    <a href="#top" class="top-link">Back to the Top</a>
    <h2>' .  $titleString . '</h2>
    <div class="articles">' . $contentString . '</div>
</div>';
        }
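        $reader->next(); //skip the subtrees, go to next item sibling
        // we already expand()ed this so we don't need to walk it.
        // Caveat: next() already lands on whatever follows </item>, and the
        // loop's read() then advances again, so an <item> that directly follows
        // another with no whitespace in between would be skipped.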
    }
}
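
The escaping concerns in the comments above could be handled roughly as follows. This is a sketch, not the one right way: title_to_slug() and render_article() are hypothetical helpers, and it assumes content:encoded already holds the post's HTML and should be emitted unchanged.

```php
<?php
// Hypothetical helper: reduce a title to lowercase ASCII letters, digits
// and hyphens so it is safe to use inside an id attribute.
function title_to_slug($title)
{
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

// Hypothetical helper: build the article markup with the title escaped.
function render_article($titleString, $contentString)
{
    $id    = 'article-' . title_to_slug($titleString);
    $title = htmlspecialchars($titleString, ENT_QUOTES, 'UTF-8');

    // Assumption: $contentString is already HTML, so it is not escaped.
    return '
<div class="article-container" id="' . $id . '">
    <a href="#top" class="top-link">Back to the Top</a>
    <h2>' . $title . '</h2>
    <div class="articles">' . $contentString . '</div>
</div>';
}

echo render_article('Tom & Jerry: "Best" Episodes', '<p>Post body.</p>');
```

Two different titles can reduce to the same slug, so if the ids have to be unique you would still need a suffix such as a counter or the post ID.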
Francis Avila
  • Ah okay. So that's how I check the lengths. But, my real question is how do I check each one, rather than the whole document? – Batfan Dec 15 '11 at 15:45
  • What do you mean by "each one"? Doesn't your second code block do what you want? Then just move its `strlen()` checks to the inner loop of your first code block. – Francis Avila Dec 15 '11 at 16:07
  • It did, but SimpleXML by itself used too much memory. Since this new combo method does not use the 'foreach' functionality, I'm unsure of how to check each of the specified nodes for these 2 conditions. – Batfan Dec 15 '11 at 16:15
  • SimpleXML for a *single item* uses too much memory? You do the check the same way for all SimpleXML items as you do for only one. – Francis Avila Dec 15 '11 at 16:28
  • Sorry, guess I didn't specifically mention that in the post. The WP export XML file I'm loading is 14+ MB gzipped (77+ MB uncompressed), with around 9,000 records. I was constantly running into memory errors running just SimpleXML – Batfan Dec 15 '11 at 16:48
  • I'm trying to check all 9,000 of these records individually, for these 2 conditions. – Batfan Dec 15 '11 at 16:57
  • I added a complete example. What I did is so straightforward that I think I may misunderstand what you are asking... – Francis Avila Dec 15 '11 at 17:47
  • Hmmm, that is giving me a warning and an error. -- Warning: simplexml_import_dom(): Imported Node must have associated Document -- Fatal Error: Call to a member function children() on a non-object – Batfan Dec 15 '11 at 18:13
  • Hmm, perhaps older versions of PHP don't have the argument to `expand()`? Use your previous `importNode()` method. – Francis Avila Dec 15 '11 at 18:27
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/5890/discussion-between-francis-avila-and-batfan) – Francis Avila Dec 15 '11 at 18:28