I have a large XML file that I want to parse and put into a database. For example, this file:

<aa>
    <bb>Some text goes here and <br /> some more on a new line
    there are other <junk/> tags that I want to keep ignoring
    </bb>
</aa>

My code below uses SimpleXML to parse the text content inside the bb tag, but it silently ignores the <br /> tag. How can I modify my code to accept <br/> but not <junk/>?

$xml = simplexml_load_file("ab.xml");
foreach( $xml->bb as $bb ) {
    // $bb now contains the text content of the element, but no tags
}
David
  • The result should read: "Some text goes here and <br /> some more on a new line there are other tags that I want to keep ignoring"
    – David Oct 01 '14 at 22:47
  • Ok, then edit the stripped values: `echo strip_tags($bb,"<br>");` Now all that it will keep is the `<br>` tags. Try it, you will see it will work. Whatever is inside the quotes it will keep, and it will strip any other tag. Believe me!!
    – Rasclatt Oct 01 '14 at 22:50
  • When I run my code, $bb = `"Some text goes here and some more on a new line there are other tags that I want to keep ignoring"` - there are no tags in it. – David Oct 01 '14 at 23:02
  • No, I'm saving it in a MySQL database. I discovered this when I viewed the database entry for certain data. – David Oct 01 '14 at 23:08
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/62298/discussion-between-david-and-rasclatt). – David Oct 01 '14 at 23:18
  • I eventually solved this by using dom_import_simplexml and the DOMElement class. – David Oct 02 '14 at 19:28
  • You should answer your own question so that people know how you solved it. I can see that one being hard to figure out from this thread. Also good to know if I ever have to parse XML! – Rasclatt Oct 02 '14 at 19:30

3 Answers

1

You could strip tags if you know which you want to keep and which you want to remove.

$xml = simplexml_load_file("ab.xml");
foreach( $xml->bb as $bb ) {
    // This will strip everything but <br>
    echo strip_tags($bb,"<br>");
}
Rasclatt
  • Perhaps my wording was unclear. `$bb` only contains the text content of `<bb>`, so `strip_tags` will have no effect. – David Oct 01 '14 at 22:42
  • If what you are looking to produce is: `Some text goes here and some more on a new line there are other tags that I want to keep ignoring`, then what I have will work. – Rasclatt Oct 01 '14 at 22:46
  • That's not it. $bb doesn't contain any tags; it silently removes both `<br />` and `<junk/>`. I want it to keep one but not the other.
    – David Oct 01 '14 at 22:47
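As the comments above point out, reading `$bb` as a string drops the markup before `strip_tags` ever sees it. One possible workaround, which is not part of the original answer, is to pass the serialized element from `asXML()` instead. The sketch below assumes the sample document from the question is loaded from an inline string; the `$kept` array is only an illustrative name.

```php
<?php
$xml = <<<BUFFER
<aa>
    <bb>Some text goes here and <br /> some more on a new line
    there are other <junk/> tags that I want to keep ignoring
    </bb>
</aa>
BUFFER;

$sx = simplexml_load_string($xml);

$kept = array(); // hypothetical name: one cleaned string per <bb>
foreach ($sx->bb as $bb) {
    // asXML() returns the element with its markup, unlike a string cast,
    // so strip_tags() has tags to work on. Allowing '<br>' also keeps '<br/>'.
    $kept[] = strip_tags($bb->asXML(), '<br>');
}

print_r($kept);
```

Note that `strip_tags` works on the raw string and knows nothing about XML, so entities such as `&amp;` stay escaped in the result.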
1

I couldn't solve my issue using SimpleXML alone, but I was successful using DOMElement with a recursive approach. Note that the tag selection criteria are inside the recursive function, and that the rebuilt tags do not carry over any attributes.

// SimpleXML can be used for the 'simple' cases
$xml = simplexml_load_file("file.xml");
$dom = dom_import_simplexml($xml);
// SimpleXML and DOM work with the same underlying data structure, so you can use them interchangeably

$bb_content = (string) $xml->bb;
// using SimpleXML, $bb_content is now: "Some text goes here and some more on a new line there are other tags that I want to keep ignoring"
// the <junk> tag is ignored, which is good; but the <br> tag is also ignored, which is bad


// the DOM method
foreach( $dom->childNodes as $node ) {
    // skip the whitespace-only text nodes between the <bb> elements
    if( $node->nodeType != XML_ELEMENT_NODE )
        continue;

    // $textContent holds the content of one <bb>, with <br/> kept and <junk/> dropped
    $textContent = parsePreserveTags($node);
}

function parsePreserveTags($domNode) {
    // we want to preserve tags (for example, html formatting like <br>)
    $result = '';
    if( $domNode->hasChildNodes() ) {
        foreach( $domNode->childNodes as $node ) {
            // The constant XML_ELEMENT_NODE is defined here http://php.net/manual/en/dom.constants.php
            // If node type is XML_ELEMENT_NODE it's a tag and it can have children.
            // Otherwise, just get the (text) value.
            if( $node->nodeType == XML_ELEMENT_NODE ) {
                // Throw away nodes that match certain criteria
                if( $node->nodeName == 'junk' )
                    continue;

                if( $node->hasChildNodes() ) {
                    // example: "<p>...</p>"
                    $result .= '<' . $node->nodeName . '>' . parsePreserveTags($node)
                        . '</' . $node->nodeName . '>';
                } else {
                    // example: "<br/>"
                    $result .= '<' . $node->nodeName . '/>';
                }
            } else {
                // example: plain text node
                $result .= $node->nodeValue;
            }
        }
    }
    return $result;
}
David
1

Since you can say exactly which elements you want removed, the easiest way is usually to query those elements with XPath and then remove them.

In SimpleXML:

$remove = '//junk'; // all <junk> tags anywhere

// simplexml
$sx = simplexml_load_string($xml);
foreach ($sx->xpath($remove) as $element) {
    unset($element->{0});
}

In DOMDocument:

$remove = '//junk'; // all <junk> tags anywhere

// dom
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query($remove) as $element) {
    $element->parentNode->removeChild($element);
}

Full Example (Demo):

<?php
/**
 * @link http://stackoverflow.com/a/26318711/367456
 * @link https://eval.in/204702
 */

$xml = <<<BUFFER
<aa>
    <bb>Some text goes here and <br /> some more on a new line
    there are other <junk/> tags that I want to keep ignoring
    </bb>
</aa>
BUFFER;

$remove = '//junk'; // all <junk> tags anywhere

// simplexml
$sx = simplexml_load_string($xml);
foreach ($sx->xpath($remove) as $element) {
    unset($element->{0});
}
$sx->asXML('php://output');

// dom
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query($remove) as $element) {
    $element->parentNode->removeChild($element);
}
$doc->save('php://output');

Output:

<?xml version="1.0"?>
<aa>
    <bb>Some text goes here and <br/> some more on a new line
    there are other  tags that I want to keep ignoring
    </bb>
</aa>
<?xml version="1.0"?>
<aa>
    <bb>Some text goes here and <br/> some more on a new line
    there are other  tags that I want to keep ignoring
    </bb>
</aa>
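
If only the inner markup of each `<bb>` is needed as a string (for example to store it in a database, as the question describes), the removal step can be followed by serializing the children of `<bb>` one by one. This is a minimal sketch under that assumption, not part of the original demo; the `/aa/bb` path and the `$rows` array are illustrative choices.

```php
<?php
$xml = <<<BUFFER
<aa>
    <bb>Some text goes here and <br /> some more on a new line
    there are other <junk/> tags that I want to keep ignoring
    </bb>
</aa>
BUFFER;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// remove every <junk> element, as in the DOM example above
foreach ($xpath->query('//junk') as $element) {
    $element->parentNode->removeChild($element);
}

$rows = array(); // hypothetical name: one inner-XML string per <bb>
foreach ($xpath->query('/aa/bb') as $bb) {
    $inner = '';
    foreach ($bb->childNodes as $child) {
        // saveXML($node) serializes a single node: text stays text, <br/> stays a tag
        $inner .= $doc->saveXML($child);
    }
    $rows[] = trim($inner);
}

print_r($rows);
```

Because the nodes are serialized rather than rebuilt by hand, any attributes on the kept tags survive as well.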
hakre