I have a problem parsing an XML file (a well-formed one, to be clear).

Consider an XML file like this:

<?xml version="1.0" encoding="utf-8" ?>
<root>
    <list>
        <item no="1">
            <title>Item's 1 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
        <item no="2">
            <title>Item's 2 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
    </list>
</root>

I need to get the contents of each item in the list and put them in an array. That is generally not a problem, but in this case I can't get my head around it.

The problem lies in the contents of <content>. It is a string with tags in between, and I can't find a way to extract it in full. SimpleXML returns/echoes just the string, with the <special> tags and everything inside them stripped out. Like this:

Some long content with inside.

Ideally, I'd like to get a string like this:

Some long content with <special>tags</special> inside

How do I get it?

Michal M
  • possible duplicate of [PHP SimpleXML get innerXML](http://stackoverflow.com/questions/1937056/php-simplexml-get-innerxml) – Gordon Jun 21 '11 at 15:48
  • I don't think you're supposed to mix text nodes with other nodes. Ideally your XML should be like `<![CDATA[Some long content with tags inside]]>`, which instructs the parser not to parse the content within the CDATA section (it is returned as is) – mkilmanas Jun 21 '11 at 15:57
  • @mkilmanas Well, that's what an application's API returns, so I have no choice there. – Michal M Jun 21 '11 at 15:59
  • @Gordon You might be right. Thanks for the link, will investigate. – Michal M Jun 21 '11 at 15:59
  • Well, the accepted solution suggests using a 3rd-party library. Personally, I'm not too fond of those non-native solutions, but that's just me. Anyway, if you want to investigate some more, you now know the term: innerXML. – Gordon Jun 21 '11 at 16:28

3 Answers

You could use DOMDocument which is built into PHP.

<?php

$xml = <<<END
<?xml version="1.0" encoding="utf-8" ?>
<root>
    <list>
        <item no="1">
            <title>Item's 1 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
        <item no="2">
            <title>Item's 2 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
    </list>
</root>
END;

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($xml);

$nodes = $doc->getElementsByTagName('content');

foreach ( $nodes as $node )
{
  $temp_doc = new DOMDocument('1.0', 'UTF-8');

  foreach ( $node->childNodes as $child )
    $temp_doc->appendChild($temp_doc->importNode($child, true));

  echo $temp_doc->saveHTML(); // Outputs: Some long content with <special>tags</special> inside
}

To select only the top-level "content" elements (in case there are other "content" elements nested inside them), you can use DOMXPath.

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->loadXML($xml); // $xml from the example above

$xpath = new DOMXPath($doc);

$nodes = $xpath->query('/root/list/item/content');

foreach ( $nodes as $node )
{
  $temp_doc = new DOMDocument('1.0', 'UTF-8');

  foreach ( $node->childNodes as $child )
    $temp_doc->appendChild($temp_doc->importNode($child, true));

  echo $temp_doc->saveHTML(); // Outputs: Some long content with <special>tags</special> inside
}
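
As a variation on the loops above, the temporary document can probably be avoided: `DOMDocument::saveXML()` accepts a node argument and serialises only that node, without the XML declaration. The sketch below is not part of the original answer. The helper name `innerXml` and the shape of the `$items` array are assumptions made for illustration, and it assumes every `<item>` has exactly one `<content>` child.

```php
<?php
// Hypothetical helper: serialise each child of $node and join the pieces,
// which yields the "inner XML" of the node.
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // With a node argument, saveXML() serialises only that node.
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8" ?>
<root>
    <list>
        <item no="1">
            <title>Item's 1 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
        <item no="2">
            <title>Item's 2 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
    </list>
</root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Collect every item into an array keyed by its "no" attribute.
$items = [];
foreach ($xpath->query('/root/list/item') as $item) {
    $items[$item->getAttribute('no')] = [
        'title'   => $xpath->evaluate('string(title)', $item),
        'content' => innerXml($xpath->query('content', $item)->item(0)),
    ];
}

print_r($items);
```

Because the serialisation goes through `saveXML()` rather than `saveHTML()`, the result follows XML rules, so an empty child element would come out as `<special/>`.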
Francois Deschenes

SimpleXML has poor support for mixed content (text nodes with element nodes as siblings): casting an element to a string drops its child elements. I suggest you use XMLReader instead.
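
The answer gives no code, so here is a minimal sketch of what the XMLReader approach might look like. It relies on `XMLReader::readInnerXml()`, which returns the markup inside the current element. The variable names and the flat `$contents` array are assumptions for illustration only.

```php
<?php
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8" ?>
<root>
    <list>
        <item no="1">
            <title>Item's 1 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
        <item no="2">
            <title>Item's 2 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
    </list>
</root>
XML;

$reader = new XMLReader();
$reader->XML($xml); // parse from a string; use open() for a file or URL

$contents = [];
while ($reader->read()) {
    // Stop on every opening <content> tag and grab everything inside it.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'content') {
        $contents[] = $reader->readInnerXml();
    }
}
$reader->close();

print_r($contents);
```

XMLReader streams the document instead of loading it all into memory, which may matter if the API response is large.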

Erlock

You could use SimpleXML's asXML function. It returns the node it is called on as an XML string:

$xml = simplexml_load_file($file);
foreach($xml->list->item as $item) {
    $content = $item->content->asXML(); // the element is named <content>, not <contents>
    echo $content."\n";
}

will print:

<content>Some long content with <special>tags</special> inside</content>
<content>Some long content with <special>tags</special> inside</content>

It's a little ugly, but you could then clip off the <content> and </content> tags with substr:

$content = substr($content,9,-10);
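
The offsets 9 and -10 are the lengths of `<content>` and `</content>`, so they would break if the element ever carried attributes or had a different name. A slightly more tolerant sketch locates the end of the opening tag and the start of the closing tag instead. The helper name `innerXmlFromAsXml` is hypothetical, and the approach assumes the serialised opening tag contains no literal `>` before its end.

```php
<?php
// Hypothetical helper: return the markup between an element's opening
// and closing tags, based on the string produced by asXML().
function innerXmlFromAsXml(SimpleXMLElement $element): string
{
    $xml   = $element->asXML();
    $start = strpos($xml, '>') + 1; // first character after the opening tag
    $end   = strrpos($xml, '</');   // first character of the closing tag
    if ($end === false || $end < $start) {
        return '';                  // self-closing element such as <content/>
    }
    return substr($xml, $start, $end - $start);
}

$xml = simplexml_load_string(<<<'XML'
<root>
    <list>
        <item no="1">
            <title>Item's 1 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
        <item no="2">
            <title>Item's 2 title</title>
            <content>Some long content with <special>tags</special> inside</content>
        </item>
    </list>
</root>
XML
);

// Build an array of inner XML strings keyed by the item's "no" attribute.
$contents = [];
foreach ($xml->list->item as $item) {
    $contents[(string) $item['no']] = innerXmlFromAsXml($item->content);
}

print_r($contents);
```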
ben