1

I need to store content in an XML database. Some data in the database looks like this:

<item>
    <span class ="person">Henry 8<sup>th</sup></span>
</item>

<item>
    <span class="company">Berkley & Jensen</span>
</item>

I need to load the data into a DOM object with loadXML() and then pass it to an XSL stylesheet, where it is further manipulated using XPath and CSS. When I load the data, the code breaks because of the '&'. I do not want to convert all entities, because I need to use CSS on <sup> and XPath on the 'class' attribute, and I suspect that encoded entities will cause them to fail. How should I store and retrieve the illegal characters?

Because of the comments, I am providing a sample PHP script. If you add the PHP tags, it should run. Thank you for the CDATA suggestion; I have used it to demonstrate the problem. If I use the 'block' tag as the target for the XPath, it works fine, but if I use the 'span' tag, it prints nothing.

$xsl = <<<XSL
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template name="doContent" match="/">

<div class="story">
  <xsl:for-each select="//body/block">
    <xsl:copy-of select="." />
  </xsl:for-each>
</div>

</xsl:template>

</xsl:stylesheet>     
XSL;

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<content id="test" >
  <headline>test</headline>
  <author>test</author>
  <body>
    <block id="1"><![CDATA[<span class="normal"><p>1</p></span>]]></block>
    <block id="2"><![CDATA[<span class=""><p>2</p></span>]]></block>
    <block id="3"><![CDATA[<span class ="person">Henry 8<sup>th</sup></span>]]></block>
    <block id="4"><![CDATA[<span class="company">Berkley & Jensen</span>]]></block>
    <block id="5"><![CDATA[<span class=""><p>5</p></span>]]></block>
    <block id="6"><![CDATA[<span class=""><p>6</p></span>]]></block>
  </body>
</content>
XML;

$xslDoc = new DOMDocument();
$xslDoc->loadXML($xsl);

$xmlDoc = new DOMDocument();
$xmlDoc->loadXML($xml);

$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
echo $proc->transformToXML($xmlDoc);
user1123382
  • Why do you believe that something would break? Have you tried? In your post you say "I need to use CSS on ". Where is this in your data? – Kevin Brown Nov 07 '13 at 08:26
  • There is no reason to assume two standard XML technologies would fail because the XML is using valid entities. Is the issue that your CSS or XPath is not following standards? – Anthony Nov 07 '13 at 09:16
  • The sup is wrapped around the 'th' for 8th. The CSS styles my HTML tags in the inner HTML, and the XPath finds elements with particular attributes for special treatment. I suspect that both will ignore the tags if they are CDATA or encoded, but I am not an XSL expert. This is just a sample of the data, but there are more with similar issues. – user1123382 Nov 07 '13 at 10:47

2 Answers

0

Wrap it in <![CDATA[]]>:

<item>
    <![CDATA[<span class="company">Berkley & Jensen</span>]]>
</item>

More on CDATA: What does <![CDATA[]]> in XML mean?
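
One caveat, which matches what the asker reports about the 'span' XPath printing nothing: markup inside a CDATA section is plain character data, so XPath cannot see it as elements. Here is a minimal sketch of the difference, assuming the stored '&' can instead be written as &amp;. The sample strings and the countSpans() helper are made up for illustration.

```php
<?php
// Hypothetical sample documents; only the storage of the inner markup differs.
$cdataXml   = '<body><block id="4"><![CDATA[<span class="company">Berkley & Jensen</span>]]></block></body>';
$escapedXml = '<body><block id="4"><span class="company">Berkley &amp; Jensen</span></block></body>';

// Hypothetical helper: count the <span> elements XPath can see in an XML string.
function countSpans(string $xml): int
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    return $xpath->query('//span')->length;
}

$cdataCount   = countSpans($cdataXml);   // CDATA content is text, not elements
$escapedCount = countSpans($escapedXml); // a real <span> with an escaped ampersand

echo "CDATA: $cdataCount, escaped: $escapedCount\n"; // prints "CDATA: 0, escaped: 1"
```

So CDATA gets the document to load, but escaping only the '&' is what keeps 'span', 'class' and 'sup' addressable.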

Community
michi
0

I was able to resolve my situation with a function that I created to sanitise the unwanted characters. You can try it with the sample XML that I gave above. Notice that I use loadHTML, NOT loadXML!

function clean_invalid_nodes($node)
{
  global $xpath, $xmlDoc;
  // Walk every direct child of $node.
  $nodes = $xpath->query("child::node()", $node);
  foreach ($nodes as $n)
  {
    // Recurse into child elements; only text nodes are rewritten.
    if ($n->nodeType == XML_ELEMENT_NODE) clean_invalid_nodes($n);
    elseif ($n->nodeType == XML_TEXT_NODE)
    {
      if (trim($n->nodeValue) != '')
      {
        // Replace non-empty text with an entity-encoded copy of itself.
        $newnode = $xmlDoc->createTextNode(htmlentities($xmlDoc->saveXML($n), ENT_SUBSTITUTE, 'utf-8'));
        $n->parentNode->replaceChild($newnode, $n);
      }
    }
  }
}

$xmlDoc = new DOMDocument();
@$xmlDoc->loadHTML($xml);
$xpath = new DomXPath($xmlDoc);

$nodes = $xpath->query("//span");
foreach ($nodes as $node) clean_invalid_nodes($node);
$out = $xpath->query("//html/body")->item(0);
echo $xmlDoc->saveXML($out);
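
For comparison, here is a rough sketch of a different approach that stays with loadXML: escape only the bare ampersands before parsing. This is not the function above, just an alternative under stated assumptions. escapeBareAmpersands() is a hypothetical helper, and it assumes the data uses no named entities beyond XML's five predefined ones (something like &nbsp; would come out as literal text).

```php
<?php
// Hypothetical helper: escape every '&' that does not already start one of
// XML's five predefined entities or a numeric character reference.
function escapeBareAmpersands(string $markup): string
{
    return preg_replace('/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/', '&amp;', $markup);
}

// Sample fragments from the question, stored as real markup rather than CDATA.
$raw = '<body>'
     . '<block id="3"><span class ="person">Henry 8<sup>th</sup></span></block>'
     . '<block id="4"><span class="company">Berkley & Jensen</span></block>'
     . '</body>';

$doc = new DOMDocument();
$doc->loadXML(escapeBareAmpersands($raw));
$xpath = new DOMXPath($doc);

// The class attribute and the nested <sup> remain addressable by XPath.
$company  = $xpath->query('//span[@class="company"]')->item(0)->textContent;
$supCount = $xpath->query('//span[@class="person"]/sup')->length;

echo $company, "\n";  // prints "Berkley & Jensen"
echo $supCount, "\n"; // prints "1"
```

Because the spans are real elements here, an XSL stylesheet could select them directly with something like //block/span.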
user1123382