To do that in PHP you first have to convert the document into a DOMDocument, so that you can use DOMXPath to address exactly the nodes whose whitespace you want to normalize. The XPath support in SimpleXMLElement is too limited to access text nodes as precisely as this operation requires.
An XPath query that selects all text nodes inside leaf elements, plus all attributes, is:
//*[not(*)]/text() | //@*
Given that $xml is a SimpleXMLElement, you could normalize the whitespace as in the following example:
$doc = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr */
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}
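To see the effect, here is a minimal sketch that runs the same loop on a small sample and reads the values back through SimpleXML. The sample document and the `name` attribute are made up for illustration. It relies on the DOM and SimpleXML objects sharing the same underlying document, so changes made via DOM are visible in $xml:

```php
<?php
// Hypothetical sample input: padded text in a leaf element and in an attribute.
$xml = simplexml_load_string('<data><field name="  a   b ">
    1
  </field></data>');

$doc = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    // Collapse runs of whitespace to one space, then strip the ends.
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

// Read the values back through the original SimpleXMLElement.
$value = (string) $xml->field;
$attr  = (string) $xml->field['name'];

var_dump($value); // string(1) "1"
var_dump($attr);  // string(3) "a b"
```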
You could perhaps extend this to all text nodes (as suggested in related Q&A), but in some circumstances that may require normalizing the document first. Because text() in XPath does not distinguish between text nodes and CDATA sections, you might want to skip nodes of that type (DOMCdataSection), or expand them into text nodes when loading the document (use the LIBXML_NOCDATA option for that), to achieve more useful results.
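The following sketch illustrates the CDATA point on a small sample of its own (the input string is an assumption, modeled on the second field of the full example). Note that DOMCdataSection extends DOMText, so the check has to test for DOMCdataSection specifically:

```php
<?php
// Hypothetical sample: one text node followed by one CDATA section.
$buffer = '<field>2 <![CDATA[ 34 ]]></field>';

// Default parsing: text() matches both nodes, and the CDATA section
// keeps its own node type.
$doc = new DOMDocument();
$doc->loadXML($buffer);
$xpath = new DOMXPath($doc);
$classes = [];
foreach ($xpath->query('//text()') as $node) {
    $classes[] = get_class($node);
}
print_r($classes); // DOMText, then DOMCdataSection

// With LIBXML_NOCDATA the CDATA content is loaded as plain text,
// so no DOMCdataSection nodes remain.
$plain = new DOMDocument();
$plain->loadXML($buffer, LIBXML_NOCDATA);
$plainXpath = new DOMXPath($plain);
$cdataCount = 0;
foreach ($plainXpath->query('//text()') as $node) {
    if ($node instanceof DOMCdataSection) {
        $cdataCount++;
    }
}
var_dump($cdataCount); // int(0)
```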
Also the data doesn't appear to be string, I need to append (string) before each variable. Why?
Because it is an object of type SimpleXMLElement. If you want the string value of such an object (element), you need to cast it to string. See also the following reference question:
And last but not least: don't trust print_r or var_dump when you use them on a SimpleXMLElement; they do not show the truth. For example, you could override __toString(), which could also solve your issue:
class TrimXMLElement extends SimpleXMLElement
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }
}
$xml = simplexml_load_string($buffer, 'TrimXMLElement');
print_r($xml);
Even though the override applies wherever the object is cast to string (e.g. with echo), the output of print_r still does not reflect these changes. So better not rely on it; it can never show the whole picture.
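To make the difference visible side by side, here is a small sketch that captures both results for the same element. The class name TrimmedElement and the sample input are hypothetical and only serve this illustration:

```php
<?php
// Hypothetical subclass, equivalent to the one shown above.
class TrimmedElement extends SimpleXMLElement
{
    public function __toString(): string
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }
}

$xml = simplexml_load_string("<data><field>\n  1\n</field></data>", 'TrimmedElement');

// A string cast goes through the overridden __toString().
$cast = (string) $xml->field;
var_dump($cast); // string(1) "1"

// print_r() builds its own view of the element and does not call __toString(),
// so compare its output with the cast value above.
$dump = print_r($xml, true);
echo $dump;
```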
Full example code to this answer (Online Demo):
<?php
/**
 * Remove starting and ending spaces from XML elements
 *
 * @link https://stackoverflow.com/a/31793566/367456
 */
$buffer = <<<XML
<data version="2.0">
<field>
1
</field>
<field something=" some attribute here... ">
2 <![CDATA[ 34 ]]>
</field>
</data>
XML;
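class TrimXMLElement extends SimpleXMLElement implements JsonSerializable
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }

    // Avoids the return-type deprecation notice on PHP 8.1+;
    // older PHP versions read the attribute line as a comment.
    #[\ReturnTypeWillChange]
    public function jsonSerialize()
    {
        // Normalize every string value of the array representation.
        $array = (array) $this;
        array_walk_recursive($array, function (&$value) {
            if (is_string($value)) {
                $value = trim(preg_replace('~\s+~u', ' ', $value), ' ');
            }
        });
        return $array;
    }
}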
$xml = simplexml_load_string($buffer, 'TrimXMLElement', LIBXML_NOCDATA);
print_r($xml);
echo json_encode($xml);
$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA);
$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->normalizeDocument();
$doc->normalize();
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr|DOMCdataSection */
    if ($node instanceof DOMCdataSection) {
        continue;
    }
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}
echo $xml->asXML();