Remove starting and ending spaces from XML elements

Question

How can I remove all whitespace before and after the value of an XML field?

<data version="2.0">

  <field> 

     1 

  </field>        

  <field something=" some attribute here... "> 

     2  

  </field>

</data>

Notice the spacing around 1, 2 and 'some attribute here...'. I want to remove it with PHP.

if(($xml = simplexml_load_file($file)) === false) die();

print_r($xml);

Also, the data doesn't appear to be a string; I need to prepend (string) to each variable. Why?

please see my answer at http://stackoverflow.com/questions/8200582/remove-newline-from-xml-element-value/8200664#8200664 for a possible solution — Gordon, Nov 20 '11 at 10:26

score 2 · Answer 1 · edited May 23 '17 at 11:58

To do that in PHP, first import the document into a DOMDocument so that you can address the nodes whose whitespace you want to normalize via DOMXPath. The XPath support in SimpleXMLElement is too limited to address text nodes precisely enough for this operation.

An XPath query that selects all text nodes within leaf elements, plus all attributes, is:

//*[not(*)]/text() | //@*

Given that $xml is a SimpleXMLElement, you could normalize the whitespace as in the following example:

$doc   = dom_import_simplexml($xml)->ownerDocument;
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr */
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

You could perhaps extend this to all text nodes (as suggested in the related Q&A), but that might require document normalization in some circumstances. Because text() in XPath does not distinguish between text nodes and CDATA sections, you might want to skip nodes of that type (DOMCdataSection) or expand them into text nodes when loading the document (use the LIBXML_NOCDATA option for that) to get more useful results.
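As a minimal sketch of the LIBXML_NOCDATA route (the input string is made up for illustration, and only leaf text nodes are touched here), the CDATA section is loaded as plain text, adjacent text nodes are merged with normalize(), and the whitespace is then collapsed in one pass:

```php
<?php
// Hypothetical input: a leaf element mixing plain text and a CDATA section.
$buffer = '<data><field> 2 <![CDATA[ 34 ]]> </field></data>';

// LIBXML_NOCDATA makes the parser deliver CDATA content as ordinary text.
$xml = simplexml_load_string($buffer, 'SimpleXMLElement', LIBXML_NOCDATA);

$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->normalize(); // merge adjacent text nodes so each leaf holds one text node

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()') as $node) {
    // Collapse runs of whitespace, then strip what is left at both ends.
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue));
}

$result = (string) $xml->field;
var_dump($result); // expected: the text and the former CDATA content joined, "2 34"
```

If you would rather keep CDATA sections untouched, load without LIBXML_NOCDATA and skip nodes that are instances of DOMCdataSection inside the loop.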

Also, the data doesn't appear to be a string; I need to prepend (string) to each variable. Why?

Because it's an object of type SimpleXMLElement. If you want the string value of such an object (element), you need to cast it to a string. See also the following reference question:

Forcing a SimpleXML Object to a string, regardless of context
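To illustrate the point about casting, here is a small sketch with a made-up one-field document. Property access on a SimpleXMLElement returns another SimpleXMLElement, and only the explicit cast yields a PHP string that trim() can then clean up:

```php
<?php
// Hypothetical one-field document, for illustration only.
$xml = simplexml_load_string('<data><field> 1 </field></data>');

$field = $xml->field;                 // still an object, not a string
var_dump(is_string($field));          // bool(false)
var_dump($field instanceof SimpleXMLElement); // bool(true)

$raw   = (string) $field;             // explicit cast gives the text content
$value = trim($raw);                  // whitespace is removed only after the cast

var_dump($raw);                       // string(3) " 1 "
var_dump($value);                     // string(1) "1"
```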

And last but not least: don't trust print_r or var_dump when you use them on a SimpleXMLElement; they don't show the truth. For example, you could override __toString(), which could also solve your issue:

class TrimXMLElement extends SimpleXMLElement
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }
}

$xml = simplexml_load_string($buffer, 'TrimXMLElement');

print_r($xml);

Even though the cast to string would normally apply (e.g. with echo), the output of print_r still would not reflect these changes. So it's better not to rely on it; it can never show the whole picture.

Full example code for this answer (Online Demo):

<?php
/**
 * Remove starting and ending spaces from XML elements
 *
 * @link https://stackoverflow.com/a/31793566/367456
 */

$buffer = <<<XML
<data version="2.0">

  <field>

     1

  </field>

  <field something=" some attribute here... ">

     2 <![CDATA[ 34 ]]>

  </field>

</data>
XML;

class TrimXMLElement extends SimpleXMLElement implements JsonSerializable
{
    public function __toString()
    {
        return trim(preg_replace('~\s+~u', ' ', parent::__toString()), ' ');
    }

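    #[\ReturnTypeWillChange] // avoids the PHP 8.1+ deprecation notice; a plain comment on PHP 7
    public function jsonSerialize()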
    {
        $array = (array) $this;

        array_walk_recursive($array, function(&$value) {
            if (is_string($value)) {
                $value  = trim(preg_replace('~\s+~u', ' ', $value), ' ');
            }
        });

        return $array;
    }
}

$xml = simplexml_load_string($buffer, 'TrimXMLElement', LIBXML_NOCDATA);

print_r($xml);
echo json_encode($xml);

$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA);

$doc = dom_import_simplexml($xml)->ownerDocument;
$doc->normalizeDocument();
$doc->normalize();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[not(*)]/text()|//@*') as $node) {
    /** @var $node DOMText|DOMAttr|DOMCdataSection */
    if ($node instanceof DOMCdataSection) {
        continue;
    }
    $node->nodeValue = trim(preg_replace('~\s+~u', ' ', $node->nodeValue), ' ');
}

echo $xml->asXML();

score 2 · Answer 2 · answered Sep 07 '11 at 17:33

You may want to use something like this:

$str = file_get_contents($file);
$str = preg_replace('~\s*(<([^>]*)>[^<]*</\2>|<[^>]*>)\s*~','$1',$str);
$xml = simplexml_load_string($str, 'SimpleXMLElement', LIBXML_NOCDATA);

I haven't tried this, but you can find more on this at http://www.lonhosford.com/lonblog/2011/01/07/php-simplexml-load-xml-file-preserve-cdata-remove-whitespace-between-nodes-and-return-json/.

Note that the spaces between the opening and closing tags (<x> _space_ </x>) and inside attribute values (<x attr=" _space_ ">) are actually part of the XML document's data (in contrast to the spaces between elements, as in <x> _space_ <y>), so I would suggest making the source you use a bit less messy with spaces.
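The distinction can be seen in a short sketch (the document and its names x, y and attr are made up for illustration). Whitespace between elements does not surface as data in SimpleXML, while whitespace inside an element or an attribute value is returned verbatim:

```php
<?php
// Hypothetical document with whitespace in all three positions.
$xml = simplexml_load_string('<data> <x attr=" v "> a </x> <y/> </data>');

// Whitespace between elements: only the two child elements are listed.
$childCount = count($xml->children());
var_dump($childCount);      // int(2)

// Whitespace inside an element or attribute value is kept as data.
$text = (string) $xml->x;
$attr = (string) $xml->x['attr'];
var_dump($text);            // string(3) " a "
var_dump($attr);            // string(3) " v "
```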

score 1 · Answer 3 · answered Sep 07 '11 at 17:29

Since simplexml_load_file() reads data into an array, you could do something like this:

function TrimArray($input){

    if (!is_array($input))
        return trim($input);

    return array_map('TrimArray', $input);
}

CodeCaster


No, it does not read data into an array; it creates a **SimpleXMLElement** out of it. And that object can be cast to a string (which is what happens when you call `trim` on it). – hakre Aug 03 '15 at 17:52
