8

I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another Stack Overflow question, line breaks in attribute values are valid XML (even though they are far from ideal).

Why are they lost? And how can I preserve them?

Here is a demo script (note that when the line breaks are not in an attribute, they are preserved).

PHP File with embedded XML

$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
    <data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
    <data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;

$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';

Output from print_r

SimpleXMLElement Object
(
    [data] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [Title] => Data Title
                            [Remarks] => First line of the row. Followed by the second line. Even a third!
                        )

                )

            [1] => First line of the row.
Followed by the second line.
Even a third!
        )

)
Joshua
  • You should ask this question on the PHP homepage. I guess it's because it's a SIMPLE XML parser. – jbasko Sep 21 '09 at 23:25
  • Can you explain a bit more what you mean by the PHP homepage? – Joshua Sep 22 '09 at 00:47
  • Initially your question was "Why SimpleXML does what it does?" That's something you can ask its developers, not its users. – jbasko Sep 22 '09 at 00:55
  • Gotcha - thanks for the recommendation, Zilupe. Now that bobince has answered "Why SimpleXML does what it does?" I think I'll keep this on stackoverflow so that hopefully someone can add on with what other options I have to keep line breaks! – Joshua Sep 22 '09 at 01:02

6 Answers

13

Using SimpleXML, the line breaks seem to be lost.

Yes, that is expected... in fact it is required of any conformant XML parser that newlines in attribute values represent simple spaces. See attribute value normalisation in the XML spec.

If there was supposed to be a real newline character in the attribute value, the XML should have included a &#10; character reference instead of a raw newline.
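
As a minimal sketch of the difference (the attribute names `Raw` and `Escaped` are made up for illustration), the following parses one attribute containing a raw line break and one containing the `&#10;` character reference:

```php
<?php
// Illustration only: compare a raw line break with a &#10; character
// reference inside attribute values. Attribute names are hypothetical.
$xml = <<<XML
<Rows>
    <data Raw='one
two' Escaped='one&#10;two' />
</Rows>
XML;

$rows = new SimpleXMLElement($xml);

$raw     = (string) $rows->data['Raw'];     // raw line break is normalised to a space
$escaped = (string) $rows->data['Escaped']; // character reference survives as a line feed

var_dump($raw === "one two");      // bool(true)
var_dump($escaped === "one\ntwo"); // bool(true)
```

The raw line break comes back as a single space, while the character reference comes back as a real line feed.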

bobince
  • To clarify just a little bit: the newlines are *VALID*, but the XML parser (in order to be compliant with the spec) **MUST** reduce them down to a single space character (see item 3 of bobince's link). – TML Sep 22 '09 at 00:09
  • Thanks for the link bobince, and the clarification TML. So I suppose my question now becomes, how can I retain those line breaks? I am receiving this data from a SharePoint web service, so I can't change the XML to include `&#10;`. Is there a way to override the parser compliance in this regard? – Joshua Sep 22 '09 at 00:37
  • Unfortunately no, XML is quite inflexible on this point; if the web service is producing `\n` when it means `&#10;` it's a bug. (And a surprising one as this is a fundamental feature that any XML serialiser would be expected to get right... unless of course the service is mucking around with regex or string templating instead of using a proper XML library!) – bobince Sep 22 '09 at 01:03
  • Unless you have access to subclass or monkey-patch your XML parser it's not something you're going to be able to change... and I think SimpleXML uses libxml, which you've no hope of fiddling with from PHP. Pre-processing general XML input to put the `&#10;`s in is also a bit of a non-starter, as you'd have to write most of an XML parser already to be able to tell the difference between a newline in an attribute value and one directly inside a tag (where `&#10;` would be illegal). Hacks like Anthony's could work as a temporary fix if the exact formatting is very locked down at the moment. – bobince Sep 22 '09 at 01:09
  • (sorry about the `code` there, seems to be a flaw in SO's markup around `&...;` or something...) – bobince Sep 22 '09 at 01:10
4

The entity for a new line is &#10;. I played with your code until I found something that did the trick. It's not very elegant, I warn you:

// First, remove any indentation:
$xml = str_replace("     ","", $xml);
$xml = str_replace("\t","", $xml);

// Next, unify all new lines into Unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);

// Next, replace all new lines with the character reference:
$xml = str_replace("\n","&#10;", $xml);

// Finally, replace any new-line references between > and < with a real new line:
$xml = str_replace(">&#10;<",">\n<", $xml);

The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a < to open a new element.

This of course would fail if your next line had some text that was wrapped in a line-level element.
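
One way around that failure mode is to scan the string and only touch line breaks that sit inside a quoted attribute value. The following is a rough sketch, not a full XML tokenizer: `escapeAttributeNewlines()` is a hypothetical helper name, it requires PHP 7 or later for the type declarations, and it assumes the input contains no comments, CDATA sections, or DOCTYPE declarations (any of which could confuse the quote tracking).

```php
<?php
/**
 * Sketch: escape raw line breaks only inside quoted attribute values.
 * Hypothetical helper, not part of SimpleXML. Assumes the input has no
 * comments, CDATA sections, or DOCTYPE declarations.
 */
function escapeAttributeNewlines(string $xml): string
{
    // Treat CRLF and lone CR as a single LF, as an XML parser would.
    $xml = str_replace(array("\r\n", "\r"), "\n", $xml);

    $out   = '';
    $inTag = false; // true between '<' and the matching '>'
    $quote = null;  // the opening quote character while inside an attribute value

    $length = strlen($xml);
    for ($i = 0; $i < $length; $i++) {
        $ch = $xml[$i];
        if ($quote !== null) {
            if ($ch === $quote) {
                $quote = null;   // attribute value ends
            } elseif ($ch === "\n") {
                $ch = '&#10;';   // keep the line break through parsing
            }
        } elseif ($inTag) {
            if ($ch === '"' || $ch === "'") {
                $quote = $ch;    // attribute value starts
            } elseif ($ch === '>') {
                $inTag = false;  // tag ends
            }
        } elseif ($ch === '<') {
            $inTag = true;       // tag starts
        }
        $out .= $ch;
    }
    return $out;
}

// Usage with a cut-down version of the XML from the question.
$xml = "<Rows>\n    <data Title='Data Title' Remarks='First line.\nSecond line.' />\n    <data>Text\nnode</data>\n</Rows>";
$rows = new SimpleXMLElement(escapeAttributeNewlines($xml));

$remarks = (string) $rows->data[0]['Remarks']; // line break preserved
$text    = (string) $rows->data[1];            // text node left untouched
```

Because the scan works byte by byte and only reacts to ASCII characters, UTF-8 input passes through unchanged. Line breaks between elements and inside text nodes are never rewritten, so no clean-up pass is needed afterwards.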

Anthony
  • Very Clever!!! The only catch is that I'm working with massive SOAP-enveloped XML spewing from SharePoint web services, so it makes me a bit nervous to do something so brute force. Based on bobince's post though, it looks like I might have to go this direction. I wonder if there is any more elegant way to pull it off. – Joshua Sep 22 '09 at 00:38
1

Here is code to replace the new lines with the appropriate character reference in that particular XML fragment. Run this code prior to parsing.

$replaceFunction = function ($matches) {
    return str_replace("\n", "&#10;", $matches[0]);
};
$xml = preg_replace_callback(
    "/<data Title='[^']+' Remarks='[^']+'/i",
    $replaceFunction, $xml);
humbads
1

Assuming $xmlData is your XML string before it is sent to the parser, this should replace all newlines in attributes with the correct entity. I had the issue with XML coming from SQL Server.

$parts = explode("<", $xmlData); // split on "<" so that each part starts with a tag
array_shift($parts); // drop whatever precedes the first "<" (normally an empty string)
$newParts = array(); // create array for storing the rebuilt parts
foreach ($parts as $p)
{
    list($attr, $other) = explode(">", $p, 2); // $attr: tag name and attributes; $other: text after the tag
    $attr = str_replace(array("\r\n", "\r", "\n"), "&#10;", $attr); // escape CRLF, CR and LF line breaks
    $newParts[] = $attr.">".$other; // put the tag and its trailing text back together
}
$xmlData = "<".implode("<", $newParts); // rejoin the parts, restoring the "<" removed by explode()

Probably can be done more simply with a regex, but that's not a strong point for me.
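
For what it's worth, a regex version might look like the sketch below. The pattern is an assumption about the input, not a general XML solution: it treats any `="..."` or `='...'` run that contains no `<` as an attribute value, so similar-looking text inside comments or CDATA sections would also be rewritten. The sample `$xmlData` is made up for illustration.

```php
<?php
// Sketch: escape line breaks inside anything that looks like a quoted
// attribute value. Sample input is hypothetical.
$xmlData = "<Rows>\r\n<data Title='T' Remarks='a\r\nb' />\r\n</Rows>";

$xmlData = preg_replace_callback(
    '/(=\s*)("[^"<]*"|\'[^\'<]*\')/',
    function ($m) {
        // $m[1] is the "=" plus any whitespace, $m[2] is the quoted value.
        // Only the quoted value is rewritten.
        return $m[1] . str_replace(array("\r\n", "\r", "\n"), '&#10;', $m[2]);
    },
    $xmlData
);

$rows    = new SimpleXMLElement($xmlData);
$remarks = (string) $rows->data['Remarks']; // "a", a line feed, then "b"
```

Excluding `<` from the quoted run keeps a stray quote in text content from swallowing the tags that follow it, since a raw `<` is not allowed inside an attribute value anyway.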

Ryan
  • Exactly, the problem is that newlines are technically not valid in XML attributes. However, parsers tend to fix things a lot. In all cases, the invalid entities should be encoded. The best solution would be to fix the source, but this seems legit if that is not available. – Kevin Peno Nov 28 '12 at 22:57
0

This is what worked for me:

First, get the xml as a string:

    $xml = file_get_contents($urlXml);

Then do the replacement:

    $xml = str_replace(".\xe2\x80\xa9<as:eol/>",".\n\n<as:eol/>",$xml);

The "." and "<as:eol/>" are there because those are the places where I needed to add breaks in my case. The new lines ("\n") can be replaced with whatever you like.

After replacing, just load the xml-string as a SimpleXMLElement object:

    $xmlo = new SimpleXMLElement( $xml );

Et Voilà

German
0

Well, this question is old, but someone might eventually come to this page like I did. I took a slightly different approach, which I think is the most elegant of those mentioned.

Inside the XML, you put some unique marker that stands for a new line.

Change the XML to

<data Title='Data Title' Remarks='First line of the row. \n
Followed by the second line. \n
Even a third!' />

Then, once you have the desired node's value from SimpleXML as a string in $output, do something like this:

$findme = '\n'; // the literal marker (backslash followed by n), not a newline character
if (strpos($output, $findme) !== false) {
    $output = str_replace($findme, '<br/>', $output);
}

It doesn't have to be '\n'; it can be any unique string.