64

I need to save content containing newlines in XML attributes, not in text nodes. The encoding method should be chosen so that I can decode it in XSLT 1.0/EXSLT/XSLT 2.0.

What is the best encoding method?

Please suggest/give some ideas.

David Hall
Tommy
  • possible duplicate of [Are line breaks in XML attribute values valid?](http://stackoverflow.com/questions/449627/are-line-breaks-in-xml-attribute-values-valid) – Ciro Santilli OurBigBook.com Jul 28 '14 at 09:36
  • made an example for a similar question: http://stackoverflow.com/a/29782321/611007 – n611x007 Apr 22 '15 at 08:09
  • related: https://stackoverflow.com/questions/260436/ - related: https://stackoverflow.com/questions/449627/ - related: https://stackoverflow.com/questions/1289524/ – n611x007 Apr 22 '15 at 10:59

4 Answers

78

With a compliant DOM API, there is nothing you need to do. Simply save actual newline characters to the attribute; the API will encode them correctly on its own (see Canonical XML spec, section 5.2).

If you do your own encoding (i.e. replacing \n with &#10; before saving the attribute value), the API will encode your input again, resulting in &amp;#10; in the XML file.

Bottom line is, the string value is saved verbatim. You get out what you put in, no need to interfere.
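
To make the double-encoding problem concrete, here is a minimal sketch. The escapeAttributeValue helper is hypothetical: it only stands in for what a compliant serializer is expected to do internally and is not the API of any real library.

```javascript
// Hypothetical helper: escapes a string for use inside a double-quoted
// XML attribute, roughly the way a compliant serializer would.
function escapeAttributeValue(value) {
    return String(value)
        .replace(/&/g, '&amp;')   // must run first so later references stay intact
        .replace(/</g, '&lt;')
        .replace(/"/g, '&quot;')
        .replace(/\t/g, '&#9;')
        .replace(/\n/g, '&#10;')
        .replace(/\r/g, '&#13;');
}

var text = 'line 1\nline 2';

// Passing the raw string: the newline becomes a character reference.
var correct = escapeAttributeValue(text);

// Pre-encoding by hand: the "&" of the reference gets escaped a second time.
var doubleEncoded = escapeAttributeValue(text.replace(/\n/g, '&#10;'));

console.log(correct);       // line 1&#10;line 2
console.log(doubleEncoded); // line 1&amp;#10;line 2
```

The second result is what ends up in the file when you encode the newline yourself and the API then encodes it again.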

However… some implementations are not compliant. For example, they will encode & characters in attribute values, but forget about newline characters or tabs. This puts you in a losing position since you can't simply replace newlines with &#10; beforehand.

These implementations will save newline characters unencoded, like this:

<xml attribute="line 1
line 2" />

Upon parsing such a document, each literal newline in an attribute is normalized into a single space (again, in accordance with the spec), and thus the newlines are lost.

Saving (and retaining!) newlines in attributes is impossible in these implementations.
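
The difference between a literal newline and a character reference can be illustrated with a rough simulation of what a parser does to a raw attribute value. This is a sketch under stated assumptions: simulateAttributeParsing is a hypothetical name, it only handles numeric character references (not named entities such as &amp;), and it assumes an ordinary CDATA-type attribute.

```javascript
// Hypothetical simulation of how an XML parser treats a raw attribute value:
// line-end handling followed by attribute-value normalization.
function simulateAttributeParsing(rawValue) {
    return String(rawValue)
        // 1. Line ends in the document text are normalized to LF before parsing.
        .replace(/\r\n?/g, '\n')
        // 2. Each literal tab, LF or CR character becomes a single space.
        .replace(/[\t\n\r]/g, ' ')
        // 3. Numeric character references are replaced by the character they
        //    name; these characters are NOT turned into spaces.
        .replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, function (match, code) {
            var codePoint = code.charAt(0) === 'x'
                ? parseInt(code.slice(1), 16)
                : parseInt(code, 10);
            return String.fromCodePoint(codePoint);
        });
}

var lost = simulateAttributeParsing('line 1\nline 2');    // literal newline
var kept = simulateAttributeParsing('line 1&#10;line 2'); // character reference

console.log(JSON.stringify(lost)); // "line 1 line 2"
console.log(JSON.stringify(kept)); // "line 1\nline 2"
```

Only the value written with a character reference still contains a newline after parsing, which is why a serializer that writes literal newlines cannot round-trip them.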

Tomalak
  • Something I ran into: XML uses Unix-style newlines (LF). So if you want to store Windows-style newlines (CR+LF), you'll either need to convert the newlines after reading from your attribute, or escape the newlines somehow. Source: http://www.w3schools.com/xml/xml_syntax.asp – Joe Jun 29 '11 at 14:34
  • 3
    @Joe: Where do you take the info from that XML uses Unix-style newlines? As far as I can see, [the spec](http://www.w3.org/TR/xml/) does not restrict that. – Tomalak Jun 29 '11 at 14:49
  • @Tomalak Scroll down to the bottom of that link. Look for the heading "XML Stores New Line as LF". I noticed this in practice too--both the XmlWriter in C# and in a 3rd party component strips out the CR characters (leaving just LFs, like Unix). – Joe Jun 29 '11 at 15:12
  • 5
    @Joe: Sorry, I don't give w3schools a lot of credibility. If it was in the spec, that would be a different matter. – Tomalak Jun 29 '11 at 15:14
  • 3
    @Tomalak: Hmm, ok that's fair then. I saw the effects before I even looked it up. Here it is from the spec: http://www.w3.org/TR/xml/#sec-line-ends -- quoted "To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character." – Joe Jun 29 '11 at 15:22
  • @Joe: Ah, I see. Thanks for pointing this out. However, that's a slightly different issue. An attribute like `a="&#13;&#10;"` will not be affected by this rule - it does not contain actual CR or LF characters, only their *references*. After parsing, a CRLF sequence *will* be in the attribute value. And if you save a CRLF to an attribute value it *should* be serialized as `&#13;&#10;` again, unless I'm misinterpreting it. – Tomalak Jun 29 '11 at 15:33
  • @Tomalak: That's what was interesting. When we used the 3rd party component (this was our first attempt to keep CRLF), it actually *did* remove the entity. I couldn't tell you whether that's part of the spec or an extra step taken though. – Joe Jun 29 '11 at 15:54
  • @Tomalak The framework (System.Xml) implementation is not compliant. A possible fix is { var a = elem.Attributes[0]; a.InnerText = s; a.InnerXml = a.InnerXml.Replace("\r\n", "&#13;&#10;"); }, though it'll cost time, and gets a little more complicated if you need to handle any non-windows newlines. – The Dag Aug 01 '12 at 11:53
  • 2
    The .NET Framework's XmlWriter can be made to behave correctly and (reasonably) sensibly using [the NewLineHandling property](https://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.newlinehandling(v=vs.110).aspx) (by setting it to Entitize). Unfortunately, preservation of newlines is impossible in the XML DOM as implemented in Firefox - [a 2002 bug](https://bugzilla.mozilla.org/show_bug.cgi?id=169521) - while Chrome's implementation does the right thing. – MvanGeest Jun 25 '16 at 23:15
  • @Tomalak could you please help me with this question http://stackoverflow.com/questions/39039416/unable-to-use-here-document-delimited-by-eof-within-phing-xml? – Sandeepan Nath Aug 22 '16 at 06:39
  • @Sandeepan No, sorry. Bash escaping issues are not my strong suit. I strongly recommend not trying to embed XML into bash scripts. Use actual XML files and XML-aware tools (xmlstarlet, xsltproc, maybe even xmlsh). – Tomalak Aug 22 '16 at 07:04
  • @Tomalak no issues :) However, mine is an XML script which has bash embedded, i.e. opposite of what you are saying. – Sandeepan Nath Aug 22 '16 at 07:40
  • @Sandeepan Ah, all-right. You can't put actual newline characters into an attribute value, as my answer above explains. (Note that the first sentence talks about an API. Editing an XML file with a text editor is not the same thing.) My recommendation would be to write multi-line values into an element instead. – Tomalak Aug 22 '16 at 08:04
  • @Tomalak I am not sure what you mean by writing multi-line values. I checked http://stackoverflow.com/questions/449627/are-line-breaks-in-xml-attribute-values-allowed and it seems having line breaks in XML attribute values is valid. – Sandeepan Nath Aug 22 '16 at 09:17
  • @Sandeepan Of course they are. Just not as literal characters. Just like `<` is valid in an attribute. Just not as a literal character. Read my answer again. It's all in there, really. :) – Tomalak Aug 22 '16 at 09:45
  • 1
    Looks like the Java XMLStreamWriter (at least the internal com.sun.xml one) is in the category of "impossible to do": https://stackoverflow.com/questions/8331364/how-to-preserve-whitespace-in-attributes-when-using-xmlstreamwriter – lmsurprenant Jul 18 '22 at 21:47
48

You can use the character reference &#10; to represent a newline in an XML attribute. &#13; can be used to represent a carriage return. A Windows-style CRLF can be represented as &#13;&#10;.

This is legal XML syntax. See XML spec for more details.

Asaph
  • Is it a valid XML Character?? – Chathuranga Chandrasekara Jan 05 '10 at 05:49
  • I guess I have to use some encoding instead of an entity, as getAttribute won't work with a string containing a newline. Do you have any ideas? Will an entity solve the getAttribute problem? – Tommy Jan 05 '10 at 05:57
  • @Chathuranga Chandrasekara: Yes. It's valid XML. I updated my answer to include a link to the XML spec where these symbols are mentioned. – Asaph Jan 05 '10 at 05:57
  • @Tommy: What programming language/API are you using? What is this `getAttribute()` method you speak of? – Asaph Jan 05 '10 at 05:58
  • @Asaph: Javascript. client side: javascript. server side: php (xslt 1.0/esxlt), tomcat (xslt 2.0 saxon8). – Tommy Jan 05 '10 at 06:02
  • @Tommy: Are you *sure* `getAttribute` won't decode `&#10;` and convert it to a newline? It should work. Did you test it? – Asaph Jan 05 '10 at 06:08
  • @Asaph could you please help me with this question http://stackoverflow.com/questions/39039416/unable-to-use-here-document-delimited-by-eof-within-phing-xml? – Sandeepan Nath Aug 22 '16 at 06:41
0

A slightly different approach that has been helpful in some situations:

Placeholders and Find & Replace.

Before parsing, replace the line breaks in the raw data with a custom marker/placeholder of your own. After parsing, string-replace that marker with whatever line break character is effective, whether that's \n, &#10;, &#xD;, \u2028, or any of the various other line break characters out there.

This is useful when parsers like jQuery's $.parseXML() strip the unencoded line breaks. For example, you could use {LBREAK} as your line break marker, insert it while the data is still raw text, and replace it after the text has been parsed into an XML object. String.replaceAll() is a helpful method for this.

So here is a rough code concept with jQuery and replaceAll (I have not tested this code, but it shows the concept):

function onXMLHandleLineBreaks(_result) {
    // line break references that get lost; longest first so a CR+LF pair stays one break
    var lineBreaksThatGetLost = ['&#xD;&#10;', '&#10;', '&#xD;'];
    var lineBreakCharacterThatWorks = '\n';
    var marker = '{mylinebreakmarker}';
    var rawXMLText = String(_result); // hold as text only until line breaks are ready
    lineBreaksThatGetLost.forEach(function (lineBreak) {
        rawXMLText = rawXMLText.replaceAll(lineBreak, marker); // placemark the line breaks
    });
    var xmlObj = $.parseXML(rawXMLText); // to XML document
    $(xmlObj).find('*').each(function () {
        $.each(this.attributes, function (i, attr) { // add the line breaks back into attribute values
            attr.value = attr.value.replaceAll(marker, lineBreakCharacterThatWorks);
        });
    });
    return xmlObj; // XML document whose attribute values contain working line breaks
}

Of course, you can adjust the line break characters that work or don't work to suit your data, and loop over a whole set of line break characters that get lost rather than handling a single one.

OG Sean
0

A crude answer can be:

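// requires: using System.IO; using System.Xml;
XmlDocument xDoc = new XmlDocument();
xDoc.Load(@"Agenda.xml");
// do stuff with the XML
// set attribute values to "\r\n" (you need both characters to make a new line)
// turn the serialized character references back into literal CR/LF characters
string a = xDoc.InnerXml.Replace("&#xD;", "\r").Replace("&#xA;", "\n").Replace("><",">\r    \n<");
// write the result back as plain text; the using block flushes and disposes the writer
using (StreamWriter sDoc = new StreamWriter(@"Agenda.xml"))
{
    sDoc.Write(a);
}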

As you can see, this simply treats the XML as a string.

Asaph