7

According to this question:

Are line breaks in XML attribute values allowed?

line breaks in XML attributes are perfectly valid (although perhaps not recommended):

<xmltag1>
    <xmltag2 attrib="line 1
line 2
line 3">
    </xmltag2>
</xmltag1>

When I parse such XML using LINQ to XML (System.Xml.Linq), those line breaks are converted silently to space ' ' characters.

Is there any way to tell the XDocument.Load() parser to preserve those line breaks?

P.S.: The XML I'm parsing is written by third-party software, so I cannot change the way the line breaks are written.
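The behaviour can be reproduced with a short sketch. It assumes the default parser behind `XDocument.Parse`, and it uses explicit `\n` line feeds in the string literal so the result does not depend on the editor's line endings; the variable names are illustrative only.

```csharp
using System;
using System.Xml.Linq;

// Attribute value containing two literal line feeds, as in the XML above.
string xml = "<xmltag1><xmltag2 attrib=\"line 1\nline 2\nline 3\"></xmltag2></xmltag1>";

XDocument doc = XDocument.Parse(xml);
string value = doc.Root.Element("xmltag2").Attribute("attrib").Value;

// Each literal line feed has been replaced by a space character.
Console.WriteLine("|{0}|", value);        // |line 1 line 2 line 3|
Console.WriteLine(value.Contains("\n"));  // False
```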

Jonas Sourlier
  • If you are writing attributes programmatically, look at this article, which shows different ways of escaping a string: http://weblogs.sqlteam.com/mladenp/archive/2008/10/21/Different-ways-how-to-escape-an-XML-string-in-C.aspx Keep in mind that line breaks are not the only characters that must be escaped. – George Mamaladze Jul 13 '12 at 08:48

3 Answers

8

If you want line breaks in attribute values to be preserved, then they need to be written as character references, for example:

<foo bar="Line 1.&#10;Line 2.&#10;Line3."/>

Otherwise, the XML parser will normalize them to spaces, as the XML specification requires (http://www.w3.org/TR/xml/#AVNormalize).
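As a small check of the character-reference form, the following sketch parses the example above with `XElement.Parse` and confirms that the line feeds survive. It assumes the default LINQ to XML parser; the variable names are illustrative.

```csharp
using System;
using System.Xml.Linq;

// &#10; is a character reference to the line feed character (U+000A).
XElement foo = XElement.Parse("<foo bar=\"Line 1.&#10;Line 2.&#10;Line3.\"/>");
string bar = foo.Attribute("bar").Value;

// The referenced line feeds are kept in the parsed value.
Console.WriteLine(bar.Contains("\n"));      // True
Console.WriteLine(bar.Split('\n').Length);  // 3
```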

[edit] If you want to avoid the attribute value normalization then loading the XML with a legacy XmlTextReader helps:

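using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

string testXml = @"<foo bar=""Line 1.
Line 2.
Line 3.""/>";

XDocument test;
using (XmlTextReader xtr = new XmlTextReader(new StringReader(testXml)))
{
    // Turn off attribute value normalization so the line breaks survive.
    xtr.Normalization = false;
    test = XDocument.Load(xtr);
}
Console.WriteLine("|{0}|", test.Root.Attribute("bar").Value);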

That outputs

|Line 1.
Line 2.
Line 3.|
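The same idea can be wrapped in a small helper so that it can be reused wherever `XDocument.Load()` is called today. This is only a sketch: `LoadWithoutNormalization` is a hypothetical name, not part of the framework, and the input here uses explicit `\n` line feeds.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: loads an XDocument through a legacy XmlTextReader
// with attribute value normalization switched off.
static XDocument LoadWithoutNormalization(TextReader input)
{
    using (XmlTextReader reader = new XmlTextReader(input))
    {
        // Set explicitly rather than relying on the reader's default.
        reader.Normalization = false;
        return XDocument.Load(reader);
    }
}

XDocument loaded = LoadWithoutNormalization(
    new StringReader("<foo bar=\"Line 1.\nLine 2.\nLine 3.\"/>"));
string raw = loaded.Root.Attribute("bar").Value;

// The literal line feeds are still present in the attribute value.
Console.WriteLine(raw.Contains("\n"));  // True
```

For a file on disk, a `StreamReader` opened on the file can be passed as the `TextReader`.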
Martin Honnen
  • Thank you, but as I wrote in my question, the XML is written by third-party software, so I cannot change this. Maybe I need some kind of RegEx replace which converts the line breaks to `&#10;`. – Jonas Sourlier Jul 13 '12 at 08:49
  • I saw that note in your question but in this case there is a clear specification and the result you get is complying with the specification. So I wrote that answer to point out that the behaviour you get is the right one, even if not wanted in your case. I think a legacy `XmlTextReader` however will allow you to avoid the attribute value normalization, so I will edit my answer to show that. – Martin Honnen Jul 13 '12 at 09:21
1

According to MSDN:

Although XML processors preserve all white space in element content, they frequently normalize it in attribute values. Tabs, carriage returns, and spaces are reported as single spaces. In certain types of attributes, they trim white space that comes before or after the main body of the value and reduce white space within the value to single spaces. (If a DTD is available, this trimming will be performed on all attributes that are not of type CDATA.)

For example, an XML document might contain the following:

<whiteSpaceLoss note1="this is a note." note2="this
is
a
note.">

An XML parser reports both attribute values as "this is a note.", converting the line breaks to single spaces.
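The MSDN example can be checked with a short sketch. It assumes the default LINQ to XML parser and no DTD; the `note3` attribute with tab characters is an addition for illustration and is not part of the MSDN text.

```csharp
using System;
using System.Xml.Linq;

// note2 contains literal line feeds, note3 (added here) contains literal tabs.
XElement el = XElement.Parse(
    "<whiteSpaceLoss note1=\"this is a note.\" " +
    "note2=\"this\nis\na\nnote.\" " +
    "note3=\"this\tis\ta\tnote.\"/>");

string note1 = el.Attribute("note1").Value;
string note2 = el.Attribute("note2").Value;
string note3 = el.Attribute("note3").Value;

// After normalization all three values compare equal.
Console.WriteLine(note1 == note2);  // True
Console.WriteLine(note1 == note3);  // True
```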

I can't find anything about preserving whitespace in attribute values, and judging by this explanation I suspect it may not be possible.

mmdemirbas
0

The line breaks are not spaces (ASCII code 32) when parsed. If you step through each character, you will see that the apparent space ' ' is ASCII code 10, LF (line feed), so the line breaks are still present. If you need to, try replacing them with ASCII 13 in your code (Windows Forms text boxes do not show LF as a line break).
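One way to check which character actually ends up in the value is to print the code of every character. This is a minimal sketch assuming the default LINQ to XML parser; with that parser the line feed shows up as 32 rather than 10, which matches the comments below.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Attribute value with one literal line feed between 'a' and 'b'.
XElement tag = XElement.Parse("<xmltag2 attrib=\"a\nb\"/>");
string attrib = tag.Attribute("attrib").Value;

// Print the numeric code of every character in the parsed value.
string codes = string.Join(" ", attrib.Select(c => ((int)c).ToString()));
Console.WriteLine(codes);  // 97 32 98
```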

Cadburry
  • Thank you, I tested that before, and I really got two ASCII code 32 characters where the line breaks should be. I'm going to test that again to be sure. – Jonas Sourlier Jul 13 '12 at 08:53
  • I tested it again. Both `'\r'` and `'\n'` characters in the XML attribute are converted to `' '` spaces (ASCII code 32). – Jonas Sourlier Jul 13 '12 at 08:56
  • You're right, what I described applies to a CDATA section. I could not find a way to preserve the line breaks so far. Is replacing "32 32" with a line break an option for you? – Cadburry Jul 13 '12 at 09:21