I have to parse, modify and save an XML document that contains > in an attribute value.

Contrary to popular belief, it's perfectly fine for this character NOT to be replaced with &gt;, as described in the standard:

The right angle bracket (>) may be represented using the string &gt;, and must, for compatibility, be escaped using either &gt; or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.

(2.4 Character Data and Markup)

I cannot allow the parser to modify the attribute values since existing code relies on the current form (also it would make the XML rather unwieldy).

A sample would be:

<?xml version="1.0" encoding="utf-8"?>
<Foo Name="a->b">
</Foo>

Neither XmlDocument nor XDocument can load and save this document without changing the a->b to a-&gt;b.

Is there any way to work around this? I could fix the data in a post-processing step, but there are situations where > must be escaped so this seems rather error-prone.

Voo
  • I think there is no workaround for this, at least not in any available lib. This answer explains it: https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents – J.Memisevic Jun 21 '22 at 06:51
  • Do note that "existing code [relying] on the current form" is a bug, plain and simple -- while the encoding is not mandatory, of course escaped text in attributes should still be processed correctly, as in, unescaped. (XML is already unwieldy in and of itself, so that's not much of an argument.) – Jeroen Mostert Jun 21 '22 at 08:32
  • @Jeroen Sure, but that's out of my hands. And even if the parser could deal with it, it would make the file utterly unreadable and cause lots of unwanted changes which is a problem in itself. – Voo Jun 21 '22 at 09:06
  • A lame but workable approach is to shim things: have the XML code use a `Stream`(`Reader`/`Writer`) that proactively replaces `->` with something that's unique but not modified by escaping (like `-~`) on reading, and does the inverse on writing. Assuming the document was well-formed to begin with and didn't contain the replacement sequence, such a transformation would not introduce errors. – Jeroen Mostert Jun 21 '22 at 09:11
  • The file is not XML if it has the character "->" in the attribute. So you have a plain bad requirement and you must report the error. Either the requirement must be changed to say the file is not XML, or the bad character must be fixed. – jdweng Jun 21 '22 at 10:03
  • @jdweng: no, `>` is allowed unescaped in attributes (and element content, for that matter) and the OP actually took the time to quote the relevant part of the spec. .NET just doesn't like leaving it unescaped. – Jeroen Mostert Jun 21 '22 at 10:30
  • @jdweng If you had read the complete post and not just the sample you would've seen that that's a misconception; I even quoted the relevant part of the standard. – Voo Jun 21 '22 at 10:38
  • @Jeroen Yeah that's going to be my workaround, shouldn't even affect performance too much if done sensibly. Admittedly XML parsing is already complicated enough without making such optional flags available so I can understand that nobody implements this. – Voo Jun 21 '22 at 10:49
  • *Neither XmlDocument nor XDocument can **load** and save this document without changing the `a->b` to `a-&gt;b`* -- this is not strictly correct. `XDocument` can load your XML correctly and the value of `doc.Root.Attribute("Name").Value` will be `a->b` as required, not `a-&gt;b`. What is happening is that, when the document is re-serialized to XML, `XmlWriter` unconditionally escapes the `>` characters to `&gt;`. See https://dotnetfiddle.net/HqUzsW for confirmation. With that in mind, is that still a problem? – dbc Jun 21 '22 at 23:07
  • Even if the tag was saved like `<Foo Name="a-&gt;b">`, the attribute will be loaded correctly as `a->b` (you can confirm by changing the `>` to `&gt;` in the fiddle above), unless your old code used a non-standard parser. – qrsngky Jun 22 '22 at 03:34
  • @dbc Interesting, I just checked the output in the debugger which seems to use XmlWriter to create the representation. So is there a way to save the XDocument without using the XmlWriter? – Voo Jun 22 '22 at 07:05
  • @qrsngky Well yes that's the problem (also it's not "my" old code, because then I could change things..). But the XML files would be incredibly unreadable if converted, so this wouldn't be an option even if the parser could deal with it. – Voo Jun 22 '22 at 07:08
  • It's possible to traverse the document tree and convert each element to string (and deal recursively with the children), but you need to be careful with manually escaping. And if it's about readability, I guess you want to preserve the white space as well. – qrsngky Jun 22 '22 at 07:15

1 Answer

XDocument (and more generally XmlReader) will load XML without converting > characters to &gt; (In fact just the opposite happens -- &gt; will be unescaped to > by XmlReader). You may verify that by doing:

var xmlString = @"<?xml version=""1.0"" encoding=""utf-8""?><Foo Name=""a->b""></Foo>";
var doc = XDocument.Parse(xmlString);
Assert.AreEqual("a->b", doc.Root.Attribute("Name").Value); // Passes successfully

Demo fiddle #1 here.

Instead what you are seeing is that, when writing your XDocument back to XML, XmlWriter unconditionally escapes > as &gt; even when not strictly necessary. (An XmlWriter is always used to format an XNode to XML, either explicitly when you construct it yourself to write to some Stream or TextWriter, or internally by XNode.ToString().)
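Both halves of that behaviour can be seen side by side in a small sketch. It assumes nothing beyond `System.Xml.Linq`; the class name `RoundTripDemo` is made up for illustration:

```csharp
using System;
using System.Xml.Linq;

public static class RoundTripDemo
{
    public static void Main()
    {
        // Both spellings of the attribute parse to the same in-memory value.
        var raw = XDocument.Parse(@"<Foo Name=""a->b""></Foo>");
        var escaped = XDocument.Parse(@"<Foo Name=""a-&gt;b""></Foo>");

        Console.WriteLine(raw.Root.Attribute("Name").Value);     // a->b
        Console.WriteLine(escaped.Root.Attribute("Name").Value); // a->b

        // The escaping only reappears on output: ToString() goes through
        // an XmlWriter, which is where the question's a-&gt;b comes from.
        Console.WriteLine(raw.ToString());
    }
}
```

So the in-memory model is never the problem; only the serialization step needs to change.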

If you don't want this, you will have to subclass XmlWriter and modify the logic of XmlWriter.WriteString(String) to use your preferred escaping. However XmlWriter itself is abstract; the XmlWriter returned by XmlWriter.Create() is some internal concrete subclass which cannot be subclassed directly. Thus you will need to use the decorator pattern to wrap the writer returned by XmlWriter.Create():

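// Namespaces needed by the snippets in this answer
// (System.IO and System.Xml.Linq are used by the extension method further down)
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public class NoEndBracketEscapingXmlWriter : XmlWriterDecorator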
{
    bool OnlyForAttributes { get; }

    public NoEndBracketEscapingXmlWriter(XmlWriter baseWriter) : this(baseWriter, false) { }
    public NoEndBracketEscapingXmlWriter(XmlWriter baseWriter, bool onlyForAttributes) : base(baseWriter) => this.OnlyForAttributes = onlyForAttributes;
    
    public override void WriteString(string text)
    {
        //The right angle bracket (>) may be represented using the string &gt;, and must, for compatibility, be escaped using either &gt; or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
        if (WriteState == WriteState.Prolog || (WriteState != WriteState.Attribute && OnlyForAttributes))
        {
            base.WriteString(text);
            return;
        }
        
        int prevIndex = 0, index;
        char [] buffer = null;
        while ((index = text.IndexOf('>', prevIndex)) >= 0)
        {
            if (buffer == null)
                buffer = text.ToCharArray();
            if (WriteState != WriteState.Attribute && text.AsSpan().Slice(prevIndex, index - prevIndex).EndsWith("]]")) // Logic correction suggested by Jeroen Mostert https://stackoverflow.com/users/4137916/jeroen-mostert
            {
                // > appearing in "]]>" must still be escaped
                base.WriteChars(buffer, prevIndex, index - prevIndex + 1);
            }
            else
            {
                base.WriteChars(buffer, prevIndex, index - prevIndex);
                base.WriteRaw(">");
            }
            prevIndex = index + 1;
        }

        if (buffer == null)
            base.WriteString(text);
        else if (prevIndex < buffer.Length)
            base.WriteChars(buffer, prevIndex, buffer.Length - prevIndex);
    }
}

public class XmlWriterDecorator : XmlWriter
{
    // Taken from this answer https://stackoverflow.com/a/32150990/3744182
    // by https://stackoverflow.com/users/3744182/dbc
    // To https://stackoverflow.com/questions/32149676/custom-xmlwriter-to-skip-a-certain-element
    // NOTE: async methods not implemented
    readonly XmlWriter baseWriter;

    public XmlWriterDecorator(XmlWriter baseWriter) => this.baseWriter = baseWriter ?? throw new ArgumentNullException(nameof(baseWriter));

    protected virtual bool IsSuspended { get { return false; } }

    public override WriteState WriteState => baseWriter.WriteState;
    public override XmlWriterSettings Settings => baseWriter.Settings;
    public override XmlSpace XmlSpace => baseWriter.XmlSpace;
    public override string XmlLang => baseWriter.XmlLang;
    public override void Close() => baseWriter.Close();

    public override void Flush() => baseWriter.Flush();

    public override string LookupPrefix(string ns) => baseWriter.LookupPrefix(ns);

    public override void WriteBase64(byte[] buffer, int index, int count)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteBase64(buffer, index, count);
    }

    public override void WriteCData(string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteCData(text);
    }

    public override void WriteCharEntity(char ch)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteCharEntity(ch);
    }

    public override void WriteChars(char[] buffer, int index, int count)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteChars(buffer, index, count);
    }

    public override void WriteComment(string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteComment(text);
    }

    public override void WriteDocType(string name, string pubid, string sysid, string subset)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteDocType(name, pubid, sysid, subset);
    }

    public override void WriteEndAttribute()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEndAttribute();
    }

    public override void WriteEndDocument()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEndDocument();
    }

    public override void WriteEndElement()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEndElement();
    }

    public override void WriteEntityRef(string name)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteEntityRef(name);
    }

    public override void WriteFullEndElement()
    {
        if (IsSuspended)
            return;
        baseWriter.WriteFullEndElement();
    }

    public override void WriteProcessingInstruction(string name, string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteProcessingInstruction(name, text);
    }

    public override void WriteRaw(string data)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteRaw(data);
    }

    public override void WriteRaw(char[] buffer, int index, int count)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteRaw(buffer, index, count);
    }

    public override void WriteStartAttribute(string prefix, string localName, string ns)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteStartAttribute(prefix, localName, ns);
    }

    public override void WriteStartDocument(bool standalone) => baseWriter.WriteStartDocument(standalone);

    public override void WriteStartDocument() => baseWriter.WriteStartDocument();

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteStartElement(prefix, localName, ns);
    }

    public override void WriteString(string text)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteString(text);
    }

    public override void WriteSurrogateCharEntity(char lowChar, char highChar)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteSurrogateCharEntity(lowChar, highChar);
    }

    public override void WriteWhitespace(string ws)
    {
        if (IsSuspended)
            return;
        baseWriter.WriteWhitespace(ws);
    }
}   

And then you could use it e.g. in the following extension method:

public static class XNodeExtensions
{
    public static string ToStringNoEndBracketEscaping(this XNode node)
    {
        if (node == null)
            throw new ArgumentNullException(nameof(node));
        using var textWriter = new StringWriter();
        using (var innerWriter = XmlWriter.Create(textWriter, new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true }))
        using (var writer = new NoEndBracketEscapingXmlWriter(innerWriter))
        {
            node.WriteTo(writer);
        }
        return textWriter.ToString();
    }
}

And now if you do

var newXml = doc.ToStringNoEndBracketEscaping();

The result will be

<Foo Name="a->b"></Foo>

Demo fiddle #2 here.
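For reference, the `]]>` rule that `WriteString` applies above can also be looked at in isolation. This is a rough sketch rather than the writer's actual code: `GtEscapingSketch` and `EscapeContent` are hypothetical names, and it only handles `&`, `<` and `>` in element content (no attribute quoting, no checks for characters that are invalid in XML):

```csharp
using System;
using System.Text;

public static class GtEscapingSketch
{
    // Hypothetical helper: escapes element content, leaving '>' as-is
    // unless it would complete the sequence "]]>".
    public static string EscapeContent(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>':
                    // Only a '>' directly preceded by "]]" has to be escaped.
                    bool closesCData = i >= 2 && text[i - 1] == ']' && text[i - 2] == ']';
                    sb.Append(closesCData ? "&gt;" : ">");
                    break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(EscapeContent("a->b"));  // a->b
        Console.WriteLine(EscapeContent("x]]>y")); // x]]&gt;y
        Console.WriteLine(EscapeContent("1 < 2")); // 1 &lt; 2
    }
}
```

Inside an attribute value the `]]>` check is not needed at all, since the rule only applies to content; that is why the writer above skips it when `WriteState` is `Attribute`.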

dbc
  • Your logic is not correct. The spec is wrong here in mentioning spaces around " ]]> " -- these are clearly not semantically relevant, as the CDATA specification shows (no spaces there). As a result `]]>` ends up rewritten to `]]>`, which is wrong in two ways: the `>` inside the attribute is escaped even though this is not necessary (OP's requirement), while the `>` inside the text is *not* escaped even though this *is* necessary, resulting in invalid XML. – Jeroen Mostert Jun 23 '22 at 07:14
  • Fortunately fixing this is fairly simple: remove the extraneous space checks and track when we're in an attribute. [Demo](https://dotnetfiddle.net/BIMelq). OP's markup is probably simple enough not to need this, though. – Jeroen Mostert Jun 23 '22 at 07:24
  • @JeroenMostert - thanks for catching that my checks for ` ]]> ` should really have been for `]]>` with no spaces, and that the check wasn't necessary at all for attribute values as they are not "content". I copied the spec too mindlessly in the first case and overlooked *in content* in the second. Answer updated. – dbc Jun 23 '22 at 15:28
  • @JeroenMostert - as far as not escaping `>` in content text, I wasn't sure whether OP wanted that. Since an unescaped `>` in content (e.g. `>`) is well-formed, I added an optional argument `bool onlyForAttributes` to let that be controlled. I also updated my fiddle to make sure that all the XML generated is well-formed by doing `Assert.DoesNotThrow(() => XDocument.Parse(newXml))`. – dbc Jun 23 '22 at 15:29
  • @Jeroen Admittedly my markup is simple enough that I don't have to worry about CData, but it's still nice to have a correct answer on SO for posterity :-) – Voo Jun 25 '22 at 09:48