
My goal is to get a binary buffer (MemoryStream.ToArray() would yield byte[] in this case) of XML without losing the Unicode characters. I would expect the XML serializer to use numeric character references to represent anything that would be invalid in ASCII. So far, I have:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<x>“∞π”</x>");
        using (var buf = new MemoryStream())
        {
            using (var writer = new StreamWriter(buf, Encoding.ASCII))
                doc.Save(writer);
            Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
        }
    }
}

The above program produces the following output:

$ ./ConsoleApplication2.exe
<?xml version="1.0" encoding="us-ascii"?>
<x>????</x>

I figured out how to tell XmlDocument.Save() to use encoding="us-ascii": hand it a TextWriter whose TextWriter.Encoding is Encoding.ASCII. The documentation says "The encoding on the TextWriter determines the encoding that is written out." But how can I tell it to use numeric character references instead of its default lossy behavior? I have tested that doc.Save(Console.OpenStandardOutput()) writes the expected data (without an XML declaration) as UTF-8 with all of the correct characters, so I know that doc contains the information I wish to serialize. It's just a matter of figuring out the right way to tell the XML serializer that I want encoding="us-ascii" with numeric character references…

I understand that it may be non-trivial to write XML documents that are both encoding="us-ascii" and supportive of constructs like <π/> (I think this one might only be doable with external document type definitions; yes, I have tried, just for fun). But I thought it was quite common to output numeric character references for non-ASCII characters in an ASCII XML document, to preserve content and attribute-value character data in Unicode-unfriendly environments. I thought that using numeric character references to represent Unicode characters was analogous to using base64 to protect a blob, while keeping the content more readable. How do I do this with .NET?

Mark J. Bobak
binki
  • If you're just checking through Console, you might want to check Console.OutputEncoding. – tweellt Mar 14 '14 at 03:01
  • @tweellt But my goal was to serialize the XML to something that would survive in ASCII (which would imply it could survive whatever encoding Console.OutputEncoding is set to on an English system). – binki Mar 14 '14 at 03:43

1 Answer


You can use XmlWriter instead:

    var doc = new XmlDocument();
    doc.LoadXml("<x>“∞π”</x>");
    using (var buf = new MemoryStream())
    {
        using (var writer = XmlWriter.Create(buf,
            new XmlWriterSettings { Encoding = Encoding.ASCII }))
        {
            doc.Save(writer);
        }
        Console.Write(Encoding.ASCII.GetString(buf.ToArray()));
    }

Outputs:

<?xml version="1.0" encoding="us-ascii"?><x>&#x201C;&#x221E;&#x3C0;&#x201D;</x> 
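
If you want to convince yourself that nothing is lost, one option is to round-trip the bytes. The following is only a minimal sketch of that check, built on the XmlWriter approach above; the class name AsciiXml and the method name Serialize are made up for this illustration and are not part of any library. It checks that every byte is 7-bit and that reloading the bytes gives back the same text.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper for illustration only; not a framework type.
static class AsciiXml
{
    // Serializes the document with an ASCII-encoded XmlWriter, as in the answer above.
    public static byte[] Serialize(XmlDocument doc)
    {
        using (var buf = new MemoryStream())
        {
            var settings = new XmlWriterSettings { Encoding = Encoding.ASCII };
            using (var writer = XmlWriter.Create(buf, settings))
            {
                doc.Save(writer);
            }
            return buf.ToArray();
        }
    }
}

class Program
{
    static void Main()
    {
        var doc = new XmlDocument();
        // Same content as the question, written with escapes so this source file stays ASCII.
        doc.LoadXml("<x>\u201C\u221E\u03C0\u201D</x>");

        byte[] bytes = AsciiXml.Serialize(doc);

        // Every byte of the serialized form should fit in 7 bits.
        bool allAscii = Array.TrueForAll(bytes, b => b < 0x80);

        // Reload from the bytes and compare the text content with the original.
        var reloaded = new XmlDocument();
        using (var input = new MemoryStream(bytes))
        {
            reloaded.Load(input);
        }
        bool sameText = reloaded.DocumentElement.InnerText == doc.DocumentElement.InnerText;

        Console.WriteLine("all bytes ASCII: " + allAscii);
        Console.WriteLine("text preserved:  " + sameText);
    }
}
```

The difference from the StreamWriter version in the question is where the encoding is applied: there, the XML is produced as plain text and the ASCII encoder then replaces unencodable characters with '?', whereas an XmlWriter that is told the target encoding can escape those characters as numeric character references before they reach the encoder.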
Alexei Levenkov