135

Proper object disposal has been removed for brevity, but I'd be shocked if this is the simplest way to encode an object as UTF-8 in memory. There has to be an easier way, doesn't there?

var serializer = new XmlSerializer(typeof(SomeSerializableObject));

var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);

// Serialize into the memory stream through the UTF-8 StreamWriter
serializer.Serialize(streamWriter, entry);

// Rewind and read the whole stream back out as a string
memoryStream.Seek(0, SeekOrigin.Begin);
var streamReader = new StreamReader(memoryStream, System.Text.Encoding.UTF8);
var utf8EncodedXml = streamReader.ReadToEnd();
Garry Shutler
  • I'm confused...isn't the default encoding UTF-8? – flq Oct 05 '10 at 08:47
  • @flq, yes the default is UTF-8, though it doesn't matter much since he's reading it back into a string again so `utf8EncodedXml` is UTF-16. – Jon Hanna Oct 05 '10 at 09:09
  • @Garry, can you clarify, since Jon Skeet and I are answering different questions. Do you want the object serialised as UTF-8, or do you want an XML string that declares itself as UTF-8, and hence will have the correct declaration when later encoded in UTF-8? (in which case the simplest way is to have no declaration, since that's valid for both UTF-8 and UTF-16). – Jon Hanna Oct 05 '10 at 09:35
  • @Jon Reading back, there is ambiguity in my question. I had it outputting to a string mostly for debugging purposes. In practice I would likely be streaming bytes, either to disk or over HTTP which makes your answer more directly relevant to my problem. The main problem I had was the declaration of UTF-8 in the XML, but to be more accurate I should avoid the intermediary of a string so that I do actual send/persist UTF-8 bytes rather than a platform dependant (I think) encoding. – Garry Shutler Oct 05 '10 at 10:26
  • @Garry: You're unlikely to be sending a platform-dependent encoding unless you specify `Encoding.Default` anywhere. If you can provide more detail on what you're doing, it would help - but if you *can* just stream to bytes, then it would certainly avoid the hassle of the "odd" encoding declaration in a string. – Jon Skeet Oct 05 '10 at 10:31
  • The problem that prompted my question was the need to interact with a Java based web service. At the moment I am sending the request to it using Poster and the serialized, string version of objects. The service was refusing requests due to the UTF-16 declaration in the XML, hence the need to force a UTF-8 declaration. In the programmatic interface to the service I will be streaming the bytes into the request body so will skip any intermediary string-based steps. – Garry Shutler Oct 05 '10 at 10:37
  • @Garry, I think the clause "either to disk or over HTTP" in your comment there justifies the relative verbosity that you complain about; the fact that there are several different things one can do at that point is precisely why it should be flexible in terms of what happens then, and likewise at other points in the process, but this requires multi-stage verbosity so you can change what is happening at each stage. – Jon Hanna Oct 05 '10 at 10:41

4 Answers

332

No, you can use a StringWriter to get rid of the intermediate MemoryStream. However, to force the XML declaration to say UTF-8 you need to use a StringWriter subclass which overrides the Encoding property:

public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

Or if you're not using C# 6 yet:

public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding { get { return Encoding.UTF8; } }
}

Then:

var serializer = new XmlSerializer(typeof(SomeSerializableObject));
string utf8;
using (StringWriter writer = new Utf8StringWriter())
{
    serializer.Serialize(writer, entry);
    utf8 = writer.ToString();
}

Obviously you can make Utf8StringWriter into a more general class which accepts any encoding in its constructor - but in my experience UTF-8 is by far the most commonly required "custom" encoding for a StringWriter :)
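
A generalised version (the class name here is just for illustration) could look something like this:

public class EncodingStringWriter : StringWriter
{
    private readonly Encoding _encoding;

    public EncodingStringWriter(Encoding encoding)
    {
        _encoding = encoding;
    }

    // Only affects the encoding declared in the XML; the underlying
    // StringBuilder is still UTF-16, as strings always are.
    public override Encoding Encoding => _encoding;
}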

Now as Jon Hanna says, this will still be UTF-16 internally, but presumably you're going to pass it to something else at some point, to convert it into binary data... at that point you can use the above string, convert it into UTF-8 bytes, and all will be well - because the XML declaration will specify "utf-8" as the encoding.
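
For example, once you have the string:

// Genuine UTF-8 bytes; the declaration inside them already says "utf-8"
byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf8);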

EDIT: A short but complete example to show this working:

using System;
using System.Text;
using System.IO;
using System.Xml.Serialization;

public class Test
{    
    public int X { get; set; }

    static void Main()
    {
        Test t = new Test();
        var serializer = new XmlSerializer(typeof(Test));
        string utf8;
        using (StringWriter writer = new Utf8StringWriter())
        {
            serializer.Serialize(writer, t);
            utf8 = writer.ToString();
        }
        Console.WriteLine(utf8);
    }


    public class Utf8StringWriter : StringWriter
    {
        public override Encoding Encoding => Encoding.UTF8;
    }
}

Result:

<?xml version="1.0" encoding="utf-8"?>
<Test xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <X>0</X>
</Test>

Note the declared encoding of "utf-8" which is what we wanted, I believe.

Jon Skeet
    Even when you override the Encoding parameter on StringWriter it still sends the written data to a StringBuilder, so it's still UTF-16. And the string can only ever be UTF-16. – Jon Hanna Oct 05 '10 at 09:07
  • @Jon: Have you tried it? I have, and it works. It's the *declared* encoding which is important here; obviously internally the string is still UTF-16, but that doesn't make any difference until it's converted to binary (which could use any encoding, including UTF-8). The `TextWriter.Encoding` property is used by the XML serializer to determine which encoding name to specify within the document itself. – Jon Skeet Oct 05 '10 at 09:30
  • I tried it and I got a string in UTF-16. Maybe that's what the querent wants. – Jon Hanna Oct 05 '10 at 09:32
    @Jon: And what was the declared encoding? In my experience, that's what questions like this are *really* trying to do - create an XML document which declares itself to be in UTF-8. As you say, it's best not to consider the text to be in *any* encoding until you need to... but as the XML document *declares* an encoding, that's something you need to consider. – Jon Skeet Oct 05 '10 at 09:34
  • Yep, I've asked the querent to qualify. I read the question literally, but since the code he gives as an example produces a string maybe your read on it is correct (though in that case I'd suggest not having a declaration at all, since it would then be valid between UTF-8/UTF-16 re-encodings). – Jon Hanna Oct 05 '10 at 09:37
  • @Jon Hanna is there a way to serialize to XML without having a declaration at all? – Garry Shutler Oct 05 '10 at 10:29
    @Garry, simplest I can think of right now is to take the second example in my answer, but when you create the `XmlWriter` do so with the factory method that takes an `XmlWriterSettings` object, and have the `OmitXmlDeclaration` property set to `true`. – Jon Hanna Oct 05 '10 at 10:35
    +1 Your `Utf8StringWriter` solution is extremely nice and clean – Adriano Carneiro Aug 13 '12 at 19:11
  • @JonSkeet : I checked the difference between UTF-8 and UTF-16 here http://www.differencebetween.net/technology/difference-between-utf-8-and-utf-16/ and found that we should use UTF-8 for encoding. Kindly confirm whether that is correct. Can you please tell me in which situations we should use UTF-16? – wuhcwdc Jun 14 '13 at 11:49
  • @JonSkeet - **UTF-16 represents every character using two bytes. UTF-8 uses the one-byte ASCII encodings for ASCII characters.** Does this mean that if I encode a text file containing 10 characters, then with UTF-8 the file size becomes 10 * 8 = 80 bits (8 bits per character), and similarly 10 * 16 = 160 bits with UTF-16? Am I correct? – wuhcwdc Jun 14 '13 at 11:55
  • @PankajGarg: No, if all those characters are ASCII then the file will be 10 bytes in UTF-8 and 20 bytes in UTF-16. Remember bits != bytes. – Jon Skeet Jun 14 '13 at 16:09
  • Strange that StringWriter needs a subclass to use utf8, why is there no setter... – CRice Aug 02 '14 at 12:19
  • @CRice: I'd have preferred a constructor parameter... but yes, it's a bit annoying. – Jon Skeet Aug 02 '14 at 12:19
  • Hi @JonSkeet - big fan, but I'm afraid I can't get your Utf8StringWriter to compile in .NET 4.5. I couldn't use `=>` and had to instead create the actual getter with `public override Encoding Encoding { get { return Encoding.UTF8; }}`. Though then it worked a treat! Thanks! – Ian Grainger Jan 21 '16 at 16:14
    @IanGrainger: Indeed, that's C# 6 code (it was updated in November to use C# 6, not by me...) – Jon Skeet Jan 21 '16 at 16:16
  • This should be the answer. The generated XML shows the proper UTF encoding with this solution – hanzolo Mar 11 '16 at 19:09
  • very nice solution – Sergei G Oct 23 '20 at 00:35
59

Your code doesn't get the UTF-8 into memory, as you read it back into a string again, so it's no longer in UTF-8 but back in UTF-16 (though ideally it's best to consider strings at a higher level than any encoding, except when forced to do so).

To get the actual UTF-8 octets you could use:

var serializer = new XmlSerializer(typeof(SomeSerializableObject));

var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);

serializer.Serialize(streamWriter, entry);

byte[] utf8EncodedXml = memoryStream.ToArray(); // raw UTF-8 bytes (Encoding.UTF8 makes the StreamWriter emit a BOM first)

I've left out the same disposal that you left out. I slightly favour the following (with normal disposal left in):

var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using (var memStm = new MemoryStream())
using (var xw = XmlWriter.Create(memStm)) // UTF-8 by default when writing to a Stream
{
  serializer.Serialize(xw, entry);
  var utf8 = memStm.ToArray();
}

This is much the same amount of complexity, but it does show that at every stage there is a reasonable choice to do something else, the most pressing of which is to serialise somewhere other than to memory: to a file, a TCP/IP stream, a database, etc. All in all, it's not really that verbose.
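
For instance, serialising straight to a file rather than to memory might look something like this (the file name is just a placeholder):

var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using (var fileStream = File.Create("output.xml"))
using (var xw = XmlWriter.Create(fileStream)) // UTF-8 by default when writing to a Stream
{
  serializer.Serialize(xw, entry);
}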

Jon Hanna
  • Also, if you want to suppress the BOM you can use `XmlWriter.Create(memoryStream, new XmlWriterSettings { Encoding = new UTF8Encoding(false) })`. – ony Aug 21 '12 at 11:44
  • If someone (like me) needs to read the XML created like Jon shows, remember to reposition the memory stream to 0, otherwise you'll get an exception saying "Root element is missing". So do this: memStm.Position = 0; XmlReader xmlReader = XmlReader.Create(memStm) – Sudhanshu Mishra Jul 02 '15 at 01:24
17

Very good answer using inheritance; if you need the constructor that takes a StringBuilder, just remember to add it to the subclass and pass the builder through to the base class:

public class Utf8StringWriter : StringWriter
{
    public Utf8StringWriter(StringBuilder sb) : base (sb)
    {
    }
    public override Encoding Encoding { get { return Encoding.UTF8; } }
}
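
Usage might then look something like this, reusing `entry` from the question:

var sb = new StringBuilder();
var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using (var writer = new Utf8StringWriter(sb))
{
    serializer.Serialize(writer, entry);
}
// sb now holds the XML with an encoding="utf-8" declaration
string utf8Xml = sb.ToString();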
Sebastian Castaldi
5

I found this blog post which explains the problem very well, and defines a few different solutions:

(dead link removed)

I've settled for the idea that the best way to do it is to completely omit the XML declaration when in memory. It actually is UTF-16 at that point anyway, but the XML declaration doesn't seem meaningful until it has been written to a file with a particular encoding; and even then the declaration is not required. It doesn't seem to break deserialization, at least.

As @Jon Hanna mentions, this can be done with an XmlWriter created like this:

XmlWriter writer = XmlWriter.Create(output, new XmlWriterSettings { OmitXmlDeclaration = true });
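
Put together with the question's code, a sketch of serialising to a string with no declaration at all (where `output` above is a StringWriter) might look like this:

var serializer = new XmlSerializer(typeof(SomeSerializableObject));
var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

var output = new StringWriter();
using (var writer = XmlWriter.Create(output, settings))
{
    serializer.Serialize(writer, entry);
}

// No <?xml ... ?> declaration, so the text stays valid whether it is
// later written out as UTF-8 or UTF-16.
string xml = output.ToString();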
Eric J.
Dave Andersen