
I'm trying to create a piece of XML. I've generated the data classes with xsd.exe. The root class is MESSAGE.

So after creating a MESSAGE and filling all its properties, I serialize it like this:

serializer = new XmlSerializer(typeof(Xsd.MESSAGE));
StringWriter sw = new StringWriter();
serializer.Serialize(sw, response);
string xml = sw.ToString();

So far so good: the string xml contains valid (UTF-16 encoded) XML. Now I'd like to create the XML with UTF-8 encoding instead, so I do it like this:

Edit: forgot to include the declaration of the stream

serializer = new XmlSerializer(typeof(Xsd.MESSAGE));
using (MemoryStream stream = new MemoryStream())
{
    XmlTextWriter xtw = new XmlTextWriter(stream, Encoding.UTF8);
    serializer.Serialize(xtw, response);
    string xml = Encoding.UTF8.GetString(stream.ToArray());
}

And here comes the problem: Using this approach, the xml string is prepended with an invalid char (the infamous square).
When I inspect the char like this:

char c = xml[0];

I can see that c has a value of 65279.
Does anybody have a clue where this is coming from?
I can easily work around this by cutting off the first char:

xml = xml.Substring(1);

But I'd rather know what's going on than blindly cut off the first char.

Can anybody shed some light on this? Thanks!

fretje

2 Answers


Here's your code modified to not prepend the byte-order-mark (BOM):

var serializer = new XmlSerializer(typeof(Xsd.MESSAGE));
// Passing false tells UTF8Encoding not to emit the BOM (bytes EF BB BF).
Encoding utf8EncodingWithNoByteOrderMark = new UTF8Encoding(false);
using (MemoryStream stream = new MemoryStream())
{
    XmlTextWriter xtw = new XmlTextWriter(stream, utf8EncodingWithNoByteOrderMark);
    serializer.Serialize(xtw, response);
    string xml = Encoding.UTF8.GetString(stream.ToArray());
}
Chris W. Rea
  • `XmlTextWriter` has been [deprecated by Microsoft](https://msdn.microsoft.com/en-us/library/system.xml.xmltextwriter.aspx), so nowadays I would do `var xtw = XmlWriter.Create(stream, new XmlWriterSettings { Encoding = utf8EncodingWithNoByteOrderMark });` instead. – dbc Apr 21 '18 at 17:13
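
For completeness, here is a minimal sketch of the `XmlWriter.Create` variant from the comment above. The `Message` class is a hypothetical stand-in for the xsd.exe-generated `Xsd.MESSAGE`, and the property name `Body` is made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the generated Xsd.MESSAGE class.
public class Message
{
    public string Body { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var serializer = new XmlSerializer(typeof(Message));

        // UTF8Encoding(false) means "UTF-8, but do not write a BOM".
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };

        using (var stream = new MemoryStream())
        {
            // Disposing the writer flushes it; the stream stays open
            // because XmlWriterSettings.CloseOutput defaults to false.
            using (XmlWriter writer = XmlWriter.Create(stream, settings))
            {
                serializer.Serialize(writer, new Message { Body = "hello" });
            }

            string xml = Encoding.UTF8.GetString(stream.ToArray());
            Console.WriteLine(xml[0] == '<'); // True: no U+FEFF in front
        }
    }
}
```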

65279 is the Unicode byte order mark - are you sure you're getting 65249? Assuming it really is the BOM, you could get rid of it by creating a UTF8Encoding instance which doesn't use a BOM. (See the constructor overloads for details.)
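
To see where the character comes from, a small sketch like the following may help. It only uses `Encoding.GetPreamble()` and `Encoding.GetString()`; the class name `BomDemo` is made up for illustration.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    public static void Main()
    {
        // Encoding.UTF8 is configured to emit a preamble (the BOM).
        byte[] withBom = Encoding.UTF8.GetPreamble();
        Console.WriteLine(BitConverter.ToString(withBom)); // EF-BB-BF

        // UTF8Encoding(false) has an empty preamble, so writers emit no BOM.
        byte[] withoutBom = new UTF8Encoding(false).GetPreamble();
        Console.WriteLine(withoutBom.Length); // 0

        // GetString does not strip the BOM: the three bytes decode to
        // the single character U+FEFF, which is 65279 in decimal.
        string decoded = Encoding.UTF8.GetString(withBom);
        Console.WriteLine((int)decoded[0]); // 65279
    }
}
```

So the writer puts EF BB BF at the start of the stream, and decoding the whole buffer with `GetString` turns those bytes into the leading U+FEFF character.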

However, there's an easier way of getting UTF-8 out. You can use StringWriter, but a derived class which overrides the Encoding property. See this answer for an example.
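
A minimal sketch of that idea, assuming a hypothetical `Message` class in place of the generated `Xsd.MESSAGE`; the name `Utf8StringWriter` is also just an illustration, not necessarily what the linked answer uses:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical stand-in for the generated Xsd.MESSAGE class.
public class Message
{
    public string Body { get; set; }
}

// StringWriter.Encoding is read-only and reports UTF-16. Overriding it
// changes the encoding name written into the XML declaration.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

public static class Program
{
    public static void Main()
    {
        var serializer = new XmlSerializer(typeof(Message));
        using (var sw = new Utf8StringWriter())
        {
            serializer.Serialize(sw, new Message { Body = "hello" });
            string xml = sw.ToString();

            // The declaration now says utf-8, and since no byte stream
            // is involved there is no BOM at the start of the string.
            Console.WriteLine(xml);
        }
    }
}
```

Note that the string itself is still UTF-16 in memory, like every .NET string. Only the declaration changes, so the text still has to be encoded as UTF-8 when it is finally written out as bytes.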

Jon Skeet
  • I ran the code and got 65279, too. Probably a typo in the question. – Chris W. Rea Jun 09 '09 at 13:19
  • I don't find creating a new class necessarily *easier*... what I would find easier is that I could *set* the Encoding of a StringWriter without having to derive from it. – fretje Jun 09 '09 at 13:34
  • @fretje: Yes, but deriving a new class is easier than changing the .NET framework :) And the point about deriving a new class being easier than using XmlTextWriter is that you only have to do it in one place, ever. – Jon Skeet Jun 09 '09 at 13:54
  • @Jon: Agreed. I'll take this approach if I ever need this a second time in the same project ;-) – fretje Jun 09 '09 at 14:20