I am creating an XML file with the following code (the byte array returned by Serialize() is later written to a FileStream):

    public byte[] Serialize()
    {
        using (var stream = new MemoryStream())
        {
            WriteXmlToStream(stream);

            stream.Position = 0;                

            using (var reader = new StreamReader(stream))
            {
                string resultString = reader.ReadToEnd();
                return Encoding.UTF8.GetBytes(resultString);
            }
        }
    }

    private void WriteXmlToStream(MemoryStream stream)
    {
        var document = 
            new XDocument(
                new XElement("Coleta",
                    new XElement("Operador", Operador),
                    new XElement("Sujeito", Sujeito),
                    new XElement("Início", DataHora.ToString(Constantes.FormatoDataHora)),
                    new XElement("Descrição", Descrição),
                    // and so on
                    )
            );

        document.Save(stream);
    }

But when I open the saved file, the unicode characters are "wrong":

<?xml version="1.0" encoding="utf-8"?>
    <Coleta>
      <Operador>Nome do Operador do Sofware</Operador>
      <Sujeito>Nome Paciente de Teste</Sujeito>
      <Início>2015-05-19T02:24:10.10Z</Início>
      <Descrição>Coleta de teste para validação do formato de arquivo.</Descrição>
      <Sensores>
        <SensorInfo>
          <Sensor>
            <Nome />
            <PosiçãoAnatômica>NãoEspecificada</PosiçãoAnatômica>
            <Canais>
              <Canal>
              <!-- and so on -->

So what am I not doing, or doing wrong, and how should I fix it? I always have a hard time understanding these encoding peculiarities.

As mentioned in the comments, this happens because text editors are not opening the generated file with the correct encoding (UTF-8).

So my question is: how can I make the file itself declare its encoding, so that editors pick it up?

UPDATE: it seems like this answer might be relevant:

https://stackoverflow.com/a/3871822/401828

heltonbiker
  • How are you opening the file? It's probably just Notepad being rubbish... make sure you open it in UTF-8. – Jon Skeet May 19 '15 at 14:42
  • I've tried this and it works fine, so I can only re-iterate what Jon Skeet says above - how are you verifying the output? – Charles Mager May 19 '15 at 14:47
  • re the `new StreamReader()` etc - any reason you can't just use `return stream.ToArray();` ? – Marc Gravell May 19 '15 at 14:51
  • @JonSkeet Well, your tip worked: I opened it in Sublime Text, then clicked "File -> Reopen with Encoding -> UTF-8", and the problem is fixed. Now the question is: how can I create the file so that text editors automatically know it's encoded as UTF-8? – heltonbiker May 19 '15 at 14:52
  • @MarcGravell it has to do with the `Encoding.UTF8.GetBytes(resultString);` part, but this is legacy already, since I was not using XDocument before. Gonna try your suggested change, thanks. – heltonbiker May 19 '15 at 14:54
  • @heltonbiker frankly, that's the wrong way (and too late) to try and apply encoding; if you want a specific encoding, it would be better to use a TextWriter when writing the data; however, I strongly suspect it defaults to UTF-8 *anyway*, so it shouldn't need anything – Marc Gravell May 19 '15 at 14:56
  • I've looked it up, and the encoding is picked out from the `.Declaration` – Marc Gravell May 19 '15 at 14:58
  • @MarcGravell I tried your simplification, but now it saves a length-encoded string (with three bytes before the actual string), and that's not what I need. As I said, encodings are something I'm not yet familiar with, so I appreciate any suggestion. The structure of the code in my program only requires that the `Serialize()` method return a byte array representing an XML string encoded as UTF-8. – heltonbiker May 19 '15 at 14:58
  • @heltonbiker I very much doubt that is length encoding; more likely: that's a BOM (the sequence 0xEF,0xBB,0xBF). Do you have a strong objection to a BOM? But give me a sec, I'll see what I can find – Marc Gravell May 19 '15 at 15:01
  • BTW: you should *not* attempt to format dates manually for xml – Marc Gravell May 19 '15 at 15:03

1 Answer

If you want fine-grained control over the encoding, you probably want to control the TextWriter yourself; in the example below I'm using UTF-8 without a BOM. Alternatively, if possible, you could write directly to a file via a FileStream.

using System;
using System.IO;
using System.Text;
using System.Xml.Linq;


class Program
{
    static void Main()
    {
        var bytes = new Program().Serialize();
        File.WriteAllBytes("my.xml", bytes);
    }
    public byte[] Serialize()
    {
        using (var stream = new MemoryStream())
        {
            WriteXmlToStream(stream);

            return stream.ToArray();
        }
    }

    private void WriteXmlToStream(Stream stream)
    {
        var document =
            new XDocument(
                new XElement("Coleta",
                    new XElement("Operador", "foo"),
                    new XElement("Sujeito", "bar"),
                    new XElement("Início", DateTime.Now),
                    new XElement("Descrição", "Descrição")
                    // and so on
                    )
                );
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            document.Save(writer);
        }
    }
}

The above works fine, and encodes correctly.
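To see what the `UTF8Encoding` argument changes, here is a minimal sketch that saves the same document twice, once without and once with a BOM, and compares the leading bytes. `Save` is a hypothetical helper introduced only for this illustration (it is not part of the code above), and the snippet assumes C# 9 or later for top-level statements. A BOM is one of the hints many editors use to detect UTF-8, so `new UTF8Encoding(true)` may be worth trying if automatic detection matters more than a BOM-free file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper (not from the answer): saves a document to a byte
// array using the supplied encoding.
static byte[] Save(XDocument doc, Encoding encoding)
{
    using (var stream = new MemoryStream())
    {
        using (var writer = new StreamWriter(stream, encoding))
        {
            doc.Save(writer);
        }
        // MemoryStream.ToArray() still works after the writer has closed the stream.
        return stream.ToArray();
    }
}

var doc = new XDocument(new XElement("Descrição", "validação"));

byte[] withoutBom = Save(doc, new UTF8Encoding(false));
byte[] withBom = Save(doc, new UTF8Encoding(true));

// The only difference should be the three-byte preamble.
Console.WriteLine(withBom.Length - withoutBom.Length);   // 3
Console.WriteLine(BitConverter.ToString(withBom, 0, 3)); // EF-BB-BF
Console.WriteLine(withoutBom[0] == (byte)'<');           // True
```

The three extra bytes are the same ones reported in the comments when the `StreamReader` round trip was removed from `Serialize()`.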

To write directly to a file instead:

public void Serialize(string path)
{
    using (var stream = File.Create(path))
    {
        WriteXmlToStream(stream);
    }
}
Marc Gravell
  • @heltonbiker side note, and to repeat: please don't format the `DateTime`s yourself: let the xml api worry about the correct format for dates. – Marc Gravell May 19 '15 at 15:11
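
As a small illustration of that side note, the sketch below passes a `DateTime` straight to `XElement` and reads it back with the explicit conversion operator. The timestamp is made up for the example, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Xml.Linq;

// Illustrative value only; any DateTime would do.
var start = new DateTime(2015, 5, 19, 2, 24, 10, DateTimeKind.Utc);

// Passing the DateTime itself lets LINQ to XML pick the XML Schema date format.
var element = new XElement("Início", start);
Console.WriteLine(element.Value); // 2015-05-19T02:24:10Z

// The explicit conversion parses that format back into a DateTime.
var parsed = (DateTime)element;
Console.WriteLine(parsed.ToUniversalTime() == start); // True
```

Because the API handles both directions, there is no custom format string to keep in sync between the writer and the reader.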