I know there are a lot of tutorials about this, and even answered questions here, but I have been trying to resolve this problem for hours. I have read almost everything here, and it still remains a mystery to me. Please help:

I'm creating XML, and it is created, but the problem is that the encoding is UTF-16 and it should be UTF-8. This is what I have tried so far, but the result is still UTF-16:

        var xmlText = new StringBuilder();

        using (var xml = XmlWriter.Create(xmlText))
        {
            xml.WriteStartDocument();
            xml.WriteStartElement("Weather");


            if (model.ModuleList[0] != null)
            {
                foreach (var weather in model.ModuleList)
                {
                    var AddProperty = new Action<XmlWriter, ModuleModel>((a, forc) =>
                    {
                        // Use the writer passed to the lambda (a) throughout,
                        // rather than mixing it with the captured outer writer.
                        a.WriteStartElement("Forecast");
                        a.WriteElementString("Description", forc.Description);
                        a.WriteElementString("Date", forc.Date.ToString());
                        a.WriteElementString("MinTemp", forc.Min_Temp.ToString());
                        a.WriteElementString("MaxTemp", forc.Max_Temp.ToString());
                        a.WriteElementString("Pressure", forc.Pressure.ToString());
                        a.WriteElementString("Humidity", forc.Humidity.ToString());
                        a.WriteEndElement();
                    });
                    AddProperty(xml, weather);
                }
            }             

            xml.WriteEndElement();
            xml.WriteEndDocument();
        }
        var xmlresult = xmlText.ToString();

How do I set the encoding of my XML to UTF-8? Please help...

Edna
  • You're writing the XML into a `StringWriter`. It's somewhat obvious that the encoding of a `StringWriter` would be the native string encoding that C# uses internally, which is UTF-16. Just write your XML to a UTF-8 `Writer`. – millimoose Sep 06 '13 at 23:20
  • So I need to try a different approach... will do, thanks. – Edna Sep 06 '13 at 23:26
  • @millimoose: nitpicking - strings do not have an encoding – MiMo Sep 06 '13 at 23:32
  • @MiMo Seeing how strings ultimately have to be stored in a sequence of bytes, they should have *some* internal encoding, unless C# stores 32-bit code points directly. Since `char` is 2 bytes, which does not cover all of Unicode, I doubt this. – millimoose Sep 06 '13 at 23:33
  • Write it to a [`MemoryStream`](http://msdn.microsoft.com/en-us/library/system.io.memorystream.aspx), and set the encoding explicitly using [`XmlWriterSettings`](http://msdn.microsoft.com/en-us/library/system.xml.xmlwritersettings.aspx) with the [`XmlWriter.Create(Stream, XmlWriterSettings)`](http://msdn.microsoft.com/en-us/library/ms162617.aspx) overload. – millimoose Sep 06 '13 at 23:38
  • @MiMo Quoting [The Skeet](http://csharpindepth.com/Articles/General/Strings.aspx): " a single `char` (`System.Char`) cannot cover every character. This leads to the use of surrogates where characters above U+FFFF are represented in strings as two characters. Essentially, string uses the UTF-16 character encoding form." I'll see if I can show some example code where this leaks through the API. – millimoose Sep 06 '13 at 23:41
  • @millimoose: internally there must surely be some encoding (even a full 32-bit value for each character is an encoding), and maybe it even leaks somewhere, but it is an implementation detail. Encodings appear when you write to a file (or convert to a sequence of bytes), and in this case the default is actually UTF-8, not UTF-16. – MiMo Sep 06 '13 at 23:45
  • @MiMo It really isn't just an implementation detail. For instance, `"\U0001F4A9".Length` is 2, which is incorrect. I also can't really find convenient methods that would let you work with `int` codepoints. I agree that you probably couldn't find enough programmers in the world who really need to care to get into the triple digits, but there you go. – millimoose Sep 06 '13 at 23:48
  • @MiMo For comparison, Python 3 does this correctly: http://ideone.com/NzBtoG (Since Python 3.3 it uses 32-bit code points internally.) Java also uses 16-bit `char`s but it comes with a few methods that let you work with code points directly: http://ideone.com/Gbuygs. The .NET docs are also pretty up-front about using an internal encoding: "Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object." (http://msdn.microsoft.com/en-us/library/system.string.aspx#Characters) – millimoose Sep 07 '13 at 00:03
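
For readers following the comment thread above, here is a minimal sketch of where the UTF-16 representation shows through the `string` API, and of the built-in helpers that do work with whole code points. The class name `SurrogateSketch` and the helper `CountCodePoints` are made up for illustration; `char.IsSurrogatePair`, `char.ConvertToUtf32` and `StringInfo` are existing .NET members.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SurrogateSketch
{
    // Hypothetical helper: counts Unicode code points rather than UTF-16 code units.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A valid high/low surrogate pair is one code point stored in two chars.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        string s = "\U0001F4A9"; // a single code point above U+FFFF

        Console.WriteLine(s.Length);                                // UTF-16 code units
        Console.WriteLine(CountCodePoints(s));                      // code points
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // code point value in hex
        Console.WriteLine(new StringInfo(s).LengthInTextElements);  // text elements
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));           // bytes once encoded as UTF-8
    }
}
```

The point of the sketch is only that `Length` counts 16-bit `char` values, while the byte encoding is chosen at the moment the string is converted to bytes.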

1 Answer

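The result of your code is a string, `xmlresult`, and a string does not have a byte encoding you can choose: it is always Unicode text. The `encoding="utf-16"` you see in the XML declaration is there because `XmlWriter` is writing to a `StringBuilder` and reports the encoding of the in-memory text writer it wraps.

You use an encoding when you convert a string to a sequence of bytes, so your problem is not in the piece of code you posted, but in the code you use to write that string to a file.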

Something like this:

    // false = overwrite the file; appending to an existing XML file would corrupt it
    using (StreamWriter writer = new StreamWriter(fileName, false, Encoding.UTF8))
    {
        writer.Write(xmlresult);
    }

will write a UTF-8 file, where `fileName` contains the path of the file.

If you need UTF-8 encoded bytes in memory use:

    var utf8Bytes = Encoding.UTF8.GetBytes(xmlresult);
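
If the XML declaration itself should also say UTF-8, one option (the one suggested in the comments on the question) is to let `XmlWriter` write straight into a `MemoryStream` with the encoding set through `XmlWriterSettings`. The following is only a rough sketch of that idea: the class name `Utf8XmlSketch`, the method `BuildUtf8Xml` and the `"Sunny"` value are placeholders, not part of the original code.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Utf8XmlSketch
{
    // Hypothetical helper: builds a small document and returns it as UTF-8 bytes.
    public static byte[] BuildUtf8Xml()
    {
        var settings = new XmlWriterSettings
        {
            // UTF8Encoding(false) leaves out the byte order mark;
            // Encoding.UTF8 would write one at the start of the stream.
            Encoding = new UTF8Encoding(false),
            Indent = true
        };

        using (var stream = new MemoryStream())
        {
            using (var xml = XmlWriter.Create(stream, settings))
            {
                xml.WriteStartDocument();
                xml.WriteStartElement("Weather");
                xml.WriteStartElement("Forecast");
                xml.WriteElementString("Description", "Sunny"); // placeholder value
                xml.WriteEndElement();
                xml.WriteEndElement();
                xml.WriteEndDocument();
            } // disposing the writer flushes everything into the stream

            return stream.ToArray();
        }
    }

    public static void Main()
    {
        byte[] bytes = BuildUtf8Xml();
        Console.WriteLine(Encoding.UTF8.GetString(bytes));
    }
}
```

Because the writer knows its target encoding up front, the declaration and the actual bytes agree. For an `xml` column in SQL Server this is probably not needed at all, since the column type handles the encoding itself.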
MiMo
  • I actually need the XML in UTF-8 to store it in a database. I'll try a different approach then. I'm doing this for the first time, so everything is new to me... I appreciate your help, thanks. – Edna Sep 06 '13 at 23:26
  • Can you please tell me what is fileName in your example above? – Edna Sep 06 '13 at 23:28
  • What is the data type of the database column you need to write the XML to? Which database are you using? – MiMo Sep 06 '13 at 23:31
  • @user2710923 Most databases should support Unicode strings / text fields. – millimoose Sep 06 '13 at 23:31
  • I'm using SQL Server, and I set the type of this column to XML. – Edna Sep 06 '13 at 23:32
  • The XML type in SQL Server handles the encoding internally. You don't have to do any conversion, and you don't even need to build a string: you can use an `XmlReader` directly. – MiMo Sep 06 '13 at 23:39
  • Right now, for this column in my database (when I run SELECT * FROM...), I get System.Xml.XmlDocument. How can I see which elements it contains so I can check that it's OK? – Edna Sep 07 '13 at 00:39
  • http://stackoverflow.com/questions/4815836/how-do-you-read-xml-column-in-sql-server-2008 – MiMo Sep 07 '13 at 03:00