This method writes out an XML file (work-specific). I have everything writing out exactly as I want it, except for the encoding: I set it to write the file with UTF-8 (no BOM) encoding.

The XML declaration says UTF-8, but when I open the file in Notepad++, it shows as encoded in ANSI.

        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.Encoding = new UTF8Encoding(false);   // false = UTF-8 without a BOM
        settings.NewLineOnAttributes = true;

        using (var xmlWriter = XmlWriter.Create(@"c:\temp\myUIPB.xml", settings))
        {
            xmlWriter.WriteStartDocument();
            xmlWriter.WriteStartElement("UIScript");

            // Write Event Nodes
            foreach (var eventNode in listBoxOutput.Items)
            {
                lbEvent myNode = (lbEvent)eventNode;
                XmlNode xn = myNode.workflowEvent;
                xn.WriteTo(xmlWriter);
            }

            xmlWriter.WriteFullEndElement();
            xmlWriter.WriteEndDocument();
            xmlWriter.Flush();
            xmlWriter.Close();   // redundant inside the using block, but harmless
        }

I would expect that if I set it to output UTF-8, the file it writes out would indeed be encoded in UTF-8 rather than ANSI.
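
One way to confirm what actually landed on disk (a sketch added for illustration, reusing the output path from the code above) is to inspect the raw bytes:

    // Requires: using System; using System.IO; using System.Linq;
    byte[] bytes = File.ReadAllBytes(@"c:\temp\myUIPB.xml");
    bool hasBom = bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    Console.WriteLine("BOM present: " + hasBom);                         // false with UTF8Encoding(false)
    Console.WriteLine("Non-ASCII bytes: " + bytes.Any(b => b >= 0x80));  // false if content is ASCII-only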

Thoughts? Help?

kingtermite
  • Possible duplicate of [XmlWriter encoding UTF-8 using StringWriter in C#](https://stackoverflow.com/questions/42583299/xmlwriter-encoding-utf-8-using-stringwriter-in-c-sharp) – MethodMan Dec 02 '17 at 00:45
  • You chose to omit the [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) by using [`new UTF8Encoding(false)`](https://msdn.microsoft.com/en-us/library/s064f8w2(v=vs.110).aspx). Maybe the XML file really is encoded in UTF-8 but Notepad++ is guessing wrongly due to the missing BOM? What happens if you try to emit a Kanji character from a supplementary Unicode plane? Is it correctly encoded, or escaped? – dbc Dec 02 '17 at 00:57
  • In fact I found [this](https://github.com/adobe/brackets/issues/10583#issuecomment-168409391) on github which seems relevant: *Notepad++ has no way of knowing the content encoding so it has to guess. It sees only ASCII so it assumes the lowest common denominator (which basically on Windows is ASCII + foreign language extensions, e.g. Windows-1252).* – dbc Dec 02 '17 at 01:03
  • Thanks. I did wonder if Notepad++ was getting it wrong, so I cross-checked it in Windows Notepad. It also gave "ANSI". – kingtermite Dec 02 '17 at 01:07
  • Does your file have any characters in it that **aren't** ANSI? – mjwills Dec 02 '17 at 01:08
  • I don't believe so. The file is being created from nodes read in from another XML file (which was encoded in UTF-8). In fact, this output file is just a subset of the nodes it read from that file. – kingtermite Dec 04 '17 at 19:22

1 Answer

A file encoded as UTF-8 without a BOM and an ASCII-encoded file look identical if they contain only Latin characters and numbers.
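
A minimal sketch illustrating the point (the sample string here is made up for illustration):

    // Requires: using System; using System.Linq; using System.Text;
    string sample = "<UIScript>plain ASCII content</UIScript>";
    byte[] utf8NoBom = new UTF8Encoding(false).GetBytes(sample);
    byte[] ascii = Encoding.ASCII.GetBytes(sample);
    Console.WriteLine(utf8NoBom.SequenceEqual(ascii));   // True: byte-for-byte identical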

A generic text-editing program (like Notepad or Notepad++) will not be able to guess the encoding the way you'd like (unless you provide some hint, usually via an "Open with encoding" option when opening the file).

Compliant XML parsers use the `encoding` part of the XML declaration (`<?xml version="1.0" encoding="UTF-8"?>`) to detect the correct encoding for files without a BOM. In your case you are likely getting the correct declaration, so a compliant XML parser will open the file correctly.
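
A small sketch of that behavior (assuming the output path from the question); `XmlDocument` reads the declaration rather than relying on byte sniffing:

    // Requires: using System; using System.Xml;
    XmlDocument doc = new XmlDocument();
    doc.Load(@"c:\temp\myUIPB.xml");                  // picks up the encoding from the XML declaration
    XmlDeclaration decl = doc.FirstChild as XmlDeclaration;
    Console.WriteLine(decl != null ? decl.Encoding : "(no declaration)");  // prints "utf-8"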

If you need all programs to detect UTF-8 correctly, emit a BOM by passing `true` to the encoding's constructor.
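
Applied to the code in the question, that is a one-line change:

    settings.Encoding = new UTF8Encoding(true);   // true = emit the UTF-8 BOM (EF BB BF)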

Note that without a BOM, even a file containing characters with codes above 128 may have its encoding detected incorrectly.

Alexei Levenkov
  • XML has rules about determining the declared encoding. But a normal text editor doesn't know about XML. Notepad++ plugins do have a lot of XML features, but they don't appear to help Notepad++ itself use the declared encoding. – Tom Blodget Dec 04 '17 at 00:13
  • Thanks. I know the system that will be reading the file I'm outputting will not read it correctly if it's not UTF-8 (without BOM). It shouldn't be using any characters with codes above 128. It's pretty simple stuff. – kingtermite Dec 04 '17 at 19:24
  • Also... with regard to giving it a 'hint': I *thought* that's what setting the encoding to UTF-8 without BOM was doing. Is there any other way I can give it a hint? – kingtermite Dec 04 '17 at 19:27
  • @kingtermite if whatever tool you are generating files for can't handle a BOM, there is a good chance that it will ignore the encoding in the XML declaration too (as Tom Blodget commented, `encoding="..."` is the way a compliant XML parser should pick the encoding if there is no BOM)... So as long as whatever output you get is parsed correctly by the tool you are targeting, it is fine... (I'll edit the post to clarify the XML vs. text difference) – Alexei Levenkov Dec 04 '17 at 19:30
  • Thank you Alexei. You were right. The file was actually fine all along, and the system read it since it was compliant. I had not even tested it in the system, as my prior experience led me to believe that if the encoding didn't specifically show "UTF-8" (without BOM) it would not work. – kingtermite Dec 04 '17 at 20:28