
I am attempting to modify a utf-16 encoded XML file in C# (specifically Unity 2017.4.33f1).

EDIT: Turns out the original file specified a utf-8 encoding!

I am loading the document using this code:

using (FileStream fileStream = new FileStream(inPath, FileMode.Open, FileAccess.Read))
{
   _Document = XDocument.Load(fileStream);
}

When inspecting the object from a debugger, the XDocument seems to have loaded the declaration of the document as UTF-8, even though the original document specifies UTF-16.

[Image: debugger view of the XDocument]

Why is this happening? Is there any way to stop the XDocument from changing the encoding when loading a file?

Yuzu
  • When loaded into memory, the content is always in UTF-16. The declared encoding comes into play only when serializing to storage. And it's entirely possible to declare one encoding in the XML header and then save the XML to a file with another encoding (you should not do that, but you can). What encoding does your file actually contain? – GSerg Oct 17 '19 at 19:45
  • @GSerg Ah! Right you are. The original file is indeed utf-8 encoded. The fact that XDocument reports a UTF-16 encoding via ToString() is confusing. – Yuzu Oct 17 '19 at 19:50
  • @GSerg Would you mind posting your comment as an answer so I could mark it as the answer? – Yuzu Oct 17 '19 at 19:55
  • Add fileStream.ReadLine() before using the Load method. The XDocument does not like the utf-16 and does not need the xml identification line. I've done this plenty of times in the past. – jdweng Oct 17 '19 at 20:50
  • @jdweng What you are saying is that plenty of times in the past you have done the wrong thing instead of fixing your XML files saved with the wrong encoding. That is not good advice. – GSerg Oct 17 '19 at 20:54
  • @Yuzu I believe my very brief comment is not worth becoming an answer. If anything, https://stackoverflow.com/a/16404493/11683 would be a much more detailed answer. – GSerg Oct 17 '19 at 21:01
  • It is not wrong according to the XML specification. The Net library is wrong and ReadLine() is the easiest way of getting rid of the error. – jdweng Oct 17 '19 at 21:04
  • Possibly a duplicate of [Force XDocument to write to String with UTF-8 encoding](https://stackoverflow.com/q/3871738/3744182) or [XDocument XDeclaration not appearing in ToString result](https://stackoverflow.com/q/28183461/3744182). Agree? – dbc Oct 18 '19 at 01:06
  • @jdweng It is wrong to save XML that declares itself as utf-16 into a file using utf-8 encoding. Not only is it wrong according to common sense, it is also wrong according to the XML specification. As quoted in the [mentioned answer](https://stackoverflow.com/a/16404493/11683), *"it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration"*. – GSerg Oct 18 '19 at 07:53
  • Things have changed since 2013. More people are using xml with utf-16. I have seen lots of cases where the Net library is not working correctly and handling xml with utf-16 is one of them. – jdweng Oct 18 '19 at 08:44
  • @jdweng So a long time ago, you had an invalid XML file and you could not figure out what was wrong and how to properly fix it. Instead you became certain that [.NET is broken](https://blog.codinghorror.com/the-first-rule-of-programming-its-always-your-fault/), and came up with a "fix". It is perfectly fine to save XML in utf-16, provided both the XML declaration and the [file encoding](http://www.joelonsoftware.com/articles/Unicode.html) are indeed utf-16. I strongly suggest that you visit both links instead of passing along the advice to do the wrong thing. – GSerg Oct 18 '19 at 20:16
  • @GSerg I disagree. I think it's important to note that the encoding that will be returned by .NET if you ToString an XDocument is always UTF-16, regardless of the encoding the original document specified. That's the exact kind of missing documentation I come to SO for. I'm not sure how the answer you linked is relevant? If anything, shouldn't XDocument internally represent XML as UTF-8 if UTF-8 is the encoding standard for XML? – Yuzu Oct 21 '19 at 19:04
  • @jdweng: definitely agreeing with GSerg here. Ditching a [well-defined part of the XML specification](https://www.w3.org/TR/xml/#sec-prolog-dtd) is not acceptable for any piece of software. – Yuzu Oct 21 '19 at 19:16
  • @dbc: Disagree. I specified an encoding (via the XML declaration in the file) and ToString is writing a different encoding than the one I specified. – Yuzu Oct 21 '19 at 19:18
  • @Yuzu All strings in .NET are UTF-16 like I noted in the first comment. `ToString()` returns a string, so it has to be in UTF-16. Because this is a global fact that applies to all strings and not just XML, it is [documented in an appropriate place](https://learn.microsoft.com/en-us/dotnet/api/system.string?view=netframework-4.8). After `ToString` has returned, the generated string value has no way to know that it was created from an XML, it's just a string, and as such when you save it somewhere, it cannot enforce another encoding like `xmlDoc.Save` would be able to. – GSerg Oct 21 '19 at 19:31
  • @Yuzu - this is Microsoft's design intent. When writing an `XDocument` the **encoding of the `XmlWriter`** is used to generate the XML declaration, not any pre-existing `XDeclaration` left around from when the `XDocument` was read in. If you don't want that you will need to create an extension method on `XDocument` that generates an XML string with your preferred XML declaration. Do you want that? I think we can tell you how to do that, if that's what you want. (Of course that means the XML declaration will be inconsistent with the actual encoding...) – dbc Oct 21 '19 at 19:36
  • @GSerg You did not note that in your original comment. There was no link to that particular piece of .NET documentation. You said "When loaded into memory, the content is always in UTF-16", which is not the same as "You are asking .NET to generate a string representation of that XDocument, [strings in .NET are always UTF-16](https://learn.microsoft.com/en-us/dotnet/api/system.string?view=netframework-4.8), therefore it renders it as a UTF-16 string and changes the encoding to match." This answers my question. Why do you not want to report this as an answer? – Yuzu Oct 25 '19 at 18:26
  • Well, XML content is a collection of pieces of text, each of which is a `string`, and those are always in UTF-16. I'm reluctant to post an answer because I don't know *for sure* what happens and why, whether it is documented or is a bug, and whether it is reliable or may change in the future. E.g. the answer I'm referring to explains that it is a fatal error to have different encodings in the file and in the XML declaration, yet the .NET parser *evidently* chooses to load it anyway and, *evidently*, slips in a modified declaration node. I cannot comment if that is correct and stable behaviour. – GSerg Oct 25 '19 at 19:42
  • @dbc So, someone has answered my question, but I'll clarify why this is confusing to me: `XDocument` correctly reports that the encoding of the document I loaded was UTF-8. I expect that `ToString()` returns a representation of the object I am calling it on. `XDocument.ToString()` re-encodes my XML document to UTF-16 and outputs that. If I had an `Image` object, loaded a PNG into it, called `ToString()` and the output string told me the encoding of the image I loaded was RGBA32, I'd be confused. – Yuzu Oct 25 '19 at 19:48
  • @GSerg Ok, that's fair. I'll provide an answer for now. Thank you for your input though! It helped a lot. – Yuzu Oct 25 '19 at 19:49

1 Answer


tl;dr: Use XDocument.Save() and its overloads

Based on the discussion in the comments on the question, this seems to be the behavior of the .NET implementation in Unity 2017.4.33f1:

XDocument.ToString() returns the XML as a string and, in doing so, changes the in-document encoding declaration to utf-16, regardless of the encoding specified in the object or the source file. .NET strings are always UTF-16 encoded, so this is the likely source of the behavior. The output is valid XML, but it does not accurately reflect the XDocument object that ToString() was called on. This means that code like:

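// Anti-pattern: the string from ToString() declares utf-16,
// but the file is written using the originally declared encoding.
XDocument doc = XDocument.Load(path);
System.Text.Encoding enc = System.Text.Encoding.GetEncoding(doc.Declaration.Encoding);
System.IO.File.WriteAllText(path, doc.ToString(), enc);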

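will write invalid XML if the document was not originally UTF-16 encoded, because the declaration inside the text then names utf-16 while the bytes on disk use a different encoding.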

XDocument.Save(string path) respects the encoding specified in XDocument.Declaration and will save the file with that encoding.
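
A minimal sketch of the two approaches, under some assumptions: `Utf8StringWriter`, `XmlEncodingSketch`, `RoundTrip` and `ToUtf8DeclaredString` are hypothetical names made up for illustration, and the string variant relies on the behavior dbc describes in the comments (the writer's encoding drives the generated declaration). I have not verified the string variant on Unity's runtime specifically.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper (not part of .NET): a StringWriter that reports UTF-8,
// so XDocument.Save(TextWriter) should emit encoding="utf-8" in the declaration.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class XmlEncodingSketch
{
    // Load, modify, and save through XDocument.Save(string), which picks the
    // file encoding from doc.Declaration so declaration and bytes stay in sync.
    public static void RoundTrip(string inPath, string outPath)
    {
        XDocument doc = XDocument.Load(inPath);
        // ... modify doc here ...
        doc.Save(outPath);
    }

    // Only if a string is really needed: serialize through a writer whose
    // Encoding property is UTF-8 instead of calling ToString().
    public static string ToUtf8DeclaredString(XDocument doc)
    {
        using (var writer = new Utf8StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

Calling `XmlEncodingSketch.RoundTrip(inPath, outPath)` covers the common case. Note that the string returned by `ToUtf8DeclaredString` is still UTF-16 in memory; it only matches its own declaration once it is written out as UTF-8.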

Yuzu