I've been struggling with a problem for a few days and have finally worked out what's going wrong, but I've only been able to find contradictory answers on StackOverflow (et al.), so I'd like to ask for an explanation of what's going on.

For example, this link (in common with many other references, for example this one, or these seemingly go-to references on the topic by Jon Skeet here and here) states that "A string in C# is always UTF-16 [Unicode?], there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...)."

The much-simplified test case I've built to demonstrate my issue is below. It's probably not copy-paste reproducible, as it depends on some of the strings having a different encoding, but believe me, the test passes as written. I'm using VS2012 Update 4.

The oddity is that the following two lines pass.

Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);

The identical strings fail equivalency as they are encoded differently (copiedFromXmlDoubleQuote had the \ replaced by " in the editor).

All this suggests that the Visual Studio editor is encoding aware, and that the strings the code declares are also encoding aware. My question is: have I done something stupid, or can anyone concur with my findings and, if possible, refer me to something that will help clarify the story with string encoding equivalence? As I'm going to be working in an XML world a lot, is it best practice to explicitly convert everything to Unicode at the point of deserialization and re-encode it as required when serializing out again?

[TestMethod]
public void EscapedCharacterDoesNotEqualLiteralString()
{
  string actual = "\"";
  Assert.AreEqual("\"", actual);
  Assert.AreEqual(@"""", actual);
  string typedEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
  string typedDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
  Assert.IsTrue(typedDoubleQuote == typedEscapedQuote);
  Assert.AreEqual(typedDoubleQuote, typedEscapedQuote);
  string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
  string copiedFromXmlDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
  Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
  Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);
  Assert.IsTrue(copiedFromXmlDoubleQuote.ToUnicode() == copiedFromXmlEscapedQuote.ToUnicode());
  Assert.AreEqual(copiedFromXmlDoubleQuote.ToUnicode(), copiedFromXmlEscapedQuote.ToUnicode());
}

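private static string BytesToString(byte[] bytes, Encoding encoding)
{
  // Decode the bytes back into a string via a StreamReader.
  // Note: by default StreamReader detects and skips a leading byte order mark.
  using (MemoryStream ms = new MemoryStream(bytes))
  using (StreamReader sr = new StreamReader(ms, encoding))
  {
    // The using blocks dispose the reader, so no explicit Close() is needed.
    return sr.ReadToEnd();
  }
}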

public static string ToUnicode(this string s)
{
  return BytesToString(new UnicodeEncoding().GetBytes(s), Encoding.Unicode);
}

I've loaded an example Vs2012 sln in a zip here

9swampy
  • "All this suggests that the Visual Studio editor is encoding aware, and the strings that the code declares are also encoding aware." I'm sure the latter is not the case. I think it's much more likely that there's something odd in the string you've copy/pasted. It's certainly true that the Visual Studio editor has to know what encoding your source code is in - after all, it's stored in a file. I suspect there's an unprintable character in the text you've copied/pasted, e.g. a byte order mark. – Jon Skeet May 01 '14 at 14:11
  • Oops - thought I'd reproduced it, but I haven't. Can you reproduce it by copying/pasting *from this question* into Visual Studio? Can you put a source file up somewhere that allows us to reproduce it? (Ideally as just a tiny console app rather than a test, just for simplicity.) – Jon Skeet May 01 '14 at 14:16
  • I can give instructions for the next step to debugging the issue, but it'll take more than I can really fit in a comment. Would you like it as an answer? I'm not sure it's *really* an answer, in that it's just the next step, but it should help anyway... – Jon Skeet May 01 '14 at 14:20
  • I can't reproduce it either. This really looks like an issue with the literals. It is certainly nothing to do with the `Encoding`. As you suggested, encodings are only considered when (de-)serialising. – Gusdor May 01 '14 at 14:22
  • I've saved a zip of a console app example [here](https://spideroak.com/storage/OBZGS3LTONYGSZDFOJXWC2Y/shared/830822-1-1002/StringEquivalenceWithEncoding.zip?5e9d1bd42938ecde99014acec4226421). Help explaining what's gone wrong much appreciated as I'd like to go back to my safe world of having confidence that a string's just a string... – 9swampy May 01 '14 at 19:08

1 Answer

My initial check of your ZIP file shows that

   static string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
   static string copiedFromXmlDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";

   ? copiedFromXmlEscapedQuote.Length
   39
   ? copiedFromXmlDoubleQuote.Length
   40

One of the first checks for string equality in the .NET Framework is a length check: it doesn't bother comparing the content if the strings are different lengths.
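
As a minimal sketch of that effect, assuming the only difference between the two literals is one invisible leading character (the strings here are built by hand, and `LengthCheckDemo` is an illustrative name, not something from the ZIP):

```csharp
using System;

public static class LengthCheckDemo
{
    public static void Main()
    {
        string plain = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
        // Assumption: the pasted literal differs only by an invisible leading U+FEFF.
        string withBom = "\uFEFF" + plain;

        Console.WriteLine(plain.Length);     // 39
        Console.WriteLine(withBom.Length);   // 40
        // == on strings is an ordinal comparison, so the extra char makes them unequal.
        Console.WriteLine(plain == withBom); // False
    }
}
```

The two strings look identical when printed or viewed in the editor, which is why the failure is so hard to spot by eye.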

Further checking:

   ? copiedFromXmlDoubleQuote.Last()
   62 '>'
   ? copiedFromXmlEscapedQuote.Last()
   62 '>'
   ? copiedFromXmlEscapedQuote.First()
   60 '<'
   ? copiedFromXmlDoubleQuote.First()
   65279 ''

So it's the first char which is different. The value 65279 (U+FEFF, the byte order mark) is covered in this article: "What is this char? 65279 ''".
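
If the immediate window isn't handy, a small helper along these lines can make hidden characters visible. This is only a sketch: `CodeUnitDump` and `Describe` are hypothetical names, and the `suspect` string is a hand-built stand-in for the pasted literal.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Formats each UTF-16 code unit as U+XXXX so invisible characters show up.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // Assumption: a shortened stand-in for the literal pasted from the XML file.
        string suspect = "\uFEFF<?xml";
        Console.WriteLine(Describe(suspect));
        // U+FEFF U+003C U+003F U+0078 U+006D U+006C
    }
}
```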

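It seems you are partly correct: the VS.net editor is preserving the BOM char that came along with the pasted text, and opening the program file in the binary editor shows the two literals are different. The @ prefix itself is unlikely to be the cause, though. The compiler decodes the whole source file with a single encoding, so the U+FEFF is simply an invisible character sitting inside the verbatim literal.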
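
This would also explain why the `ToUnicode()` assertions in the question pass. The round trip goes through a `StreamReader`, which by default detects a leading byte order mark and drops it, so the extra character disappears. A sketch of that behaviour, with `BomRoundTrip` and `RoundTrip` as illustrative names and hand-built strings standing in for the pasted literals:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomRoundTrip
{
    // Same shape as the question's ToUnicode helper:
    // encode to UTF-16 bytes, then read them back with a StreamReader.
    public static string RoundTrip(string s)
    {
        byte[] bytes = new UnicodeEncoding().GetBytes(s);
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.Unicode))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string plain = "<?xml version=\"1.0\"?>";
        // Assumption: the pasted literal carries one leading U+FEFF.
        string withBom = "\uFEFF" + plain;

        // The reader treats the leading FF FE bytes as a BOM and skips them.
        Console.WriteLine(RoundTrip(withBom) == plain);          // True
        // A more direct way to normalise the string in memory.
        Console.WriteLine(withBom.TrimStart('\uFEFF') == plain); // True
    }
}
```

So the strings are not "encoded differently" in memory. One of them just contains an extra character, and trimming it (or fixing the literal in the source file) is enough.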

PhillipH
  • Christ, that really is grisly. I would hope VS would prevent you from copying non-spacing characters into the editor. Presumably this only happens with verbatim string literals. – Gusdor May 02 '14 at 12:35
  • Thx all for working this out. Hopefully this education will fix the original problem much simplified for the OP... – 9swampy May 02 '14 at 16:03