We have some unit tests that check for a UTF-8 byte order mark (BOM) at the start of an XML string before it's loaded into an XmlDocument. Everything works fine on Windows 7 64-bit, but a number of tests fail when run under Windows 10 64-bit.

After a bit of investigation, we found that on Windows 10 the XML string is getting pruned (the check reports that the preamble exists), while on Windows 7 it is not.

Here is the code snippet:

    public static string PruneUtf8ByteMark(string xmlString)
    {
        var byteOrderMarking = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
        if (xmlString.StartsWith(byteOrderMarking))
        {
            xmlString = xmlString.Remove(0, byteOrderMarking.Length);
        }

        return xmlString;
    }

StartsWith returns true on Windows 10 and false on Windows 7. The same XML string is used in both cases; the only difference is the OS.

Any ideas? We are a bit lost here, since both PCs are x64 running the same .NET version.

edit: The string comes from a class via:

public static string XmlString = "<?xml version=\"1.0\"....

On Windows 10, the less-than sign gets removed because the byte order mark check returns true.

d.moncada

1 Answer

The problem is caused by culture-sensitive comparison.

The byteOrderMarking string is a single zero-width character (U+FEFF), so a culture-sensitive comparison can ignore it.

See the following cases:

"".StartsWith("") // = true
"aa".StartsWith("") // = true 
"aa".StartsWith("", StringComparison.Ordinal) // = true

So every string starts with an empty string. Now with byteOrderMarking:

var byteOrderMarking = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
byteOrderMarking.Equals("") // = False
byteOrderMarking.Equals("", StringComparison.CurrentCulture) // = True
byteOrderMarking.Equals("", StringComparison.Ordinal) // = False

Now we can see that byteOrderMarking is equal to an empty string only with a CurrentCulture comparison. StartsWith(string) without a StringComparison argument uses the current culture, so checking whether a string starts with byteOrderMarking is like checking whether it starts with an empty string.

The difference between Ordinal and CurrentCulture is that the first compares the strings' UTF-16 code units one by one, whereas the second applies the linguistic rules of the current culture, which can treat some characters as ignorable.
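To make the difference concrete, here is a small sketch (the class and method names are made up for illustration). It decodes the preamble and runs both comparisons against a string that clearly does not begin with a BOM. Only the ordinal result is annotated, because the culture-sensitive result is exactly what varies between machines.

```csharp
using System;
using System.Text;

public static class BomComparisonDemo
{
    // Hypothetical helper: decodes the UTF-8 preamble bytes (EF BB BF)
    // into a string, which is the single character U+FEFF.
    public static string GetBomString()
    {
        return Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
    }

    public static void Main()
    {
        string bom = GetBomString();
        const string xml = "<?xml version=\"1.0\"?>";

        // Ordinal compares UTF-16 code units, so '<' never matches U+FEFF.
        Console.WriteLine(xml.StartsWith(bom, StringComparison.Ordinal)); // False

        // Culture-sensitive: depends on the collation data of the OS/runtime,
        // so no expected value is given here.
        Console.WriteLine(xml.StartsWith(bom, StringComparison.CurrentCulture));
    }
}
```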

Lastly, I suggest always using Ordinal (or OrdinalIgnoreCase) when comparing technical strings.
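Applied to the method from the question, that suggestion might look like the sketch below. The enclosing class name XmlStringHelper is an assumption (the question does not show it); the only behavioral change is the explicit StringComparison.Ordinal argument.

```csharp
using System;
using System.Text;

// Hypothetical container class; the question does not name the original one.
public static class XmlStringHelper
{
    public static string PruneUtf8ByteMark(string xmlString)
    {
        // The decoded UTF-8 preamble is the single character U+FEFF.
        string byteOrderMarking = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

        // Ordinal comparison matches only a literal U+FEFF at index 0,
        // independent of culture or OS collation data.
        if (xmlString.StartsWith(byteOrderMarking, StringComparison.Ordinal))
        {
            xmlString = xmlString.Remove(0, byteOrderMarking.Length);
        }

        return xmlString;
    }
}
```

With this version, a string that begins with '<' is returned unchanged, and only a string that really starts with U+FEFF has that one character removed.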

Kalten
  • Thanks for the answer. Yeah, I understand why to use Ordinal, but I still do not understand why the string comparison then is different across OS versions. – d.moncada Feb 06 '17 at 23:00
  • Windows 10 added many newly supported languages, which may be related. The .NET Framework often depends on Windows APIs, so if the OS changes, the framework can be affected. – Kalten Feb 06 '17 at 23:09