Edit: I originally thought this was related to .NET Framework 4.5. It turns out it applies to .NET Framework 4.0 as well.

There's a change in how strings are handled in Windows Server 2012 which I'm trying to understand better. It seems like the behavior of StartsWith has changed. The issue is reproducible using both .NET Framework 4.0 and 4.5.

With .NET Framework 4.5 on Windows 7, the program below prints "False" and then "t". On Windows Server 2012, it prints "True" and then "t" instead.

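using System;
using System.Text;

internal class Program
{
   private static void Main(string[] args)
   {
      // Decode the UTF-8 preamble bytes (EF BB BF) into the single character U+FEFF.
      string byteOrderMark = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
      // Culture-sensitive StartsWith: this is the call whose result differs between OS versions.
      Console.WriteLine("test".StartsWith(byteOrderMark));
      Console.WriteLine("test"[0]);
   }
}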

In other words, StartsWith(byteOrderMark) returns true regardless of the string content. If you have code that attempts to strip the byte order mark using the following method, it will work fine on Windows 7 (printing "Test") but will print "est" on Windows Server 2012.

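using System;
using System.Text;

internal class Program
{
  private static void Main(string[] args)
  {
     string byteOrderMark = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
     string someString = "Test";

     // Culture-sensitive check: on Windows Server 2012 this is true even without a BOM,
     // so the first real character gets removed.
     if (someString.StartsWith(byteOrderMark))
        someString = someString.Substring(1);

     Console.WriteLine("{0}", someString);
     Console.ReadKey();
  }
}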

I realize that you have already done something wrong if you have byte order marks in a string, but we're integrating with legacy code which has this. I know I can solve this specific issue with something like the line below, but I want to understand the problem better.

someString = someString.Trim(byteOrderMark[0]);

Hans Passant suggested using the constructor of UTF8Encoding which lets me tell it explicitly to emit the UTF-8 identifier. I tried this, but it gives the same result. The code below differs in output between Windows 7 and Windows Server 2012. On Windows 7, it prints "Result: False". On Windows Server 2012, it prints "Result: True".

  private static void Main(string[] args)
  {
     var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
     string byteOrderMark = encoding.GetString(encoding.GetPreamble());
     Console.WriteLine("Result: " + "Hello".StartsWith(byteOrderMark));
     Console.ReadKey();
  }

I've also tried the following variant, which prints False, False, False on Windows 7 but True, True, False on Windows Server 2012, which confirms it's related to the implementation of StartsWith on Windows Server 2012.

  private static void Main(string[] args)
  {
     var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
     string byteOrderMark = encoding.GetString(encoding.GetPreamble());
     Console.WriteLine("Hello".StartsWith(byteOrderMark));
     Console.WriteLine("Hello".StartsWith('\ufeff'.ToString()));
     Console.WriteLine("Hello"[0] == '\ufeff');

     Console.ReadKey();
  }
Nitramk
  • I wouldn't even use `Trim` - if you're only worried about the *first character* then just checking whether `text[0] == '\ufeff'` would be good enough (with suitable handling for an empty string). Does seem odd though. – Jon Skeet Oct 21 '13 at 13:07
  • Sure, that would work better than Trim. I assume TrimStart would work well too. Still, I'm mostly trying to understand why this changed in the first place. Many of the up-voted answers on this site suggest checking with StartsWith() first, and that code will break when run on Windows Server 2012 with .NET Framework 4.5. Example here: http://stackoverflow.com/questions/1317700/strip-byte-order-mark-from-string-in-c-sharp. – Nitramk Oct 21 '13 at 13:21

1 Answer

Turns out I could repro this, running the test program on Windows 8.1. It is in the same "family" as Server 2012.

The most likely source of the problem is that the culture-sensitive comparison rules have changed. They can be, erm, flaky and can have odd outcomes on these kinds of characters. The BOM (U+FEFF) is a zero-width no-break space, and if the comparison ignores it, the prefix being tested is effectively empty. Reasoning this out requires the same kind of mental gymnastics as understanding why "abc".StartsWith("") returns true :)
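To make that reasoning concrete, here is a minimal sketch that puts the culture-sensitive and ordinal comparisons side by side. The class name `BomComparisonDemo` and the helper `OrdinalStartsWithBom` are made up for this illustration. The culture-sensitive results depend on the OS and runtime version, so no expected output is given for them; only the ordinal results are fixed.

```csharp
using System;

internal static class BomComparisonDemo
{
    // Hypothetical helper for this sketch: ordinal (code unit by code unit) prefix test.
    internal static bool OrdinalStartsWithBom(string text)
    {
        return text.StartsWith("\ufeff", StringComparison.Ordinal);
    }

    private static void Main()
    {
        const string bom = "\ufeff";

        // Culture-sensitive: where U+FEFF is treated as ignorable, the BOM compares
        // equal to the empty string (0) and every string "starts with" it.
        // The result varies by OS and runtime version.
        Console.WriteLine(string.Compare(bom, string.Empty, StringComparison.CurrentCulture));
        Console.WriteLine("Hello".StartsWith(bom, StringComparison.CurrentCulture));

        // Ordinal: U+FEFF is an ordinary UTF-16 code unit, so it is never ignored.
        Console.WriteLine(string.CompareOrdinal(bom, string.Empty) != 0); // prints "True"
        Console.WriteLine(OrdinalStartsWithBom("Hello"));                 // prints "False"
    }
}
```

If the first line prints 0 on a given machine, the second should print True there, which matches the behavior reported in the question.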

You need to solve your problem by using StringComparison.Ordinal. This produced False, False, False:

private static void Main(string[] args) {
    var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
    string byteOrderMark = encoding.GetString(encoding.GetPreamble());
    Console.WriteLine("Hello".StartsWith(byteOrderMark, StringComparison.Ordinal));
    Console.WriteLine("Hello".StartsWith("\ufeff", StringComparison.Ordinal));
    Console.WriteLine("Hello"[0] == '\ufeff');
    Console.ReadKey();
}
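For the original goal of stripping a leading BOM, one possible sketch follows, along the lines of the first-character check suggested in the comments on the question. The names `BomHelper` and `StripLeadingBom` are made up for this example and are not part of the framework. It compares the first `char` directly, so no culture rules are involved.

```csharp
using System;

internal static class BomHelper
{
    private const char ByteOrderMark = '\ufeff';

    // Hypothetical helper: removes a single leading U+FEFF, if present.
    // Null and empty strings are returned unchanged.
    internal static string StripLeadingBom(string text)
    {
        if (string.IsNullOrEmpty(text))
            return text;

        // A char comparison is ordinal by definition, so the result
        // does not depend on the OS or the current culture.
        return text[0] == ByteOrderMark ? text.Substring(1) : text;
    }

    private static void Main()
    {
        Console.WriteLine(StripLeadingBom("\ufeffTest")); // prints "Test"
        Console.WriteLine(StripLeadingBom("Test"));       // prints "Test"
    }
}
```

This should behave the same as `StartsWith(byteOrderMark, StringComparison.Ordinal)` followed by `Substring(1)`, while also covering the empty-string case.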
Hans Passant
  • It turns out the issue is reproducible for me with .NET Framework 4.0 as well. I updated my main post to include a new snippet based on your suggestion, but it gives different behavior between Windows 7 and Windows Server 2012. I've also added a snippet where I hard-code \ufeff and pass it to StartsWith, and the behavior differs between Windows 7 and Windows Server 2012. I agree that this is weird, but it's what I'm seeing. – Nitramk Oct 21 '13 at 13:53
  • You guys saved my bacon! BTW I was able to get away with simply `string byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());` on Windows Server 2012, followed by your crucial `StartsWith(byteOrderMarkUtf8, StringComparison.Ordinal)`. – snark May 16 '16 at 11:07