Possible Duplicate:
Determine a string's encoding in C#

I believe that a string I create defaults to UTF-8. However, if the string is created elsewhere and I want to be extra safe and check its encoding before working with it, I do not see any easy way to do that using the string or Encoding class. Am I missing something, or is a C# string always UTF-8 no matter what?

Rodney S. Foley
  • Where did you get the idea that strings *have* an encoding or that it defaults to UTF-8? See my answer for more, but I was just wondering where you got that impression... – Jon Skeet Aug 10 '11 at 17:05

1 Answer

Strings in C# (well, .NET) effectively don't have an encoding... or you can view them all as UTF-16, given that they're a sequence of char values, which are UTF-16 code units.
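To make the "sequence of UTF-16 code units" point concrete, here is a minimal sketch. The class name CodeUnitDemo and the helper CountCodePoints are made up for illustration; only char.IsHighSurrogate, char.IsLowSurrogate, and char.ConvertToUtf32 are framework calls.

```csharp
using System;

public static class CodeUnitDemo
{
    // Hypothetical helper: counts Unicode code points by treating each
    // high/low surrogate pair as a single code point.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++; // skip the low surrogate; the pair is one code point
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane.
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                                // 2 (UTF-16 code units)
        Console.WriteLine(CountCodePoints(clef));                      // 1 (code point)
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X")); // 1D11E
    }
}
```

Note that string.Length reports code units, so it is not a reliable character count once text leaves the Basic Multilingual Plane.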

Normally, however, you only need to care about encoding when you convert from a string to a binary form (e.g. down a socket or to a file). At that point, you should specify the encoding explicitly - the string itself has no concept of this.
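As a rough illustration of specifying the encoding at the string-to-bytes boundary, the sketch below encodes one string two ways. The class name EncodingDemo and the sample text are assumptions for the example.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    public static void Main()
    {
        string text = "caf\u00E9"; // "café": four char values in memory

        // The same string yields different bytes under different encodings.
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] utf16 = Encoding.Unicode.GetBytes(text); // UTF-16 little-endian

        Console.WriteLine(utf8.Length);  // 5: 'é' takes two bytes in UTF-8
        Console.WriteLine(utf16.Length); // 8: two bytes per code unit

        // Decoding with the matching encoding restores the original string.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text); // True
    }
}
```

The string itself is identical in both cases; only the byte representation chosen at the boundary differs.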

The only aspect which "defaults" to UTF-8 is that there are plenty of .NET APIs which are overloaded to either accept an encoding or not, and if no encoding is specified, UTF-8 is used. File.ReadAllText is an example of this. However, after reading the file there's no distinction between "text which was read from a UTF-8 file" and "text which was read from a Big5 file" etc.
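The following sketch shows what that default can mean in practice: a file written as ISO-8859-1 is read once with no encoding argument and once with the matching encoding. The class name FileEncodingDemo, the RoundTrip helper, and the choice of ISO-8859-1 as the "other" encoding are assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileEncodingDemo
{
    // Hypothetical helper: writes text as raw bytes in writeEncoding, then reads
    // it back. A null readEncoding means "call File.ReadAllText with no encoding".
    public static string RoundTrip(string text, Encoding writeEncoding, Encoding readEncoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            // Write raw bytes so that no byte order mark is emitted.
            File.WriteAllBytes(path, writeEncoding.GetBytes(text));
            return readEncoding == null
                ? File.ReadAllText(path)
                : File.ReadAllText(path, readEncoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string text = "caf\u00E9";
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // No encoding given: the bytes are decoded as UTF-8, so 'é' is mangled.
        Console.WriteLine(RoundTrip(text, latin1, null) == text);   // False

        // Matching encoding given: the original text comes back.
        Console.WriteLine(RoundTrip(text, latin1, latin1) == text); // True
    }
}
```

Either way, the value returned is an ordinary string with no record of which encoding the file used.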

Jon Skeet
  • Surely you mean that the char values in Strings are sequences of 16-bit code **units** not points, since code **points** require 21 bits for full Unicode. I know *you* know better, but the UTF-16 Curse afflicts a lot of other programmers, and every chance to tell it to them straight is worth doing. One cannot store a Unicode character in a 16-bit char; it requires a 32-bit integer for that. – tchrist Aug 12 '11 at 01:23
  • @tchrist: I always forget which way round those two are, sorry - fixed now. I entirely agree it's worth getting it right. One of these days I'll come up with a mnemonic to avoid getting it wrong again... – Jon Skeet Aug 12 '11 at 05:21
  • On the mnemonic, maybe it would help that units have dimensions and points are dimensionless. UTF‐8 has 8‐bit code units, and UTF‐16 has 16‐bit code units, but code points themselves are abstract integers that don’t have bit‐widths. Yeah, ok, so it doesn’t make sense for several units to make up a point. Lemme think on this one for a bit. – tchrist Aug 12 '11 at 05:35
  • @tchrist: Units are building blocks, potentially? Maybe just having this discussion for long enough will help me to remember - but it would be nice to have a suitably pithy mnemonic to propagate to others :) – Jon Skeet Aug 12 '11 at 05:58