In this text, does the dot '•' count as a valid UTF-8 character, even though it takes 3 bytes, unlike the other characters, which take a single byte each?

ABCDEFGHIJ•XYZ
BeeGees
  • Every Unicode character is a valid UTF-8 character. In UTF-8 encoding a Unicode character can be between 1 and 4 bytes long. – ckuri Feb 13 '20 at 12:10
  • Thanks for pointing that out. But is there a calculation, something like "if the value falls within such-and-such ranges, then it is a UTF-8 character"? – BeeGees Feb 13 '20 at 12:18
  • @BeeGees The [Wikipedia page for UTF-8](https://en.wikipedia.org/wiki/UTF-8) has a good description of how it works. Code points between U+0000 and U+007F use a single byte. Those between U+0080 and U+07FF use two bytes. And so on – canton7 Feb 13 '20 at 12:21
  • Also, with regard to your [previous question](https://stackoverflow.com/questions/60202410/why-is-streamreader-and-sr-basestream-seek-giving-junk-characters-even-in-utf8), UTF-8 is a [self-synchronizing code](https://en.wikipedia.org/wiki/Self-synchronizing_code#Note), and you can check whether a byte `b` is the start of a UTF-8 character with `(b & 0b1000_0000) == 0 || (b & 0b1100_0000) == 0b1100_0000` (the parentheses are required in C#, because `==` binds tighter than `&`). – ckuri Feb 13 '20 at 12:21
  • Can I get a link to some working C# code that checks very quickly whether a text consists entirely of valid UTF-8 characters? – BeeGees Feb 13 '20 at 12:37
  • @BeeGees `new UTF8Encoding(false, true).GetString(bytes);` throws an exception if `bytes` is not valid UTF-8 – canton7 Feb 13 '20 at 12:45
  • For my own knowledge: what does `encoderShouldEmitUTF8Identifier = true` mean (you passed false for it here)? – BeeGees Feb 13 '20 at 13:01
  • @BeeGees Please read [the docs](https://learn.microsoft.com/en-us/dotnet/api/system.text.utf8encoding.-ctor?view=netframework-4.8#System_Text_UTF8Encoding__ctor_System_Boolean_System_Boolean_) – canton7 Feb 13 '20 at 13:18
  • @canton7 I know what it does. The point is, why would you want to emit the UTF-8 BOM in practice? Where is it actually used? – BeeGees Feb 13 '20 at 13:31
  • @BeeGees E.g. so that a text editor reading a file knows what encoding the file is in – canton7 Feb 13 '20 at 13:33
  • @BeeGees As an example, if the UTF-8 BOM is missing from a text file, then for backward compatibility Microsoft Excel (and many other Windows programs) will assume that file is encoded in localized ANSI encoding instead of UTF-8. – Mark Tolonen Feb 13 '20 at 17:14
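
The two checks suggested in the comments above could be wrapped up roughly as follows. This is only a sketch: the class and method names (`Utf8Checks`, `IsValidUtf8`, `IsStartByte`) are made up for illustration, and it assumes C# 7.0 or later for the binary literals. Validation relies on `UTF8Encoding` constructed with `throwOnInvalidBytes: true`, as in canton7's comment, and it decodes the whole buffer, so it is probably not the fastest possible approach.

```csharp
using System;
using System.Text;

public static class Utf8Checks
{
    // No BOM when encoding (first argument), throw on invalid bytes when decoding (second argument).
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Returns true if the whole buffer decodes as well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            Strict.GetString(bytes); // decoded string is discarded; only the exception matters
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Returns true if b has the bit pattern of a byte that begins a character:
    // 0xxxxxxx (ASCII) or 11xxxxxx (lead byte of a multi-byte sequence).
    // Continuation bytes have the pattern 10xxxxxx and return false.
    // This looks at the bit pattern only; it does not prove the surrounding sequence is valid.
    public static bool IsStartByte(byte b)
    {
        return (b & 0b1000_0000) == 0 || (b & 0b1100_0000) == 0b1100_0000;
    }
}
```

For the text in the question, `Utf8Checks.IsValidUtf8(Encoding.UTF8.GetBytes("ABCDEFGHIJ•XYZ"))` should come back true, while a buffer containing a lone continuation byte such as `0x80` should come back false.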

1 Answer


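Why not? The dot is BULLET (U+2022), which UTF-8 encodes as the three bytes E2 80 A2. (MESSAGE WAITING, U+0095, is a different character: a control code. 0x95 is the bullet's byte value in Windows-1252, not its Unicode code point.)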

http://www.fileformat.info/info/charset/UTF-8/list.htm

  • Thanks for the extensive list. Regards. But is there a calculation, something like "if the value falls within such-and-such ranges, then it is a UTF-8 character"? – BeeGees Feb 13 '20 at 12:17
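
As for the "calculation" asked about here, the range rule from the UTF-8 definition can be written down directly. The following is a minimal sketch, not library code: `Utf8Length.ByteCount` is a hypothetical helper that takes a Unicode code point and returns how many bytes UTF-8 needs for it, or -1 for values that cannot be encoded (anything outside U+0000 to U+10FFFF, and the surrogate range U+D800 to U+DFFF).

```csharp
using System;

public static class Utf8Length
{
    // Number of bytes UTF-8 uses for the given Unicode code point,
    // or -1 if the value is not an encodable code point.
    public static int ByteCount(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF) return -1;       // outside the Unicode range
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return -1;  // surrogates are not encodable
        if (codePoint <= 0x7F) return 1;    // U+0000..U+007F
        if (codePoint <= 0x7FF) return 2;   // U+0080..U+07FF
        if (codePoint <= 0xFFFF) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }
}
```

With this, `Utf8Length.ByteCount(0x2022)` gives 3 for the bullet in the question, and `Utf8Length.ByteCount('A')` gives 1.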