This is a basic question, but I can't find anything on it because I don't know what to search for; every attempt has returned unrelated results.

If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character? Does the following code work exactly as intended in all circumstances (that is, for any Strings, not just the ones in the example)?

Dim t1() As Byte = Text.Encoding.ASCII.GetBytes("Hello ")
Dim t2() As Byte = Text.Encoding.ASCII.GetBytes("World")

Dim msg As String = Text.Encoding.ASCII.GetString(t1.Concat(t2).ToArray)

Now msg should be "Hello World".

I would like this to work as I don't want to have to convert data I receive back to Strings in order to manipulate it before it is sent again.

What if I used something other than ASCII (like UTF-8, for example)?

Shuri2060

2 Answers


If I use Text.Encoding.ASCII.GetBytes to convert a string into ASCII, does each byte represent exactly one character?

Yes. ASCII is a 7-bit encoding and does not support multi-byte characters. Any Unicode codepoint above U+007F gets converted to a ? character when encoded with Text.Encoding.ASCII.
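As a small illustration of that replacement behaviour, here is a sketch that encodes a string containing one non-ASCII character (U+00E9, built with ChrW so the source file's own encoding doesn't matter). The module and variable names are only for the example.

```vbnet
Imports System
Imports System.Text

Module AsciiReplacementDemo
    Sub Main()
        ' "Caf" followed by U+00E9, which is outside the ASCII range.
        Dim original As String = "Caf" & ChrW(&HE9)
        Dim bytes() As Byte = Encoding.ASCII.GetBytes(original)

        ' Four characters in, four bytes out.
        Console.WriteLine(bytes.Length)                      ' 4
        ' The unsupported character became "?" (byte value 63).
        Console.WriteLine(bytes(3))                          ' 63
        Console.WriteLine(Encoding.ASCII.GetString(bytes))   ' Caf?
    End Sub
End Module
```

The original character cannot be recovered from the byte array afterwards; only the ? remains.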

If you were to use UTF-7 instead, for instance, it can encode individual Unicode codepoints into a sequence of multiple ASCII characters.

Does the following code work as exactly intended in all circumstances (for all Strings other than the examples)?

In your particular example, yes (provided you are using LINQ's Concat() method; there are other ways to concatenate arrays). There is no data loss.

But for other examples, just know that you will have data loss if you convert non-ASCII characters to ASCII, or otherwise mismatch encodings between GetBytes() and GetString().

You can certainly manipulate byte arrays. Just make sure the arrays are in the same encoding if you merge them together.
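To cover the UTF-8 part of the question, the sketch below merges two UTF-8 byte arrays and decodes the result twice: once with the matching encoding and once with a mismatched one. It assumes each part was encoded from a complete string; the names are illustrative only.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module Utf8ConcatDemo
    Sub Main()
        Dim first As String = "Caf" & ChrW(&HE9) & " "   ' contains U+00E9
        Dim second As String = "World"

        ' Encode both parts with the same encoding, then merge the bytes.
        Dim part1() As Byte = Encoding.UTF8.GetBytes(first)
        Dim part2() As Byte = Encoding.UTF8.GetBytes(second)
        Dim merged() As Byte = part1.Concat(part2).ToArray()

        ' Matching decoder: the text survives the round trip.
        Dim roundTrip As String = Encoding.UTF8.GetString(merged)
        Console.WriteLine(roundTrip = first & second)    ' True

        ' Mismatched decoder: the two UTF-8 bytes of U+00E9 are not ASCII.
        Dim mismatched As String = Encoding.ASCII.GetString(merged)
        Console.WriteLine(mismatched = first & second)   ' False
    End Sub
End Module
```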

Remy Lebeau
    If you prefer an exception to silent data loss via a replacement character (the default is ?) when encoding a character that ASCII doesn't support, you can [create your own encoder](https://msdn.microsoft.com/en-us/library/ms404377(v=vs.110).aspx#Exception) based on the standard ASCII encoder. – Tom Blodget Aug 29 '16 at 22:09
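One possible way to get that behaviour, as a sketch rather than the only approach, is to request the ASCII encoding with exception fallbacks through Encoding.GetEncoding. The variable name strictAscii is made up for the example.

```vbnet
Imports System
Imports System.Text

Module StrictAsciiDemo
    Sub Main()
        ' ASCII encoding that throws instead of substituting "?".
        Dim strictAscii As Encoding = Encoding.GetEncoding(
            "us-ascii",
            New EncoderExceptionFallback(),
            New DecoderExceptionFallback())

        Try
            ' U+00E9 has no ASCII representation, so this call throws.
            Dim bytes() As Byte = strictAscii.GetBytes("Caf" & ChrW(&HE9))
            Console.WriteLine("encoded " & bytes.Length & " bytes")
        Catch ex As EncoderFallbackException
            Console.WriteLine("not representable in ASCII")
        End Try
    End Sub
End Module
```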

.NET strings are counted sequences of UTF-16 code units (Char), one or two of which encode a Unicode codepoint (an Integer, as returned by Char.ConvertToUtf32). Some codepoints are "combining characters", which, when applied to a preceding "base character", form a grapheme (which is then rendered by a font into a glyph).

An encoder from Unicode to an encoding of another character set should attempt to preserve graphemes. In .NET, a grapheme is called a "text element."

So, yes, you can combine encoded byte sequences as long as you haven't defeated the encoder by converting parts of a grapheme into different byte sequences. If you are breaking a string in two before encoding, see the TextElementEnumerator and StringInfo classes.
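Here is a rough sketch of the failure case and one way around it. It uses a surrogate pair (U+1F600) as the simplest thing that can be cut in half, and StringInfo.ParseCombiningCharacters to find text element boundaries; the split positions and names are chosen only for the example.

```vbnet
Imports System
Imports System.Globalization
Imports System.Linq
Imports System.Text

Module SplitBeforeEncodingDemo
    Sub Main()
        ' "a" followed by U+1F600, which needs two UTF-16 code units.
        Dim s As String = "a" & Char.ConvertFromUtf32(&H1F600)

        ' Splitting at index 2 cuts the surrogate pair in half.
        Dim badLeft() As Byte = Encoding.UTF8.GetBytes(s.Substring(0, 2))
        Dim badRight() As Byte = Encoding.UTF8.GetBytes(s.Substring(2))
        Dim bad() As Byte = badLeft.Concat(badRight).ToArray()
        Console.WriteLine(Encoding.UTF8.GetString(bad) = s)    ' False

        ' Text element start indexes are safe places to split.
        Dim starts() As Integer = StringInfo.ParseCombiningCharacters(s)
        Dim cut As Integer = starts(1)   ' start of the second text element
        Dim goodLeft() As Byte = Encoding.UTF8.GetBytes(s.Substring(0, cut))
        Dim goodRight() As Byte = Encoding.UTF8.GetBytes(s.Substring(cut))
        Dim good() As Byte = goodLeft.Concat(goodRight).ToArray()
        Console.WriteLine(Encoding.UTF8.GetString(good) = s)   ' True
    End Sub
End Module
```

In the first split, each half of the surrogate pair is encoded on its own and replaced, so the merged bytes no longer decode to the original string.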

Tom Blodget