When I generate a random byte sequence, decode the sequence into a string representation, then encode it back to a byte array, it is different from the original encoded sequence. See example below:

[byte[]]$key = [byte[]]::new(32)
[System.Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($key)
$key

output: 15 173 198 89 162 161 144 104 125 86 154 204 166 238 193 40 51 58 167 0 150 118 37 203 198 161 64 229 101 25 176 201

$decoded = [System.Text.Encoding]::UTF8.GetString($key)
$encoded = [System.Text.Encoding]::UTF8.GetBytes($decoded)
$encoded

output: 15 239 191 189 239 191 189 89 239 191 189 239 191 189 239 191 189 104 125 86 239 191 189 204 166 239 191 189 239 191 189 40 51 58 239 191 189 0 239 191 189 118 37 239 191 189 198 161 64 239 191 189 101 25 239 191 189 239 191 189

The byte sequence was clearly modified by the decode/encode round trip. The process works fine if I use [System.Text.Encoding]::Unicode.... It seems that UTF8 can't handle certain bytes, but I was under the impression that UTF8 should be able to handle any character in the Unicode standard. Can someone explain why this happens? Please and thanks.

AKozak
  • You're starting out with random bytes, then trying to UTF8-*decode* them into a string. That doesn't make any sense since they are not the valid UTF-8 encoding of a valid string. There are many sequences of bytes that are impossible to achieve using UTF-8 encoding of a string, therefore they cannot be decoded into a string. I don't know what .net does in such a case but in other languages the decoder will simply replace undecodable byte sequences with a default character, thereby corrupting the data. – President James K. Polk Dec 29 '22 at 21:45
  • UTF8 can handle any character in the Unicode Standard. But neither Unicode nor UTF8 are random; they both have rules. Random bytes probably won't follow the rules and are neither Unicode nor UTF8. Think of other data formats, like JPEG or PDF. Those are made of bytes, right? But not random bytes; a bunch of random bytes probably won't be JPEG or PDF either. – Dour High Arch Dec 29 '22 at 21:51
  • https://blog.marcgravell.com/2013/02/how-many-ways-can-you-mess-up-io.html - encoding backwards – Marc Gravell Dec 29 '22 at 21:51
  • What are you actually trying to achieve with this? Is it just for academic purposes or is there a real, practical use case? – zett42 Dec 29 '22 at 21:59
  • @zett42 Purely academic. Just something I came across when messing around with the .net prng for generating aes keys. Never needed to put much thought into character encodings before, so I'm just curious. – AKozak Dec 29 '22 at 22:43
  • The frequently repeated sequence 239 191 189 or hex encoded 0xEFBFBD in the corrupted data is the UTF-8 encoding of the Unicode replacement character [U+FFFD](https://www.compart.com/en/unicode/U+FFFD) used when decoding failed. – Topaco Dec 30 '22 at 10:24

2 Answers

I'm not nearly an expert on encodings, but here are a few notes:

  1. From the Encoding.UTF8 docs:

    This property returns a UTF8Encoding object that encodes Unicode (UTF-16-encoded) characters into a sequence of one to four bytes per character, and that decodes a UTF-8-encoded byte array to Unicode (UTF-16-encoded) characters.

  2. Not every possible single byte represents a single valid character in UTF-8 encoding. UTF-8 is a variable-width character encoding standard that uses between one and four eight-bit bytes to represent all valid Unicode code points. If you check the wiki article's explanation of the encoding, you will see that a single byte covers only 128 code points (0-127), so the following already "breaks" the decode-encode round trip:

    using System.Text;
    // 128 (0x80) is not a valid single-byte sequence, so it decodes to U+FFFD.
    var s = Encoding.UTF8.GetString(new byte[] { 128 });
    var bytes1 = Encoding.UTF8.GetBytes(s); // [239, 191, 189]
    
  3. Personally, I would use the Convert.ToBase64String()/Convert.FromBase64String() pair (or Convert.ToHexString()/Convert.FromHexString(), if available) to convert the bytes to a string and back.
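Since the question uses PowerShell, here is a minimal sketch of notes 2 and 3 in that language. It is an illustration rather than part of the original answer, and the variable names ($invalid, $codePoint, $reEncoded, $roundTripped, $same) are made up for this example. It first shows a single invalid byte turning into the replacement character, then a Base64 round trip of a random 32-byte key.

```powershell
# Note 2: a lone byte 128 (0x80) is not valid UTF-8 on its own.
[byte[]] $invalid = @(128)

# The default UTF-8 decoder substitutes U+FFFD (the replacement character).
$decoded   = [System.Text.Encoding]::UTF8.GetString($invalid)
$codePoint = 'U+{0:X4}' -f [int]($decoded[0])   # -> 'U+FFFD'

# Re-encoding U+FFFD produces three bytes, not the original one.
$reEncoded = [System.Text.Encoding]::UTF8.GetBytes($decoded)   # -> 239 191 189

# Note 3: Base64 maps arbitrary bytes to text and back without loss.
[byte[]] $key = [byte[]]::new(32)
$rng = [System.Security.Cryptography.RandomNumberGenerator]::Create()
$rng.GetBytes($key)
$rng.Dispose()

$text = [Convert]::ToBase64String($key)                # 44 characters for 32 bytes
[byte[]] $roundTripped = [Convert]::FromBase64String($text)

# Compare the two arrays element by element via their joined text form.
$same = ($key -join ',') -eq ($roundTripped -join ',')   # -> True

$codePoint
$reEncoded
$same
```

The Base64 string is about a third longer than the raw bytes, which is the price of restricting the output to printable ASCII characters.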

Guru Stron

Guru Stron's helpful answer explains why there is no good reason to bring text encodings into the mix, given that you're dealing with random, arbitrary bytes.

Your best bet for converting byte arrays ([byte[]]) to and from strings is to use a separator-less list of two-digit hex representations, using [System.BitConverter]::ToString() in Windows PowerShell, or, more easily, in PowerShell (Core) v7.1+, [System.Convert]::ToHexString() and [System.Convert]::FromHexString():

# Simple sample input byte array.
[byte[]] $bytes = 9, 65, 66 # same as: 0x9, 0x41, 0x42

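# PowerShell (Core) 7.1+ / .NET 5+ solution
# (7.0 runs on .NET Core 3.1, which lacks these methods, so test for 7.1+):
$psv = $PSVersionTable.PSVersion
if ($psv.Major -gt 7 -or ($psv.Major -eq 7 -and $psv.Minor -ge 1)) {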

  # Convert the byte array to a "byte string"
  # -> '094142'
  $byteString = [Convert]::ToHexString($bytes)

  # Convert the "byte string" back to a [byte[]] array.
  $bytesAgain = [Convert]::FromHexString($byteString)

}
# Windows PowerShell solution:
else {

  # Convert the byte array to a "byte string".
  # [BitConverter]::ToString() uses '-' as a separator, which
  # -replace '-' removes.
  # -> '094142'
  $byteString = [BitConverter]::ToString($bytes) -replace '-'

  # Convert the "byte string" back to a [byte[]] array: manual parsing required.
  [byte[]] $bytesAgain = -split ($byteString -replace '..', '0x$& ')  
}

# Show results:
[pscustomobject] @{
  InputBytes = $bytes
  ByteString = $byteString
  ReconvertedBytes = $bytesAgain
}

Output:

InputBytes  ByteString ReconvertedBytes
----------  ---------- ----------------
{9, 65, 66} 094142     {9, 65, 66}
  • For an explanation of the -split ($byteString -replace '..', '0x$& ') operation used above, see this answer.
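
As a rough guide to that one-liner, the sketch below splits it into its three steps. The intermediate variable names ($spaced, $tokens) are introduced here for illustration only and do not appear in the original command.

```powershell
$byteString = '094142'

# Step 1: the regex '..' matches each successive pair of characters, and the
# replacement '0x$& ' rewrites each match ($&) as '0x' + the pair + a space.
$spaced = $byteString -replace '..', '0x$& '    # -> '0x09 0x41 0x42 '

# Step 2: the unary form of -split splits on runs of whitespace and ignores
# leading and trailing whitespace, so the trailing space yields no empty token.
$tokens = -split $spaced                        # -> '0x09', '0x41', '0x42'

# Step 3: the [byte[]] type constraint converts each token; PowerShell's
# string-to-number conversion recognizes the '0x' prefix as hexadecimal.
[byte[]] $bytes = $tokens                       # -> 9, 65, 66

$bytes
```

Because '..' only matches complete pairs, this assumes an even-length string of hex digits; a trailing odd character would pass through without the 0x prefix, and a non-hex character would make the cast in step 3 fail.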
mklement0