
I have a script originally written in PowerShell 5.1. It uses Tee-Object to write to a log file.

Upon upgrading to 7.3, when the script appends text to the log, the newly appended text appears to be Chinese. The prior log data still looks okay, but anything new is illegible.

I read that Tee-Object now uses a different encoding, so that explains why it happened, but is there a simple way to recover the new log data in the file?

2 Answers


tl;dr

The following retrieves the mis-encoded part from the log file, and decodes its UTF-16LE byte representation as UTF-8 in order to restore the original text (in memory):

$restoredLines = 
  [Text.Encoding]::UTF8.GetString(
    [Text.Encoding]::Unicode.GetBytes(
      (Get-Content -Last 1 yourFile.log).TrimEnd([char] 0xFFFD)
    )
  )
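
To see the core re-decoding step in isolation, here's a minimal in-memory sketch with a made-up sample string (no file involved); it mimics UTF-8 bytes being misread as UTF-16LE and then restored:

# Hypothetical sample text; 12 UTF-8 bytes, i.e. an even number.
$original = 'Hello, log!!'
# Misinterpret the UTF-8 bytes as UTF-16LE - this is what the garbled file content looks like.
$garbled  = [Text.Encoding]::Unicode.GetString([Text.Encoding]::UTF8.GetBytes($original))
# Reverse the misinterpretation: get the UTF-16LE bytes back, then decode them as UTF-8.
$restored = [Text.Encoding]::UTF8.GetString([Text.Encoding]::Unicode.GetBytes($garbled))
$restored -eq $original   # -> $true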

If you want to rewrite the entire file with a consistent encoding (make a backup copy first):

# Note: In PowerShell (Core), not using an -Encoding parameter 
#       on output creates a BOM-less UTF-8 file.
#       To create a UTF-8 file *with a BOM*, use -Encoding utf8BOM
(Get-Content yourFile.log) |
  Select-Object -SkipLast 1 |
  Set-Content yourFile.log
Add-Content yourFile.log -Value $restoredLines.TrimEnd("`r", "`n")
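
As an optional, purely illustrative sanity check, you can inspect the first bytes of the rewritten file to confirm that the UTF-16LE BOM (0xFF 0xFE) is gone from the start:

# Show the first 16 bytes; a UTF-16LE file would start with FF FE.
Format-Hex yourFile.log | Select-Object -First 1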

  • Indeed, Windows PowerShell and PowerShell (Core) 7+ use different default encodings:

    • In Windows PowerShell, there is no consistent default encoding across cmdlets, and Tee-Object invariably uses "Unicode" (UTF-16LE, i.e. UTF-16 in Little-Endian byte order).

      • For an overview of the varying encodings in use in Windows PowerShell, see the bottom section of this answer.
    • PowerShell (Core) fortunately now consistently defaults to (BOM-less) UTF-8; additionally, Tee-Object now supports an -Encoding parameter to enable use of a different encoding.

  • Therefore, an output file created with Tee-Object in Windows PowerShell will be UTF-16LE-encoded, and when you append to it with Tee-Object -Append in PowerShell (Core) (without also using -Encoding Unicode; see the example after this list), the new content will be UTF-8-encoded, resulting in an unintended mix of two character encodings in the file.

    • Apparently, Tee-Object makes no attempt to detect the file's existing encoding, even though it could do so, the way that Add-Content does.

    • The bytes of the variable-width UTF-8 encoding (a single byte per character for ASCII-range text), when misinterpreted as UTF-16LE, result in seemingly random characters, typically from East Asian scripts, which is what you saw. The reason is that pairs of bytes are interpreted as a single character (Unicode code unit).

  • The solution above works as follows:

    • It relies on the fact that when Get-Content reads the file into .NET strings, it reads it as UTF-16LE - in both PowerShell editions, due to the BOM (byte-order mark) that Windows PowerShell placed at the start of the file.

    • Any original newlines (CRLF sequences) in the mis-encoded part are not recognized as such, due to the misinterpretation of the UTF-8 bytes, so that all lines that were added with the wrong encoding appear as a single line that is the last one (assuming that all mis-encoded content is at the end of the file), which -Last 1 retrieves.

    • The resulting misinterpreted string can then be reinterpreted as UTF-8 by first obtaining the string's UTF-16LE byte representation, and then re-decoding those bytes as UTF-8, which restores the original text.

      • Applying .TrimEnd([char] 0xFFFD) to the misinterpreted string is necessary, because if the number of bytes in the mis-encoded content happens to be an odd number, it leaves a stray single byte at the end that isn't a valid UTF-16LE character (a UTF-16LE character (code unit) requires two bytes) and is therefore parsed as "�" (REPLACEMENT CHARACTER, U+FFFD).

      • Note:

        • A stray byte at the end by definition is the second half of a CRLF sequence in the original text, which means that the .TrimEnd() call effectively removes a trailing newline.
        • By contrast, if the number of bytes happens to be even, a trailing newline is restored - hence the call to .TrimEnd("`r", "`n") above to remove it, if present, given that Add-Content itself appends a trailing newline.
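
To avoid the problem going forward (rather than repairing it after the fact), one option - sketched here with a hypothetical log line and file name - is to make Tee-Object match the file's existing UTF-16LE encoding when appending:

# Append in the file's original encoding (UTF-16LE, i.e. "Unicode").
'new log entry' | Tee-Object -FilePath yourFile.log -Append -Encoding unicode

Alternatively, once the entire file has been rewritten as UTF-8 as shown above, appending with Tee-Object's default (UTF-8) encoding works as-is.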
mklement0
  • I confirmed that works even with multiple lines of each encoding. I'm not even 100% sure how it works. – js2010 Jul 11 '23 at 16:30
  • @js2010, I've added an explanation to the answer. – mklement0 Jul 11 '23 at 17:15
  • The more I read the explanation, the more confused I get. :) Thank you. – Jerkle Berry Jul 12 '23 at 00:57
  • Glad to hear it helped, @JerkleBerry. I know there's a lot of information packed into the answer, and it's hard to cover all angles succinctly. If something can be made clearer, feel free to suggest. – mklement0 Jul 12 '23 at 01:12

It would be something like this, but you'd have to know how many bytes to skip (6 in this case). Or try -Skip 2 to skip just the encoding signature (the BOM), and edit the result. These are PowerShell 7 Get-Content / Set-Content options. I think appending without using Add-Content is dangerous in all commands, since the file's existing encoding isn't checked.

Get-Content file -AsByteStream | Select-Object -Skip 6 | Set-Content file2 -AsByteStream
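
To take the guesswork out of the byte count, a rough sketch like the following (file name is a placeholder) checks whether the file actually starts with the UTF-16LE BOM (0xFF 0xFE) before deciding how much to skip:

# Read just the first two bytes and test for the UTF-16LE byte-order mark.
$firstTwo = Get-Content file -AsByteStream -TotalCount 2
if ($firstTwo[0] -eq 0xFF -and $firstTwo[1] -eq 0xFE) {
  'UTF-16LE BOM found - skip at least the first 2 bytes'
}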
js2010