tl;dr
The following retrieves the mis-encoded part from the log file, and decodes its UTF-16LE byte representation as UTF-8 in order to restore the original text (in memory):
$restoredLines =
  [Text.Encoding]::UTF8.GetString(
    [Text.Encoding]::Unicode.GetBytes(
      (Get-Content -Last 1 yourFile.log).TrimEnd([char] 0xFFFD)
    )
  )
If you want to rewrite the entire file with a consistent encoding (make a backup copy first):
# Note: In PowerShell (Core), not using an -Encoding parameter
#       on output creates a BOM-less UTF-8 file.
#       To create a UTF-8 file *with a BOM*, use -Encoding utf8BOM
(Get-Content yourFile.log) |
  Select-Object -SkipLast 1 |
  Set-Content yourFile.log
Add-Content yourFile.log -Value $restoredLines.TrimEnd("`r", "`n")
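If you want to verify the result, you can inspect the rewritten file's first bytes - a quick sketch, assuming PowerShell (Core), whose Get-Content supports -AsByteStream (in Windows PowerShell, use -Encoding Byte instead):
# A UTF-16LE file with a BOM starts with 0xFF 0xFE;
# a BOM-less UTF-8 file starts directly with the first character's bytes.
Get-Content -AsByteStream -TotalCount 2 yourFile.log |
  ForEach-Object { '0x{0:X2}' -f $_ }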
Indeed, Windows PowerShell and PowerShell (Core) 7+ use different default encodings:
- In Windows PowerShell, there is no consistent default encoding across cmdlets, and Tee-Object invariably uses "Unicode" (UTF-16LE, i.e. UTF-16 in little-endian byte order).
  - For an overview of the varying encodings used in Windows PowerShell, see the bottom section of this answer.
- PowerShell (Core), fortunately, now consistently defaults to (BOM-less) UTF-8; additionally, Tee-Object now supports an -Encoding parameter to enable use of a different encoding.
Therefore, an output file created with Tee-Object in Windows PowerShell will be UTF-16LE-encoded, and when you append to it with Tee-Object -Append in PowerShell (Core) (without also using -Encoding Unicode), the new content will be UTF-8-encoded, resulting in an unintended mix of two character encodings in the file.
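A minimal repro sketch of how such a mixed-encoding file comes about (yourFile.log as the placeholder name):
# Step 1 - in *Windows PowerShell*: the file is created as UTF-16LE, with a BOM.
'line 1' | Tee-Object -FilePath yourFile.log
# Step 2 - in *PowerShell (Core) 7+*: the appended bytes are (BOM-less) UTF-8.
'line 2' | Tee-Object -FilePath yourFile.log -Append
# On reading, the BOM makes the whole file - including the appended
# UTF-8 bytes - be interpreted as UTF-16LE.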
Apparently, Tee-Object makes no attempt to detect the file's existing encoding, even though it could do so, the way that Add-Content does.
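As a workaround, you can perform that detection yourself before appending - a sketch for PowerShell (Core), whose Tee-Object supports -Encoding:
# Check for the UTF-16LE BOM (0xFF 0xFE) and choose the matching encoding.
$firstBytes = Get-Content -AsByteStream -TotalCount 2 yourFile.log
$encoding =
  if ($firstBytes.Count -eq 2 -and $firstBytes[0] -eq 0xFF -and $firstBytes[1] -eq 0xFE) {
    'Unicode' # UTF-16LE
  } else {
    'utf8'    # (BOM-less) UTF-8
  }
'new line' | Tee-Object -FilePath yourFile.log -Append -Encoding $encoding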
The bytes of the (variable-width) UTF-8 encoding, when misinterpreted as UTF-16LE, result in seemingly random characters, typically from East Asian scripts, which is what you saw. The reason is that pairs of bytes are interpreted as a single character (UTF-16 code unit).
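A quick in-memory demonstration of this effect (which characters appear depends on the specific byte values):
# The 4 UTF-8 bytes of 'Hell' (0x48 0x65 0x6C 0x6C), read as UTF-16LE,
# pair up into the 2 code units 0x6548 and 0x6C6C - two CJK characters.
[Text.Encoding]::Unicode.GetString([Text.Encoding]::UTF8.GetBytes('Hell'))  # -> '效汬'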
The solution above works as follows:
- It relies on the fact that when Get-Content reads the file into .NET strings, it reads it as UTF-16LE - in both PowerShell editions, due to the BOM (byte-order mark) that Windows PowerShell placed at the start of the file.
- Any original newlines (CRLF sequences) in the mis-encoded part are not recognized as such, due to the misinterpretation of the UTF-8 bytes, so that all lines that were added with the wrong encoding appear as a single line that is the last one (assuming that all mis-encoded content is at the end of the file), which -Last 1 retrieves.
- The resulting misinterpreted string can then be reinterpreted as UTF-8 by first obtaining the string's UTF-16LE byte representation, and then re-decoding those bytes as UTF-8, which restores the original text (see the demo after this list).
- Applying .TrimEnd([char] 0xFFFD) to the misinterpreted string is necessary, because if the number of bytes in the mis-encoded content happens to be an odd number, it leaves a stray single byte at the end that isn't a valid UTF-16LE character (a UTF-16LE character (code unit) requires two bytes) and is therefore parsed as � (REPLACEMENT CHARACTER, U+FFFD).
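Both cases can be demonstrated in memory, with arbitrary sample strings standing in for the file content:
# Even byte count: the misinterpretation round-trips losslessly.
$misread = [Text.Encoding]::Unicode.GetString([Text.Encoding]::UTF8.GetBytes('Grüß')) # 6 bytes
[Text.Encoding]::UTF8.GetString([Text.Encoding]::Unicode.GetBytes($misread))          # -> 'Grüß'
# Odd byte count: the stray trailing byte decodes to U+FFFD,
# which .TrimEnd([char] 0xFFFD) then removes.
$misread = [Text.Encoding]::Unicode.GetString([Text.Encoding]::UTF8.GetBytes('abc'))  # 3 bytes
$misread[-1] -eq [char] 0xFFFD                                                        # -> $true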
Note:
- A stray byte at the end is by definition the second half of a CRLF sequence in the original text, which means that the .TrimEnd() call effectively removes a trailing newline.
- By contrast, if the number of bytes happens to be even, a trailing newline is restored - hence the call to .TrimEnd("`r", "`n") above to remove it, if present, given that Add-Content itself appends a trailing newline.
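To illustrate the even-byte-count case with a trailing CRLF - a minimal in-memory sketch:
# 'hi' + CRLF is 4 UTF-8 bytes (an even number), so the round trip restores
# the trailing newline - which .TrimEnd("`r", "`n") then strips, given that
# Add-Content appends a newline of its own.
$misread  = [Text.Encoding]::Unicode.GetString([Text.Encoding]::UTF8.GetBytes("hi`r`n"))
$restored = [Text.Encoding]::UTF8.GetString([Text.Encoding]::Unicode.GetBytes($misread))
$restored.EndsWith("`r`n")  # -> $true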