Issues with specific characters in outfile

Question

I have a script that merges files, and it works fine, but characters like å, ä and ö come out garbled in the output file.

Here is the complete script:

$startOfToday = (Get-Date).Date
Get-ChildItem "C:\TEST" -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday | ForEach-Object {gc $_; ""} | 
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"

In the source files the text looks like this, for example:

Order ID 1

Order ID 2

This is för får

In the output file, the last row comes out like this:

Order ID 1

Order ID 2

fÃ¥r fÃ¶r fÃ¤r

Is there a way to make those characters appear in the output file as they appear in the source files?

Try `Set-Content` instead of `Out-File`. `Out-File` defaults to UTF16LE encoding. My thoughts are if `Get-Content` reads the data as intended then `Set-Content` should be just as intelligent? I could be wrong. — AdminOfThings, Feb 01 '21 at 14:54
How are the source files encoded? If you remove the `| Out-File ...` part of the pipeline, does `ö` and `å` render correctly in your terminal? — Mathias R. Jessen, Feb 01 '21 at 14:55
@AdminOfThings, `Set-Content` is effective _in this specific scenario_, because it applies the same misinterpretation that `Get-Content` applies on reading also on writing, pass-through style, but it's important to note that _in memory_ the strings will be incorrectly represented. — mklement0, Feb 01 '21 at 15:07

mklement0 · Accepted Answer · 2021-02-01T15:12:29.817

The symptom implies that your input files are UTF-8-encoded without a BOM. Windows PowerShell (mis)interprets such files as ANSI-encoded (using the system's active ANSI code page, such as Windows-1252).
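The misinterpretation can be reproduced in memory without touching any files. The following is a minimal sketch, assuming the active ANSI code page is Windows-1252; the variable names are illustrative only and the sample text is taken from the garbled output in the question.

```powershell
# Build the sample text from code points so that this script's own
# file encoding cannot interfere with the demonstration.
$text = "f$([char]0xE5)r f$([char]0xF6)r f$([char]0xE4)r"

# The bytes a BOM-less UTF-8 file would hold on disk.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# Assumption: the system's active ANSI code page is Windows-1252.
$ansi = [System.Text.Encoding]::GetEncoding(1252)

# Decoding UTF-8 bytes as ANSI turns each two-byte character into two characters.
$misread = $ansi.GetString($utf8Bytes)
$misread    # fÃ¥r fÃ¶r fÃ¤r

# Decoding the same bytes as UTF-8 recovers the original text.
$correct = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)
$correct -eq $text    # True
```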

The solution is to tell gc (Get-Content) explicitly what encoding to use, via the -Encoding parameter:

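# In the date format, 'MM' is the month ('mm' is minutes), and '-' avoids
# separators that could be treated as part of the path.
$startOfToday = (Get-Date).Date
Get-ChildItem C:\TEST -Include *.* -Recurse |
  Where-Object LastWriteTime -gt $startOfToday |
    ForEach-Object { Get-Content -Encoding Utf8 $_; "" } |
      Out-File "C:\$(Get-Date -Format 'yyyy-MM-dd').txt"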

Note that PowerShell never preserves the input encoding automatically. Therefore, unless you pass -Encoding to Out-File, its default encoding is used, which is "Unicode" (UTF-16LE) in Windows PowerShell.

While PowerShell (Core) 7+ also doesn't preserve input encodings, it consistently defaults to BOM-less UTF-8, so your original code would work as-is there.

For more information about default encodings in Windows PowerShell vs. PowerShell (Core) 7+, see this answer.
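If you would rather not depend on either edition's default, you can state the encoding explicitly on both the reading and the writing side. The following is a minimal sketch that round-trips one string through a temporary file; the file name encoding-demo.txt and the variable names are made up for illustration. Note that in Windows PowerShell, -Encoding utf8 on Out-File writes a BOM, whereas in PowerShell (Core) 7+ it does not.

```powershell
# Sample text built from a code point, so the script's own encoding does not matter.
$sample = "f$([char]0xE5)r"

# Hypothetical scratch file in the user's temp directory.
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'encoding-demo.txt'

# Write with an explicit encoding instead of relying on the default.
$sample | Out-File -FilePath $tmp -Encoding utf8

# Read back with the same explicit encoding.
$roundTrip = Get-Content -Path $tmp -Encoding utf8

# The string survives unchanged.
$roundTrip -eq $sample

Remove-Item -Path $tmp
```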

Note: As AdminOfThings suggests in a comment, simply replacing Out-File with Set-Content in your original code also works in this particular case, because the same misinterpretation of the encoding is then performed on both in- and output, and the data is simply being passed through. This isn't a general solution, however, notably not if you need to process the strings in memory first, before saving them to a file.
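To illustrate why the pass-through approach is fragile, the sketch below simulates it in memory, again assuming Windows-1252 as the ANSI code page and using illustrative variable names. The bytes that would be written match the original bytes, but the intermediate string is not the text you think you have.

```powershell
# Assumption: the active ANSI code page is Windows-1252.
$ansi = [System.Text.Encoding]::GetEncoding(1252)

# Three-character sample; the middle character is U+00E5.
$text = "f$([char]0xE5)r"
$originalBytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# What a misinterpreting read produces in memory.
$inMemory = $ansi.GetString($originalBytes)

# What writing that string back as ANSI produces on disk.
$writtenBytes = $ansi.GetBytes($inMemory)

# The bytes survive the round trip, so the output file looks correct ...
$bytesMatch = [System.BitConverter]::ToString($writtenBytes) -eq
              [System.BitConverter]::ToString($originalBytes)
$bytesMatch          # True

# ... but the in-memory string has 4 characters instead of 3, so any
# string processing (length checks, matching, case changes) sees wrong data.
$inMemory.Length     # 4
$text.Length         # 3
```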
