I have a script that merges files, and that works fine, but characters like å, ä, and ö are garbled in the output file.

Here is the complete script:

$startOfToday = (Get-Date).Date
Get-ChildItem "C:\TEST" -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday | ForEach-Object {gc $_; ""} | 
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"

In the input files, the text looks like this, for example:

Order ID 1

Order ID 2

This is för får

In the output, the last row comes out like this:

Order ID 1

Order ID 2

får för fär

Is there a way to make those characters appear in the output file as they appear in the input files?

Jonathan Hall
Henrik Rosqvist
  • Try `Set-Content` instead of `Out-File`. `Out-File` defaults to UTF16LE encoding. My thoughts are if `Get-Content` reads the data as intended then `Set-Content` should be just as intelligent? I could be wrong. – AdminOfThings Feb 01 '21 at 14:54
  • How are the source files encoded? If you remove the `| Out-File ...` part of the pipeline, does `ö` and `å` render correctly in your terminal? – Mathias R. Jessen Feb 01 '21 at 14:55
  • Set-Content did it! Thanks – Henrik Rosqvist Feb 01 '21 at 14:58
  • @AdminOfThings, `Set-Content` is effective _in this specific scenario_, because it applies the same misinterpretation that `Get-Content` applies on reading also on writing, pass-through style, but it's important to note that _in memory_ the strings will be incorrectly represented. – mklement0 Feb 01 '21 at 15:07

1 Answer

The implication is that your input files are UTF-8-encoded without a BOM, which in Windows PowerShell are (mis)interpreted to be ANSI-encoded (using the system's active ANSI code page, such as Windows-1252).
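The misinterpretation can be reproduced in memory, without touching any files. This is only a sketch: it assumes the active ANSI code page is Windows-1252 (yours may differ), and it builds the sample text from character codes so that the encoding of the script file itself cannot affect the result.

```powershell
# Sample text "för får", built from code points (0xF6 = ö, 0xE5 = å).
$original = "f$([char]0xF6)r f$([char]0xE5)r"

# The bytes a BOM-less UTF-8 file would contain for that text.
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($original)

# Assumption: the system's ANSI code page is Windows-1252.
$ansi = [System.Text.Encoding]::GetEncoding(1252)

# Decoding UTF-8 bytes as ANSI turns each two-byte character into two characters.
$misread = $ansi.GetString($utf8Bytes)
$misread   # fÃ¶r fÃ¥r
```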

The solution is to tell gc (Get-Content) explicitly what encoding to use, via the -Encoding parameter:

Get-ChildItem C:\TEST -include *.* -Recurse |
  Where-Object LastWriteTime -gt $startOfToday | 
    ForEach-Object { Get-Content -Encoding Utf8 $_; ""} | 
      Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"

Note that PowerShell never preserves the input encoding automatically. Therefore, unless you pass -Encoding to Out-File, its default encoding is used, which is "Unicode" (UTF-16LE) in Windows PowerShell.
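If you want UTF-8 output rather than UTF-16LE, you can name the encoding on both the reading and the writing side. The following is a minimal sketch using throwaway files in the temp directory; the file names in.txt and out.txt are made up for illustration.

```powershell
# Hypothetical scratch directory and file names, for illustration only.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$inFile  = Join-Path $dir 'in.txt'
$outFile = Join-Path $dir 'out.txt'

# Create a BOM-less UTF-8 input file containing "This is för får".
$text = "This is f$([char]0xF6)r f$([char]0xE5)r"
[System.IO.File]::WriteAllText($inFile, $text, (New-Object System.Text.UTF8Encoding $false))

# Read as UTF-8 and write as UTF-8, instead of relying on either default.
# Out-File -Encoding Utf8 writes a BOM in Windows PowerShell and no BOM in PowerShell 7+.
Get-Content -Encoding Utf8 $inFile | Out-File -Encoding Utf8 $outFile

# Read the result back to confirm the text survived the round trip.
$roundTripped = Get-Content -Encoding Utf8 $outFile
$roundTripped -ceq $text

Remove-Item -Recurse -Force $dir
```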

While PowerShell (Core) 7+ also doesn't preserve input encodings, it consistently defaults to BOM-less UTF-8, so your original code would work as-is there.

For more information about default encodings in Windows PowerShell vs. PowerShell (Core) 7+, see this answer.


Note: As AdminOfThings suggests in a comment, simply replacing Out-File with Set-Content in your original code also works in this particular case, because the same misinterpretation of the encoding is then performed on both in- and output, and the data is simply being passed through. This isn't a general solution, however, notably not if you need to process the strings in memory first, before saving them to a file.
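To see why the pass-through happens to work while the in-memory strings are still wrong, the read and write steps can be simulated with .NET encodings. This is a sketch under the assumption that the ANSI code page is Windows-1252; it imitates what Get-Content and Set-Content do by default in Windows PowerShell rather than calling them.

```powershell
# "för", built from a code point so the script file's encoding does not matter.
$original = "f$([char]0xF6)r"

# Assumption: the system's ANSI code page is Windows-1252.
$ansi = [System.Text.Encoding]::GetEncoding(1252)

# Simulated read: UTF-8 bytes decoded as ANSI.
$inMemory = $ansi.GetString([System.Text.Encoding]::UTF8.GetBytes($original))
$inMemory.Length                          # 4, although the real text has 3 characters
$inMemory.Contains([string][char]0xF6)    # False: there is no "ö" to search for or replace

# Simulated write: encoding the misread string as ANSI reproduces the original UTF-8 bytes.
$bytesWritten = $ansi.GetBytes($inMemory)
$restored = [System.Text.Encoding]::UTF8.GetString($bytesWritten)
$restored -ceq $original                  # True
```

Any processing between the read and the write, such as matching or replacing "ö", operates on the misread string and therefore does not see the character.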

mklement0