PowerShell Set-Content encoding

Question

Part of my script looks like this:

$template = Get-Content "./template/temaplate.htm" -raw
$html = $template.Replace('{{imie}}', $imie).Replace('{{nazwisko}}', $nazwisko).Replace('{{stanowisko}}', $stanowisko).Replace('{{mobile}}', $mobile).Replace('{{kapital}}', $kapital).Replace('{{telefon}}', $telefon)
Set-Content -Encoding UTF8 "output/podpis.htm" -Value $html

temaplate.htm contains words such as "Sąd" or "Wrocław", but after running Set-Content all Polish special characters are garbled ("SÄ…d", "WrocĹ‚aw"). I don't really understand why. The template also sets:

<meta charset="UTF-8">

What PowerShell version/host/terminal are you using? See https://stackoverflow.com/a/57134096/1701026 — iRon, Mar 21 '23 at 13:43
The font doesn't support the characters. When you use an encoding, it makes the data smaller in size, because instead of using a two-byte character only one byte is used. So each type of encoding only has 256 characters (one byte). The characters 0x80 to 0xFF are Unicode characters (two bytes) that are being represented as one byte. If you have a French font and a German font, the data will be displayed differently, because the same byte is displayed differently. You simply need to change the font to solve your issue. — jdweng, Mar 21 '23 at 14:04
@jdweng, the problem is unrelated to fonts (all characters in question render fine with the default font in console windows). It is solely one of misinterpreted character encoding: a BOM-less UTF-8 file is being misread as ANSI-encoded. Also, _fonts_ never change the _interpretation_ of characters (only code pages / character encodings do) - but they can make a difference with respect to whether a given character can be _rendered or not_. Your comment is misleading. — mklement0, Mar 21 '23 at 16:10
@mklement0: Where does it say that the characters are fine with the default font in Windows? It looks like the htm displays correctly. — jdweng, Mar 21 '23 at 16:31
@jdweng, please reflect on the feedback that you have been given; don't create even more unhelpful comments with non sequiturs. Your first comment was worth responding to, because it can mislead others. Your follow-up comment isn't worth responding to, except with this meta plea. In the future, I won't respond to non-sequitur follow-up comments, unless they have the potential to mislead others too. — mklement0, Mar 21 '23 at 16:39

mklement0 · Accepted Answer · 2023-03-21T16:11:14.123

Your symptom implies:

- Your file is UTF-8-encoded but doesn't have a BOM.
- You're using Windows PowerShell, where Get-Content defaults to the system's active ANSI code page, and therefore misinterprets your file:[1]
  - Note that Get-Content does not try to interpret the content of the file, and therefore the presence of <meta charset="UTF-8"> inside it is irrelevant.
    All that matters is whether the file starts with a Unicode BOM (which unequivocally identifies the character encoding) or not (in which case an encoding must be assumed).
  - Using -Encoding utf8 only with Set-Content is then too late, because the misinterpretation has already happened.

Note that you would not have this problem in PowerShell (Core) 7+, which consistently defaults to (BOM-less) UTF-8.

Therefore, use -Encoding utf8 also in your Get-Content call:

$template = Get-Content -Encoding UTF8 "./template/temaplate.htm" -Raw
# ...
Set-Content -Encoding UTF8 "output/podpis.htm" -Value $html

Caveat:

In Windows PowerShell, Set-Content -Encoding UTF8 invariably creates a UTF-8 file with BOM. If that is undesired, use New-Item as a workaround:

# Creates a BOM-less UTF-8 file even in Windows PowerShell.
New-Item -Force "output/podpis.htm" -Value $html

(Again, in PowerShell (Core) 7+ you wouldn't have that problem: all cmdlets there create BOM-less UTF-8 files by default; -Encoding utf8bom is needed to explicitly request a BOM.)
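If you want to check whether a given file actually starts with a UTF-8 BOM (the bytes EF BB BF), a minimal sketch along the following lines may help. Test-Utf8Bom is a hypothetical helper name, not a built-in cmdlet, and the demo file names are made up. The demo files are written byte by byte so that the result does not depend on any cmdlet's default encoding.

```powershell
# Hypothetical helper (not a built-in cmdlet): reports whether a file starts
# with the UTF-8 BOM, i.e. the bytes EF BB BF.
function Test-Utf8Bom {
    param([Parameter(Mandatory = $true)][string] $Path)
    # .NET methods resolve relative paths against the process working directory,
    # so convert to a full file-system path first.
    $fullPath = Convert-Path -LiteralPath $Path
    $bytes = [IO.File]::ReadAllBytes($fullPath)
    return ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Demo files with made-up names in the temp directory.
$tempDir    = [IO.Path]::GetTempPath()
$withBom    = Join-Path $tempDir 'bom-demo-with.htm'
$withoutBom = Join-Path $tempDir 'bom-demo-without.htm'

[byte[]] $bom     = 0xEF, 0xBB, 0xBF
# "S" + a-with-ogonek + "d", built from a code point to keep this script ASCII-only.
[byte[]] $payload = [Text.Encoding]::UTF8.GetBytes("S$([char]0x0105)d")
[byte[]] $both    = $bom + $payload

[IO.File]::WriteAllBytes($withBom, $both)
[IO.File]::WriteAllBytes($withoutBom, $payload)

$hasBomWith    = Test-Utf8Bom $withBom
$hasBomWithout = Test-Utf8Bom $withoutBom
"with BOM:    $hasBomWith"
"without BOM: $hasBomWithout"

# Clean up the demo files.
Remove-Item -LiteralPath $withBom, $withoutBom
```

Running the same helper against output/podpis.htm after each variant of the write command shows which variant produced a BOM in your PowerShell version.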

See this answer for additional information.

[1] Specifically, each byte in a multi-byte UTF-8 encoding sequence representing a single non-ASCII-range character is misinterpreted as its own character, namely a character from the ANSI character set. You can reproduce this as follows, assuming that Windows-1252 is the active ANSI code page: [Text.Encoding]::GetEncoding(1252).GetString([Text.Encoding]::UTF8.GetBytes('ą')) - this yields Ä…, i.e. two (different) characters, as in your question.
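
The same reproduction can be sketched for both words from the question. The "Ĺ‚" in "WrocĹ‚aw" suggests that the active ANSI code page on the asker's system is Windows-1250 (Central European) rather than Windows-1252, so the sketch below assumes code page 1250; that is an assumption, not something the question states, and it also assumes that this code page is available in your session. The variable names are illustrative only.

```powershell
# Build the sample words from code points so that this script's own file
# encoding cannot affect the demonstration.
$aOgonek = [char]0x0105   # a with ogonek
$lStroke = [char]0x0142   # l with stroke
$words   = "S${aOgonek}d", "Wroc${lStroke}aw"

$utf8 = [Text.Encoding]::UTF8
# Assumption: the active ANSI code page is Windows-1250 (Central European).
$ansi = [Text.Encoding]::GetEncoding(1250)

$results = foreach ($word in $words) {
    $bytes   = $utf8.GetBytes($word)      # the bytes stored in the BOM-less UTF-8 file
    $misread = $ansi.GetString($bytes)    # what a reader that assumes ANSI makes of them
    [pscustomobject]@{
        Original  = $word
        Utf8Bytes = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
        Misread   = $misread
    }
}

$results | Format-Table -AutoSize
```

The Misread column should show the garbled forms from the question: each two-byte UTF-8 sequence turns into two separate ANSI characters.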
