In Windows PowerShell, the default character encoding when reading from / writing to[1] files is "ANSI", i.e., the legacy 8-bit code page implied by the active system locale.
(By contrast, PowerShell Core defaults to UTF-8.)
For instance, the code page associated with the system locale on a US-English system is 1252, i.e., Windows-1252, where code point 0x93 is the non-ASCII “ quotation mark.
However, once a text file's content has been read into memory, its characters are represented as UTF-16LE code units, i.e., as .NET [string] instances.
As a Unicode character, “ has code point U+201C, expressed as the single code unit 0x201c in UTF-16LE.
Therefore, because all strings in memory consist of UTF-16LE code units, what you need to replace is [char] 0x201c:
$q1 = [char] 0x201c # “
Get-ChildItem *.csv -Recurse | ForEach-Object {
  (Get-Content $_.FullName) -replace $q1, '""' | Set-Content $_.FullName
}
Note that Set-Content too uses the default character encoding, so the rewritten files will use "ANSI" encoding too; use the -Encoding parameter to change the output encoding, if desired.
Also note the (...) around the Get-Content call, which ensures that the input file is read into memory in full up front, which enables writing back to the same file in the same pipeline.
While this approach is convenient, note that it bears a slight risk of data loss if writing back to the input file is interrupted before completion.
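One way to reduce that risk is to write the modified content to a temporary file first and to replace the original only once writing has completed. The following is a minimal sketch of that idea, not a hardened implementation: the function name Update-FileContentSafely, its parameters, and the ".tmp" suffix are hypothetical, and it assumes that no file with the temporary name already exists.

```powershell
# Hypothetical helper: perform a regex replacement in a file via a temporary file,
# so that the original stays intact until the new content has been written in full.
function Update-FileContentSafely {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [string] $Pattern,
    [Parameter(Mandatory)] [AllowEmptyString()] [string] $Replacement
  )
  # Abort on the first error, so that the original is never replaced
  # by an incomplete temporary file.
  $ErrorActionPreference = 'Stop'
  $fullPath = Convert-Path -LiteralPath $Path
  $tempPath = "$fullPath.tmp" # assumption: this name is not in use
  # Write the modified lines to the temporary file, using the default encoding.
  (Get-Content -LiteralPath $fullPath) -replace $Pattern, $Replacement |
    Set-Content -LiteralPath $tempPath
  # Replace the original only after the temporary file is complete.
  Move-Item -LiteralPath $tempPath -Destination $fullPath -Force
}
```

With this helper, the body of the ForEach-Object call above could become Update-FileContentSafely -Path $_.FullName -Pattern $q1 -Replacement '""'. As before, the rewritten file uses the default character encoding.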
Converting an "ANSI" code point to a Unicode code point
The following shows how an "ANSI" (8-bit) code point such as 0x93 can be converted to its equivalent UTF-16 code point, 0x201c:
# Convert an array of "ANSI" code points (1 byte each) to the UTF-16
# string they represent.
# Note: In Windows PowerShell, [Text.Encoding]::Default contains
# the "ANSI" encoding set by the system locale.
$str = [Text.Encoding]::Default.GetString([byte[]] 0x93) # -> '“'
# Get the UTF-16 code points of the characters making up the string.
$codePoints = [int[]] [char[]] $str
# Format the first and only code point as a hex. number.
'0x{0:x}' -f $codePoints[0] # -> '0x201c'
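Since [Text.Encoding]::Default depends on the system locale (and on the PowerShell edition), the conversion can also be sketched with an explicitly named code page. The following variation assumes that Windows-1252 is the "ANSI" code page of interest; the variable names are chosen for illustration only. It also shows the reverse direction, from the Unicode character back to its "ANSI" byte.

```powershell
# Assumption: Windows-1252 is the "ANSI" code page in question; requesting it
# explicitly makes the result independent of the active system locale.
$ansi = [Text.Encoding]::GetEncoding(1252)

# "ANSI" byte -> UTF-16 code unit
$str = $ansi.GetString([byte[]] 0x93)
$unicodeHex = '0x{0:x}' -f [int] $str[0]
$unicodeHex # -> '0x201c'

# UTF-16 code unit -> "ANSI" byte
$byte = $ansi.GetBytes([string] [char] 0x201c)[0]
$ansiHex = '0x{0:x}' -f $byte
$ansiHex # -> '0x93'
```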
[1] Writing files with Set-Content, that is; using Out-File / >, by contrast, creates UTF-16LE ("Unicode") files. The cmdlets in Windows PowerShell display a bewildering array of differing encodings: see this answer. Fortunately, PowerShell Core now consistently defaults to (BOM-less) UTF-8.
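To check which encoding a cmdlet actually produced, you can inspect the bytes of the file it wrote. The following sketch writes a single “ character with Set-Content -Encoding Unicode to a temporary file and prints the first four bytes in hex. The file name is hypothetical. The Unicode value is used here because it yields UTF-16LE with a BOM in both editions, whereas UTF8 creates a BOM in Windows PowerShell but not in PowerShell Core.

```powershell
$q1 = [char] 0x201c # “
# Hypothetical sample file in the temporary directory.
$file = Join-Path ([IO.Path]::GetTempPath()) 'encoding-demo.txt'
# Write the character as UTF-16LE ("Unicode") rather than with the default encoding.
"$q1" | Set-Content -Encoding Unicode -LiteralPath $file
# Read the raw bytes and format the first four as hex:
# the BOM (ff fe), followed by the little-endian code unit 0x201c.
$bytes = [IO.File]::ReadAllBytes($file)
$hex = ($bytes[0..3] | ForEach-Object { '{0:x2}' -f $_ }) -join ' '
$hex # -> 'ff fe 1c 20'
Remove-Item -LiteralPath $file
```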