I have a strange problem. On Windows 7, I run the following command in PowerShell:

ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" > test.txt

When I then read the test.txt file:

ruby -E UTF-8 -e "puts gets" < test.txt

the result is:

�i0F0^0�0�0W0O0J0X�D0W0~0Y0,Mr Jason

I checked test.txt and found that the file's encoding is Unicode, not UTF-8.

What should I do?

How can I control the encoding of the output file when redirecting? Please help me.

Jason
  • Can you take a look at https://stackoverflow.com/questions/5163339/write-and-read-a-file-with-utf-8-encoding and see whether it helps? – Jad Jan 03 '23 at 12:42
  • Hmm, I saw that answer. In fact, the solution needs to combine a PowerShell command with Ruby. – Jason Jan 03 '23 at 16:01

1 Answer

tl;dr

Unfortunately, the solution (on Windows) is much more complicated than one would hope:

# Make PowerShell both send and receive data as UTF-8 when talking to
# external (native) programs.
# Note: 
#  * In *PowerShell (Core) 7+*, $OutputEncoding *defaults* to UTF-8.
#  * You may want to save and restore the original settings.
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new()
 
# Create a BOM-less UTF-8 file.
# Note: In *PowerShell (Core) 7+*, you can less obscurely use:
#   ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" | Set-Content test.txt
$null = New-Item -Force test.txt -Value (
  ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'"
)

# Pipe the resulting file back to Ruby as UTF-8, thanks to $OutputEncoding
# Note that PowerShell has NO "<" operator - stdin input must be provided
# via the pipeline.
Get-Content -Raw test.txt | ruby -E UTF-8 -e "puts gets"
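
To confirm which encoding actually ended up in a file, one option is to look at its first bytes for a byte-order mark (BOM). The following is only a minimal Ruby sketch: `bom_kind` is a hypothetical helper name, not part of the commands above, and it only distinguishes the three BOMs relevant here (a UTF-32LE BOM also starts with FF FE and is not handled). The samples are built in memory so that the sketch runs without any file.

```ruby
# Hypothetical helper (not from the answer above): classify a byte string
# by the byte-order mark (BOM) it starts with, if any.
def bom_kind(bytes)
  b = bytes.b  # binary copy, so that comparisons are byte-wise
  return 'UTF-8 with BOM'    if b.start_with?("\xEF\xBB\xBF".b)
  return 'UTF-16LE with BOM' if b.start_with?("\xFF\xFE".b)
  return 'UTF-16BE with BOM' if b.start_with?("\xFE\xFF".b)
  'no BOM'
end

# In-memory samples standing in for file contents.
sample = "\u3069\u3046\u305E"  # the first three characters of the test string
puts bom_kind("\xFF\xFE".b + sample.encode('UTF-16LE').b)
puts bom_kind("\xEF\xBB\xBF".b + sample.b)
puts bom_kind(sample.b)
```

For a real file, `puts bom_kind(File.binread('test.txt'))` applies the same check to the bytes on disk.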

  • In terms of character encoding, PowerShell communicates with external (native) programs via two settings that contain .NET System.Text.Encoding instances:

    • $OutputEncoding specifies the encoding to use to send data TO an external program via the pipeline.

    • [Console]::OutputEncoding specifies the encoding used to interpret (decode) data FROM an external program (its stdout stream); for decoding to work as intended, this setting must match the external program's actual output encoding.

  • As of PowerShell 7.3.1, PowerShell only "speaks text" when communicating with external programs, and an intermediate decoding and re-encoding step is invariably involved - even when you're just using > (effectively an alias of the Out-File cmdlet) to send output to a file.

    • That is, PowerShell's pipelines are NOT raw byte conduits the way they are in other shells.

      • See this answer for workarounds and potential future raw-byte support.
    • Whatever output operator (>) or cmdlet (Out-File, Set-Content) you use applies its own default character encoding, which is unrelated to the encoding of the original input: by the time the operator / cmdlet operates on the input, it has already been decoded into .NET strings.

      • > / Out-File in Windows PowerShell defaults to "Unicode" (UTF-16LE) encoding, which is what you saw.

      • While Out-File and Set-Content have an -Encoding parameter that allows you to control the output encoding, in Windows PowerShell they don't allow you to create BOM-less UTF-8 files; curiously, New-Item does create such files, which is why it is used above; if a UTF-8 BOM is acceptable, ... | Set-Content -Encoding utf8 will do in Windows PowerShell.

      • Note that, by contrast, PowerShell (Core) 7+, the modern, cross-platform edition, now thankfully consistently defaults to BOM-less UTF-8.

        • That said, with respect to [Console]::OutputEncoding on Windows, it still uses the legacy OEM code page by default as of v7.3.1, which means that UTF-8 output from external programs is by default misinterpreted - see GitHub issue #7233 for a discussion.
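
The decode-and-re-encode step described in the points above can be simulated in plain Ruby to show where the garbled output in the question comes from. This is a sketch under stated assumptions, not a reproduction of PowerShell's internals: it assumes the program's output was decoded correctly and then written as UTF-16LE with a BOM, which is what > does in Windows PowerShell. The variable names are illustrative only.

```ruby
# The first three characters of the test string plus its ASCII tail.
text = "\u3069\u3046\u305E,Mr Jason"

# What `ruby -E UTF-8` writes to stdout: UTF-8 bytes.
utf8_bytes = text.encode('UTF-8').b

# What ends up in the file after re-encoding (assumption: BOM + UTF-16LE).
utf16_bytes = "\xFF\xFE".b + text.encode('UTF-16LE').b

# Reading the file as if it were UTF-8 fails: the BOM bytes are not valid
# UTF-8, and each UTF-16LE code unit shows up as two unrelated bytes.
misread = utf16_bytes.dup.force_encoding('UTF-8')
puts misread.valid_encoding?     # false
puts utf16_bytes[2, 6].inspect   # "i0F0^0", as seen in the garbled output

# Decoding with the encoding that was actually used recovers the text.
recovered = utf16_bytes[2..-1].force_encoding('UTF-16LE').encode('UTF-8')
puts recovered == text           # true
puts utf8_bytes.bytesize, utf16_bytes.bytesize
```

The `i0F0^0` fragment is simply the little-endian byte pairs 69 30, 46 30, 5E 30 of the first three characters, displayed as ASCII.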
mklement0
  • Thank you very much. By the way, in PowerShell v2 I ran echo "すみません" | ruby -e "puts gets", but the output is gibberish like "??????". What should I do? – Jason Jan 04 '23 at 03:17
  • Glad to hear it helped, @Jason; given that you've since also accepted this [related answer](https://stackoverflow.com/a/74996706/45375), which explains the `??????` problem, I assume your follow-up question has been answered as well. – mklement0 Jan 04 '23 at 03:37