
I have a problem with Ruby (1.9.3) and PowerShell.

I need to write an interactive console app that handles sentences in Polish. With some help I can now retrieve ARGV elements containing Polish diacritics, but standard input doesn't work the way I want it to.

Code illustration:

# encoding: UTF-8
target = ARGV[0].dup.force_encoding('CP1250').encode('UTF-8')
puts "string constant = dupą"
puts "dupą".bytes.to_a.to_s
puts "dupą".encoding

puts "target = " + target
puts target.bytes.to_a.to_s
puts target.encoding
puts target.eql? "dupą"

STDIN.set_encoding("CP1250", "UTF-8") 
# the line above changes nothing, it can be removed and the result is still the same
# I obviously wanted to mimic the ARGV solution

target2 = STDIN.gets
puts "target2 = " + target2
puts target2.bytes.to_a.to_s
puts target2.encoding
puts target2.eql? "dupą"

The output:

string constant = dupą
[100, 117, 112, 196, 133]
UTF-8
target = dupą
[100, 117, 112, 196, 133]
UTF-8
true
dupą //this is fed to STDIN.gets
target2 = dup
[100, 117, 112]
UTF-8
false

Apparently Ruby never receives the fourth character from STDIN.gets, nor anything after it. If I type a longer string, such as dupąlalala, still only the first three bytes arrive in the program.

  • I've tried enumerating the bytes and looping with getc, but the missing bytes never seem to reach Ruby (where are they lost?)
  • I've run chcp 65001 (it doesn't seem to change anything)
  • I've set my $OutputEncoding to [Console]::OutputEncoding; it now looks like this:

     IsSingleByte      : True
     BodyName          : ibm852
     EncodingName      : Środkowoeuropejski (DOS)
     HeaderName        : ibm852 
     WebName           : ibm852
     WindowsCodePage   : 1250
     IsBrowserDisplay  : True
     IsBrowserSave     : True
     IsMailNewsDisplay : False
     IsMailNewsSave    : False
     EncoderFallback   : System.Text.InternalEncoderBestFitFallback
     DecoderFallback   : System.Text.InternalDecoderBestFitFallback
     IsReadOnly        : True
     CodePage          : 852
    
  • I'm using the Consolas font
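
One thing worth checking in the setup above: the console reports code page 852, while the script decodes with CP1250, and the two code pages store "ą" as different bytes. The following is a minimal Ruby sketch of that difference, not a confirmed fix for the truncated input. The helper name `decode_console_bytes` is hypothetical, and the byte values are typed in by hand rather than read from a real console.

```ruby
# encoding: UTF-8

# Hypothetical helper (not from the original post): treat raw bytes as text
# in the given code page and transcode them to UTF-8.
def decode_console_bytes(raw, codepage)
  raw.dup.force_encoding(codepage).encode('UTF-8')
end

# "dupą" as raw bytes in the two code pages involved here.
cp852_bytes  = [100, 117, 112, 0xA5].pack('C*')  # "ą" is 0xA5 in CP852
cp1250_bytes = [100, 117, 112, 0xB9].pack('C*')  # "ą" is 0xB9 in CP1250

from_852  = decode_console_bytes(cp852_bytes, 'CP852')
from_1250 = decode_console_bytes(cp1250_bytes, 'CP1250')
# Decoding CP852 bytes as CP1250 silently yields a different letter.
wrong     = decode_console_bytes(cp852_bytes, 'CP1250')

puts from_852.bytes.to_a.inspect   # [100, 117, 112, 196, 133]
puts from_1250.bytes.to_a.inspect  # [100, 117, 112, 196, 133]
puts wrong == "dupą"               # false (0xA5 is "Ą" in CP1250)
```

If the bytes did reach Ruby, a mismatch like this would produce a wrong letter rather than a missing one, so it probably does not explain the lost bytes by itself. It only shows why the code page passed to force_encoding or STDIN.set_encoding has to match what the console actually sends.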

How do I read Polish diacritics properly in PowerShell?

the Tin Man
SY.
  • Does it work when called from a non-PowerShell cmd.exe? – Damian Powell Jun 02 '12 at 13:03
  • I know this question is very old, but anyway: this did not work in plain cmd.exe. That same example (with encodings changed, of course) works in Linux. I have rebuilt my project to use files instead of standard input. – SY. Mar 20 '13 at 08:16
  • I know this question is very old, but anyway: WE DEMAND AN ANSWER! ;) – Henrik Apr 29 '13 at 12:00

2 Answers


I found out some relevant info. Not sure it's exactly the right info. But, hey, OP already got another solution.

# Get "encoding" for code page 1250 (Central European)
$en=[System.Text.Encoding]::GetEncoding(1250)
# Looks like this:
IsSingleByte      : True
BodyName          : iso-8859-2
EncodingName      : Central European (Windows)
HeaderName        : windows-1250
WebName           : windows-1250
WindowsCodePage   : 1250
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1250

# Change STDIN's input encoding
[console]::InputEncoding=$en
$x = Read-Host 
# I typed in dupą
#  (I set Polish in the Language Bar.
#   The final letter is the apostrophe key on a US English keyboard)
[int[]][char[]]$x
# output is: 100 117 112 261 (in hex: 64 75 70 105)
# the final character (261, hex 105) is "Latin Small Letter A with Ogonek"
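
For comparison with the byte dumps in the question, here is a small Ruby sketch (assuming a UTF-8 source file, as in the question's script) that shows the same code points PowerShell printed next to the UTF-8 bytes Ruby reports. The variable names are illustrative only.

```ruby
# encoding: UTF-8

word = "dupą"

# Code points, as in the PowerShell [int[]][char[]] cast above.
code_points = word.codepoints.to_a
hex         = code_points.map { |cp| cp.to_s(16) }.join(' ')

puts code_points.inspect      # [100, 117, 112, 261]
puts hex                      # 64 75 70 105
# UTF-8 stores code point 261 (U+0105) as the two bytes 196, 133.
puts word.bytes.to_a.inspect  # [100, 117, 112, 196, 133]
```

So the 261 seen in PowerShell and the 196, 133 pair seen in Ruby describe the same character; one is the code point and the other is its UTF-8 encoding.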
Χpẘ
  • I used this method to capture the current encoding and change it to 1250, because the output from Win32 NetDfsEnum gets mangled in PowerShell. I should note that the same data retrieved with the Dfsn cmdlets is not affected, but they are too slow for my purposes. – sean_m Nov 01 '14 at 09:17

.NET 4.x expects and creates a Byte Order Mark (BOM) on stdin when CHCP 65001 (UTF-8) is active.

This appears to be fixed in .NET Core, but in 4.x you have to change Console.StandardInputEncoding to communicate properly with child processes that don't make the same assumption.
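
On the Ruby side, a child process could defend against such a BOM by dropping it from the first line it reads. This is only a sketch under the assumption that the parent really does send a UTF-8 BOM as described above and that the input is already tagged as UTF-8; `strip_utf8_bom` is a hypothetical helper name, and the sample string stands in for a line read from STDIN.

```ruby
# encoding: UTF-8

# Hypothetical helper: remove a leading U+FEFF (the UTF-8 BOM) if present.
# Assumes `text` is a UTF-8 encoded string.
def strip_utf8_bom(text)
  text.sub(/\A\uFEFF/, '')
end

# Stand-in for the first line received on stdin from a BOM-writing parent.
with_bom = "\uFEFFdupą\n"
clean    = strip_utf8_bom(with_bom).chomp

puts clean.bytes.to_a.inspect  # [100, 117, 112, 196, 133]
```

In a real script the call would wrap the first STDIN.gets result only, since a BOM can appear only at the very start of the stream.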

Bert Huijben