Perl unicode support from console (@ARGV) /Windows/

Question

I'm trying to read Unicode characters passed as arguments to a Perl script:

C:\>perl test.pl ö

#----
# test.pl
#----
#!/usr/bin/perl
use warnings;
use strict;

my ($name, $number) = @ARGV;

if (not defined $name) {
    die "Need name\n";
}

if (defined $number) {
    print "Save '$name' and '$number'\n";
    # save name/number in database
    exit;
}

if ($name eq 'ö') {
    print "Fetch umlaut 'oe'\n";
} elsif ($name eq 'o') {
    print "Fetch simple 'o'\n";
} else {
    print "Fetch other '$name'\n";
}

print "ü";

and I get the output:

Fetch simple 'o'
ü

I've tested the same algorithm in Python 3 and it works: I get "ö". Evidently Perl needs something more that I must add or set. It doesn't matter whether I use Strawberry Perl or ActiveState Perl; I get the same result.

Thanks in advance!

score 4 · Answer 1 · answered Oct 06 '16 at 18:07

#!/usr/bin/perl

use strict;
use warnings;

my $encoding_in;
my $encoding_out;
my $encoding_sys;
BEGIN {
    require Win32;

    $encoding_in  = 'cp' . Win32::GetConsoleCP();
    $encoding_out = 'cp' . Win32::GetConsoleOutputCP();
    $encoding_sys = 'cp' . Win32::GetACP();

    binmode(STDIN,  ":encoding($encoding_in)");
    binmode(STDOUT, ":encoding($encoding_out)");
    binmode(STDERR, ":encoding($encoding_out)");
}

use Encode qw( decode );

{
    my ($name, $number) = map { decode($encoding_sys, $_) } @ARGV;

    if (not defined $name) {
        die "Need name\n";
    }

    if (defined $number) {
        print "Save '$name' and '$number'\n";
        # save name/number in database
        exit;
    }

    if ($name eq 'ö') {
        print "Fetch umlaut 'oe'\n";
    } elsif ($name eq 'o') {
        print "Fetch simple 'o'\n";
    } else {
        print "Fetch other '$name'\n";
    }

    print "ü";
}

Also, you should add `use feature qw( unicode_strings );` and/or encode your file using UTF-8 and add `use utf8;`.
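To illustrate why `use utf8;` matters for the `$name eq 'ö'` comparison, here is a minimal sketch, assuming the source file is saved as UTF-8. Without `use utf8;`, Perl reads the literal 'ö' as two separate bytes, which can never equal the single decoded character. The escapes below stand in for what the parser sees in each case, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The literal 'ö' in a UTF-8 source file WITHOUT "use utf8": two bytes.
my $literal_as_bytes = "\xC3\xB6";

# The same literal WITH "use utf8": one character, U+00F6.
my $literal_as_char = "\x{F6}";

# An argument after decoding (UTF-8 is assumed here purely for illustration).
my $decoded_arg = decode('UTF-8', "\xC3\xB6");

printf "bytes literal: length %d, matches decoded arg: %s\n",
    length($literal_as_bytes),
    ($decoded_arg eq $literal_as_bytes ? 'yes' : 'no');
printf "char literal:  length %d, matches decoded arg: %s\n",
    length($literal_as_char),
    ($decoded_arg eq $literal_as_char ? 'yes' : 'no');
```

Only the character form compares equal to the decoded argument, which is why decoding @ARGV and declaring the source encoding go together.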

answered Oct 06 '16 at 18:07

ikegami

Do you have an opinion regarding `Encode::Locale`? It seems to replace a handful of the lines of code you've got there. – tjd Oct 06 '16 at 19:20
@tjd, Looks like you could indeed use that. Feel free to post an answer, and I'll upvote and possibly even delete mine! – ikegami Oct 06 '16 at 19:29
It doesn't work. I only get "Fetch simple 'o'" and "ü". Apparently "ö" is automatically converted to "o". This is perl 5, version 24, subversion 0 (v5.24.0) built for MSWin32-x64-multi-thread. – Oct 08 '16 at 09:37
This has been tested. Can you provide the values of the encoding variables the program populates? – ikegami Oct 08 '16 at 17:56
@ikegami I can change the console encodings, but the system encoding stays the same. Either: Encoding in: cp866, Encoding out: cp866, Encoding sys: cp1251; or: Encoding in: cp65001, Encoding out: cp65001, Encoding sys: cp1251. – Oct 09 '16 at 08:03
Each system call that exchanges text has an "A" ("ANSI") and a "W" ("Wide") variant. Perl is using the "A" variant to get the command-line args. I believe the "ANSI" encoding is set by your system language. In your case, it's cp1251, which supports Cyrillic letters where cp1252 supports accented Latin letters. You could "simply" call the "W" variant of the call yourself to get the args encoded using UTF-16le. I recommend a new question for that. (You need to figure out the name of the system call, then you would use Win32::API to call it ...unless someone's already written a module to do so.) – ikegami Oct 10 '16 at 01:10
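The code-page point in the last comment can be checked from Perl itself. The following is only a sketch: `representable` is a hypothetical helper, and it uses Perl's own Encode tables rather than the Windows conversion routines. It shows that U+00F6 (ö) has no slot in cp1251 but does in cp1252. Encode fails outright in strict mode, whereas Windows appears to substitute a best-fit character, which would explain the plain "o" reported above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $char can be encoded in code page $cp.
sub representable {
    my ($cp, $char) = @_;
    my $copy = $char;   # encode() may modify its input when CHECK is set
    return defined eval { Encode::encode($cp, $copy, Encode::FB_CROAK()) };
}

my $o_umlaut = "\x{F6}";   # ö
for my $cp (qw( cp1251 cp1252 )) {
    printf "%s: %s\n", $cp,
        representable($cp, $o_umlaut) ? 'can encode U+00F6'
                                      : 'cannot encode U+00F6';
}
```

If the system ("ANSI") code page cannot represent a character, decoding @ARGV afterwards cannot bring it back, because the information is already lost before Perl sees the argument.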

tjd · Answer 2 · 2016-10-10T13:16:23.423

In addition to ikegami's fine answer, I'm a fan of the Encode::Locale module, which automatically creates aliases for the current console's code pages. It works well with Win32, OS X & other flavors of *nix.

#!/usr/bin/perl

use strict;
use warnings;

# These two lines make life better when you leave the world of ASCII
# Just remember to *save* the file as UTF8....
use utf8;
use feature 'unicode_strings';

use Encode::Locale 'decode_argv';         # We'll use the console_in & console_out aliases as well as decode_argv().
use Encode;

binmode(STDIN,  ":encoding(console_in)");
binmode(STDOUT, ":encoding(console_out)");
binmode(STDERR, ":encoding(console_out)");

decode_argv( );   # Decode ARGV in place
my ($name, $number) = @ARGV;

if (not defined $name) {
    die "Need name\n";
}

if (defined $number) {
    print "Save '$name' and '$number'\n";
    # save name/number in database
    exit;
}

if ($name eq 'ö') {
    print "Fetch umlaut 'oe'\n";
} elsif ($name eq 'o') {
    print "Fetch simple 'o'\n";
} else {
    print "Fetch other '$name'\n";
}

print "ü";

Perhaps it's only syntactic sugar, but it makes easy reading and promotes cross-platform compatibility.

edited Oct 10 '16 at 13:16

answered Oct 06 '16 at 19:49

tjd

Undefined subroutine &main::decode_argv called at vidCmpr2.pl line 18. – Oct 08 '16 at 09:42
@Banish You're right. I forgot to import that symbol in the use line. Thanks for the reminder. – tjd Oct 10 '16 at 13:17
Yes, I had to import the package. Now it compiles successfully but that's all. I've tested it changing the console code page (chcp 1252) or even with perl6. It parses "ö" as simple "o". Python and java parse it without problems. – Oct 10 '16 at 20:20
PS: Sorry. I'm wrong. It works only on python. It doesn't work on java. – Oct 10 '16 at 21:48
@Banish The last step is to make sure the strings you are comparing are in the same (Normalized) form. – tjd Oct 11 '16 at 02:59
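A minimal sketch of the normalization point in the last comment, using the core Unicode::Normalize module: "ö" can arrive either as one precomposed code point or as "o" followed by a combining diaeresis, and the two only compare equal after both are brought to the same normalization form. Whether the console actually delivers the decomposed form in this case is not established by the thread.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

my $precomposed = "\x{F6}";     # ö as a single code point
my $decomposed  = "o\x{308}";   # o followed by COMBINING DIAERESIS

print "raw comparison: ",
    ($precomposed eq $decomposed ? 'equal' : 'different'), "\n";
print "after NFC:      ",
    (NFC($precomposed) eq NFC($decomposed) ? 'equal' : 'different'), "\n";
```

Applying NFC to each decoded argument before the `eq` tests removes this source of mismatches.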

circulosmeos · Answer 3 · 2018-11-14T19:06:34.883

I think the code in the other answers is on the right track but not complete:

Done that way, it is very complicated to write a script that accounts for every code page plus the source file's encoding, and it is even harder to make it portable: ö may be familiar to users of the Latin alphabet, but の or 렌 also exist...
The scripts may run fine with characters in a particular code page, but they will fail with characters outside it (which is probably what happened to some users in the comments). Note that Windows code pages predate Unicode.
The fundamental problem is that Perl 5 for Windows is not compiled with Unicode support as Windows understands it: it is just a port of the Linux code, and so almost all Unicode characters are mangled before they even reach the Perl code.

A longer technical explanation (and a C patch!) is provided by A. Sinan Unur's page Fixing Perl's Unicode problems on the command line on Windows: A trilogy in N parts (under Artistic License 2.0).

So (but not for the faint of heart): recompiling perl.exe is possible, and the result is almost fully Unicode compliant on Windows. Hopefully the patches will be integrated into the source code some day... Until then, I've summarized some detailed instructions for patching perl.exe here.

Note also that a proper command console with full Unicode support is needed. A quick solution is to use ConEmu, but Windows' cmd.exe could also work after some heavy tweaks.

score 0 · Answer 4 · answered Jun 26 '19 at 09:06

I don't know if this is the solution for every scenario, but I could get away with passing the "-CAS" switch when calling my script.

Example:

Script_1:

use strict;
use utf8;

$|++; # Prevent buffering issues


my ($arg) = @ARGV;
save_to_file('test.txt', $arg);

sub save_to_file {

    my ($filename, $content) = @_;

    # Open for writing, encoding characters as UTF-8 on output
    open(my $fh, '>:encoding(UTF-8)', $filename) or die "Can't open > $filename: $!";
    print $fh $content;
    close $fh;

    return;
}

Script_2 calling 1:

use strict;
use utf8;


execute_command();

sub execute_command {


    my $command = "perl -CAS simple_utf_string.pl äääöööü";

    # Execute command
    print "The command to run is: $command\n";
    open my $command_pipe, "-|:encoding(UTF-8)", $command or die "Pipe from $command failed: $!";
    while (<$command_pipe>) {
        print  $_;
    }
}

Result in test.txt:

äääöööü
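For reference, in the -C switch the letter A means the elements of @ARGV are expected to be UTF-8, and S covers STDIN, STDOUT and STDERR. A rough in-script counterpart might look like the sketch below. It assumes the arguments really do arrive as UTF-8 bytes, which the comments on the earlier answers suggest is not guaranteed on Windows, and it uses the stricter :encoding(UTF-8) layer instead of whatever layer the switch applies internally.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Roughly what -CS does: treat the three standard handles as UTF-8.
binmode($_, ':encoding(UTF-8)') for \*STDIN, \*STDOUT, \*STDERR;

# Roughly what -CA does: decode each argument from UTF-8 bytes to characters.
my @args = map { decode('UTF-8', $_) } @ARGV;

printf "%d argument(s)\n", scalar @args;
printf "  '%s' has %d character(s)\n", $_, length $_ for @args;
```

Doing this inside the script means callers do not have to remember the switch, at the cost of hard-coding UTF-8 as the assumed argument encoding.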
