I am trying to pass Perl a command-line argument that contains the Unicode character "right single quotation mark" (U+2019; decimal 8217, hex \x{2019}).
Perl is not receiving the character correctly. Let me show you the details:
The Perl script follows (we'll call it test.pl):
use warnings;
use strict;
use v5.32;
use utf8; # Some UTF-8 chars are present in the code's comments
# Get the first argument
my $arg=shift @ARGV or die 'This script requires one argument';
# Get some env vars with sensible defaults if absent
my $lc_all=$ENV{LC_ALL} // '{unset}';
my $lc_ctype=$ENV{LC_CTYPE} // '{unset}';
my $lang=$ENV{LANG} // '{unset}';
# Determine the current Windows code page
my ($active_codepage)=`chcp 2>NUL`=~/: (\d+)/;
# Our environment
say "ENV: LC_ALL=$lc_all LC_CTYPE=$lc_ctype LANG=$lang";
say "Active code page: $active_codepage"; # Note: 65001 is UTF-8
# Saying the wrong thing, expected: 0’s #### Note: Between the '0' and the 's'
# is a "right single quotation mark" and should be in utf-8 =>
# Decimal: 8217 Hex: \x{2019}
# For some strange reason the bytes "\x{2019}" are coming in as "\x{92}"
# which is the single-byte CP1252 representation of the character "right
# single quotation mark"
# The whole workflow is UTF-8, so I don't know where there is a CP1252
# translation of the input argument (outside of Perl that is)
# Display the value of the argument and its length
say "Argument: $arg length: ",length($arg);
# Display the bytes that make up the argument's string
print("Argument hex bytes:");
for my $chr_idx (0 .. length($arg)-1)
{
print sprintf(' %02x',ord(substr($arg,$chr_idx,1)));
}
say ''; # Newline
I run the Perl script as follows:
V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s
Output:
ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Argument: 0s length: 3
Argument hex bytes: 30 92 73
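To double-check the claim in the script's comments that 92 is the CP1252 form of the character, here is a minimal sketch that assumes only the core Encode module. The helper name hex_bytes is hypothetical and not part of test.pl. It decodes the three bytes the script received as CP1252 and re-encodes them as UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: format a byte string as space-separated hex pairs
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $cp1252_bytes = "0\x92s";                         # the bytes test.pl reported
my $characters   = decode('cp1252', $cp1252_bytes);  # bytes -> characters
my $utf8_bytes   = encode('UTF-8', $characters);     # characters -> UTF-8 bytes

printf "Code point of 2nd char: U+%04X\n", ord(substr($characters, 1, 1));
print  "As CP1252: ", hex_bytes($cp1252_bytes), "\n";
print  "As UTF-8:  ", hex_bytes($utf8_bytes),   "\n";
```

The second character decodes to U+2019, and its UTF-8 form is the three bytes e2 80 99. Had the argument arrived as UTF-8, the dump would have shown five bytes rather than three.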
OK, perhaps we also need to specify UTF-8 for everything (stdin/stdout/stderr and command-line args)?
V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s
Output:
ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73
OK, let's try completely removing all LC*/LANG env vars, resulting in:
@SET LC_ALL=
@SET LANG=
@REM Proof that everything has been cleared
@REM Note: The caret before the vertical bar escapes it,
@REM because I have grep set up to run through a
@REM batch file and need to forward args
@set | grep -iP "LC^|LANG" || echo %errorlevel%
Output:
1
Let's try executing the script again, with UTF-8:
V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s
Output (no change, other than that the LC*/LANG env vars have been cleared):
ENV: LC_ALL={unset} LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73
At this point, I decided to go outside of Perl and see what Windows 10 itself is doing with my command-line argument. A while back I wrote a little C# utility that helps troubleshoot command-line argument issues, so I used that to test. The output should be self-explanatory:
V:\videos>ShowArgs 0’s
Filename: |ShowArgs.exe|
Pathname: |c:\bin\ShowArgs.exe|
Work dir: |V:\videos|
Command line: ShowArgs 0’s
Raw command line characters:
000: |ShowArgs |: S (083:53) h (104:68) o (111:6F) w (119:77) A (065:41) r (114:72) g (103:67) s (115:73) (032:20) (032:20)
010: |0’s |: 0 (048:30) ’ (8217:2019) s (115:73)
Command line args:
00: |0’s|
This shows several things:
- The argument passed in does not need to be quoted (I didn't think it would)
- The argument is being correctly passed in, in UTF-8 to the application by Windows
I can't for the life of me figure out why Perl is not receiving the argument as UTF-8 at this point.
Of course, as an absolute hack, if I were to throw in the following at the bottom of my Perl script, the issue would be resolved. But I would like to understand why Perl is not receiving the argument as UTF-8:
# ... Appended to original script shown at top ...
use Encode qw(encode decode);
sub recode
{
return encode('UTF-8', decode( 'cp1252', $_[0] ));
}
say "\n@{['='x60]}\n"; # Output separator
say "Original arg: $arg";
say "After recoding CP1252 -> UTF-8: ${\recode($arg)}";
Script execution:
V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s
New output:
ENV: LC_ALL=en_US.UTF-8 LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 0030 0092 0073
============================================================
Original arg: 0s
After recoding CP1252 -> UTF-8: 0’s
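The hack above hard-codes cp1252. A slightly more general sketch, assuming the argument bytes arrive in the process's ANSI code page and that the Win32 module (with its Win32::GetACP function) is available, would ask Windows for that code page instead. The helpers decode_arg and ansi_codepage are hypothetical names, and the fallback to 1252 on other platforms is only there so the sketch can run anywhere:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw argument bytes using a numeric
# Windows code page (65001 is Windows' identifier for UTF-8)
sub decode_arg {
    my ($bytes, $codepage) = @_;
    my $encoding = $codepage == 65001 ? 'UTF-8' : "cp$codepage";
    return decode($encoding, $bytes);
}

# Hypothetical helper: ask Windows for the ANSI code page; elsewhere
# fall back to 1252 (an assumption made purely for demonstration)
sub ansi_codepage {
    return 1252 unless $^O eq 'MSWin32';
    require Win32;
    return Win32::GetACP();
}

my $acp  = ansi_codepage();
my @args = map { decode_arg($_, $acp) } @ARGV;

# Emit decoded characters as UTF-8 on the way out
binmode STDOUT, ':encoding(UTF-8)';
printf "%s (%d chars)\n", $_, length($_) for @args;
```

Unlike the recode hack, this leaves the arguments as decoded character strings and encodes only on output, so length() counts characters rather than bytes. Note that chcp reports the console code page, which is a separate setting from the ANSI code page queried here.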
UPDATE
I built a simple C++ test app to get a better handle on what is happening.
Here is the source code:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>
int main(int argc, const char *argv[])
{
if (argc!=2)
{
std::cerr << "A single command line argument is required\n";
return 1;
}
const char *arg=argv[1];
std::size_t arg_len=strlen(arg);
// Display argument as a string
std::cout << "Argument: " << arg << " length: " << arg_len << '\n';
// Display argument bytes
// Fill with leading zeroes
auto orig_fill_char=std::cout.fill('0');
std::cout << "Bytes of argument, in hex:";
std::cout << std::hex;
for (std::size_t arg_idx=0; arg_idx<arg_len; ++arg_idx)
{
// Note: The cast to uint16_t is necessary because uint8_t is formatted
// "specially" (i.e., still as a char and not as an int)
// The cast through uint8_t is necessary due to sign extension of
// the original char if going directly to uint16_t and the (signed) char
// value is negative.
// I could have also masked off the high byte after the cast, with
// insertion code like (Note: Parens required due to precedence):
// << (static_cast<uint16_t>(arg[arg_idx]) & 0x00ff)
// As they say back in Perl-land, "TMTOWTDI!", and in this case it
// amounts to the C++ version of Perl "line noise" no matter which
// way you slice it. :)
std::cout << ' '
<< std::setw(2)
<< static_cast<uint16_t>(static_cast<uint8_t>(arg[arg_idx]));
}
std::cout << '\n';
// Restore the original fill char and go back to decimal mode
std::cout << std::setfill(orig_fill_char) << std::dec;
}
Built as a 64-bit console application with the MBCS character set setting, the above code was run with:
testapp.exe 0’s
and produced the following output:
Argument: 0s length: 3
Bytes of argument, in hex: 30 92 73
So, it is Windows, after all, at least in part. I need to build a UNICODE character set version of this app and see what I get.
Final Update on How to Fix This Once and for All
Thanks to Eryk Sun's comments to ikegami's accepted answer and links in that answer, I have found the best solution, at least with regard to Windows 10. I will now outline the specific steps to follow to force Windows to send command-line args into Perl as UTF-8:
A manifest needs to be added to both perl.exe and wperl.exe (if you use that), which tells Windows to use UTF-8 as the active code page (ACP) when executing the perl.exe application. This will tell Windows to pass command line arguments into perl as UTF-8 instead of CP1252.
Changes that Need to be Made
Create the manifest file(s)
Go to the directory containing your perl.exe (and wperl.exe), i.e. the ...\bin directory, and create a file there named perl.exe.manifest with the following contents:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
<assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
<application>
<windowsSettings>
<activeCodePage
xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
>UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
If you also want to modify wperl.exe, copy the above file perl.exe.manifest to wperl.exe.manifest and modify the copy, replacing the assemblyIdentity line:
<assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
with the following (note that the value of the name attribute changes from perl.exe to wperl.exe):
<assemblyIdentity type="win32" name="wperl.exe" version="6.0.0.0"/>
Embed the Manifests in the Executable(s)
The next step is to take the manifest file(s) we just created and embed them in their respective executable(s). Before doing this, be sure to back up the original executables, just in case!
The manifest(s) can be embedded into the executable(s) as follows:
For perl.exe:
mt.exe -manifest perl.exe.manifest -outputresource:perl.exe;#1
For wperl.exe (optional, needed only if you use wperl.exe):
mt.exe -manifest wperl.exe.manifest -outputresource:wperl.exe;#1
If you don't already have the mt.exe executable, it is included in the Windows 10 SDK, which can be downloaded from developer.microsoft.com.
Rudimentary Testing and Usage
After making the above changes, UTF-8 command line args become super easy!
Take the following script, simple-test.pl:
use strict;
use warnings;
use v5.32; # Or whatever recent version of Perl you have
# Helper subroutine to provide simple hex table output formatting
sub hexdump
{
my ($arg)=@_;
my $bytes_per_line=16; # Output 16 hex pairs per line
for my $chr_idx (0 .. length($arg)-1)
{
# Start a new line, prefixed with the hex offset, every 16 bytes
# (double quotes are needed so that \n is an actual newline)
printf("\n %02x: ", $chr_idx)
if $chr_idx % $bytes_per_line == 0;
printf('%02x ',ord(substr($arg,$chr_idx,1)));
}
say '';
}
# Test app code that makes no mention of Windows, ACPs, or UTF-8 outside
# of stuff that is printed. Other than the call out to chcp to get the
# active code page for informational purposes, it is not particularly tied
# to Windows, either, as long as whatever environment it is run on
# passes the script its arg as UTF-8, of course.
my $arg=shift @ARGV or die 'No argument present';
say "Argument: $arg";
say "Argument byte length: ${\length($arg)} bytes";
print 'Argument UTF-8 data bytes in hex:';
hexdump($arg);
Let's test our script, making sure that we are in the UTF-8 code page (65001):
v:\videos>chcp 65001 && perl.exe simple-test.pl "Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8"
Output (assuming your console font can handle the special chars):
Active code page: 65001
Argument: Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8
Argument byte length: 54 bytes
Argument UTF-8 data bytes in hex:
00: d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20
10: f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d
20: c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67
30: 20 55 54 46 2d 38
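The 54 reported above is a byte count, because the script never decodes its argument. As a cross-check, here is a minimal sketch, assuming only the core Encode module, that packs the hex dump back into bytes and decodes it as UTF-8 to count characters. The helper name bytes_from_hex is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex dump copied from the output above, one string per output line
my @dump_lines = (
    'd0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20',
    'f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d',
    'c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67',
    '20 55 54 46 2d 38',
);

# Hypothetical helper: turn space-separated hex pairs into a byte string
sub bytes_from_hex {
    my ($hex) = @_;
    $hex =~ s/\s+//g;          # drop the spaces between pairs
    return pack 'H*', $hex;    # two hex digits -> one byte
}

my $bytes = bytes_from_hex(join ' ', @dump_lines);
my $text  = decode('UTF-8', $bytes);   # UTF-8 bytes -> characters

printf "%d bytes, %d characters\n", length($bytes), length($text);
```

This should print 54 bytes, 38 characters. The difference comes from the Cyrillic letters and à (two bytes each), the right single quotation mark (three bytes), and the two four-byte characters on either side of it. With the manifest in place, running the script with perl -CA, which tells Perl to treat @ARGV as UTF-8, should presumably make length($arg) report the character count instead.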
I hope that my solution will help others that run into this issue.