4

I am trying to pass Perl a command-line argument that contains the Unicode character "right single quotation mark" (decimal 8217, hex \x{2019}).

Perl is not receiving the character correctly. Let me show you the details:

Perl Script follows (we'll call it test.pl):

use warnings;
use strict;
use v5.32;
use utf8; # Some UTF-8 chars are present in the code's comments

# Get the first argument
my $arg=shift @ARGV or die 'This script requires one argument';

# Get some env vars with sensible defaults if absent
my $lc_all=$ENV{LC_ALL} // '{unset}';
my $lc_ctype=$ENV{LC_CTYPE} // '{unset}';
my $lang=$ENV{LANG} // '{unset}';

# Determine the current Windows code page
my ($active_codepage)=`chcp 2>NUL`=~/: (\d+)/;

# Our environment
say "ENV: LC_ALL=$lc_all LC_CTYPE=$lc_ctype LANG=$lang";
say "Active code page: $active_codepage"; # Note: 65001 is UTF-8

# Saying the wrong thing, expected: 0’s    #### Note: Between the '0' and the 's'
#   is a "right single quotation mark" and should be in utf-8 => 
#   Decimal: 8217 Hex: \x{2019}
# For some strange reason the bytes "\x{2019}" are coming in as "\x{92}" 
#   which is the single-byte CP1252 representation of the character "right 
#   single quotation mark"
# The whole workflow is UTF-8, so I don't know where there is a CP1252 
#   translation of the input argument (outside of Perl that is)

# Display the value of the argument and its length
say "Argument: $arg length: ",length($arg);

# Display the bytes that make up the argument's string
print("Argument hex bytes:");
for my $chr_idx (0 .. length($arg)-1)
{
  print sprintf(' %02x',ord(substr($arg,$chr_idx,1)));
}
say ''; # Newline

I run the Perl script as follows:

V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s

Output:

ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Argument: 0s length: 3
Argument hex bytes: 30 92 73

OK, perhaps we also need to specify UTF-8 everything (stdin/out/err and command line args)?

V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s

Output:

ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

OK, let's try completely removing all LC*/LANG env vars, using:

@SET LC_ALL=
@SET LANG=

@REM Proof that everything has been cleared
@REM Note: The caret before the vertical bar escapes it,
@REM       because I have grep set up to run through a
@REM       batch file and need to forward args
@set | grep -iP "LC^|LANG" || echo %errorlevel%

Output:

1

Let's try executing the script again, with UTF-8:

V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s

Output (no change, other than that the LC*/LANG env vars have been cleared):

ENV: LC_ALL={unset} LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

At this point, I decided to go outside of Perl and see what Windows 10 itself is doing with my command line argument. I used a little utility that I wrote in C# a while back to help troubleshoot command line argument issues. The output should be self-explanatory:

V:\videos>ShowArgs 0’s

Filename: |ShowArgs.exe|
Pathname: |c:\bin\ShowArgs.exe|
Work dir:  |V:\videos|

Command line: ShowArgs  0’s

Raw command line characters:

000: |ShowArgs  |: S (083:53) h (104:68) o (111:6F) w (119:77) A (065:41) r (114:72) g (103:67) s (115:73)   (032:20)   (032:20)
010: |0’s       |: 0 (048:30) ’ (8217:2019) s (115:73)

Command line args:

00: |0’s|

This shows several things:

  1. The argument passed in does not need to be quoted (I didn't think it would)
  2. The argument is being correctly passed in, in UTF-8 to the application by Windows

I can't for the life of me figure out why Perl is not receiving the argument as UTF-8 at this point.

Of course as an absolute hack, if I was to throw in the following at the bottom of my Perl script, the issue would get resolved. But I would like to understand why Perl is not receiving the argument as UTF-8:

# ... Appended to original script shown at top ...
use Encode qw(encode decode);

sub recode 
{ 
  return encode('UTF-8', decode( 'cp1252', $_[0] ));
}

say "\n@{['='x60]}\n"; # Output separator
say "Original arg: $arg";
say "After recoding CP1252 -> UTF-8: ${\recode($arg)}";

Script execution:

V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s

New output:

ENV: LC_ALL=en_US.UTF-8 LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

============================================================

Original arg: 0s
After recoding CP1252 -> UTF-8: 0’s

UPDATE

I built a simple C++ test app to get a better handle on what is happening.

Here is the source code:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>

int main(int argc, const char *argv[])
{
  if (argc!=2)
  {
    std::cerr << "A single command line argument is required\n";
    return 1;
  }

  const char *arg=argv[1];
  std::size_t arg_len=strlen(arg);

  // Display argument as a string
  std::cout << "Argument: " << arg << " length: " << arg_len << '\n';

  // Display argument bytes
  // Fill with leading zeroes
  auto orig_fill_char=std::cout.fill('0');

  std::cout << "Bytes of argument, in hex:";
  std::cout << std::hex;
  for (std::size_t arg_idx=0; arg_idx<arg_len; ++arg_idx)
  {
    // Note: The cast to uint16_t is necessary because uint8_t is formatted 
    //       "specially" (i.e., still as a char and not as an int)
    //       The cast through uint8_t is necessary due to sign extension of
    //       the original char if going directly to uint16_t and the (signed) char
    //       value is negative.
    //       I could have also masked off the high byte after the cast, with
    //       insertion code like (Note: Parens required due to precedence):
    //         << (static_cast<uint16_t>(arg[arg_idx]) & 0x00ff)
    //       As they say back in Perl-land, "TMTOWTDI!", and in this case it
    //       amounts to the C++ version of Perl "line noise" no matter which
    //       way you slice it. :)
    std::cout << ' ' 
              << std::setw(2) 
              << static_cast<uint16_t>(static_cast<uint8_t>(arg[arg_idx])); 
  }
  std::cout << '\n';

  // Restore the original fill char and go back to decimal mode
  std::cout << std::setfill(orig_fill_char) << std::dec;
}

Built as a 64-bit console-based application with the MBCS character set setting, the above code was run with:

testapp.exe 0’s

..., and produced the following output:

Argument: 0s length: 3
Bytes of argument, in hex: 30 92 73

So, it is Windows, after all, at least in part. I need to build a UNICODE character set version of this app and see what I get.

Final Update on How to Fix This Once and for All

Thanks to Eryk Sun's comments to ikegami's accepted answer and links in that answer, I have found the best solution, at least with regard to Windows 10. I will now outline the specific steps to follow to force Windows to send command-line args into Perl as UTF-8:

A manifest needs to be added to both perl.exe and wperl.exe (if you use that), which tells Windows to use UTF-8 as the active code page (ACP) when executing the perl.exe application. This will tell Windows to pass command line arguments into perl as UTF-8 instead of CP1252.

Changes that Need to be Made

Create the manifest file(s)

Go to the location of your perl.exe (and wperl.exe) and create a file in that (...\bin) directory with the following contents, calling it perl.exe.manifest:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage
        xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
      >UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

If you also want to modify wperl.exe copy the above file perl.exe.manifest to wperl.exe.manifest and modify that file, replacing the assemblyIdentity line:

  <assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>

with (notice the change of the value assigned to the name attribute from perl.exe to wperl.exe):

  <assemblyIdentity type="win32" name="wperl.exe" version="6.0.0.0"/>

Embed the Manifests in the Executable(s)

The next step is to take the manifest file(s) we just created and embed them in their respective executable(s). Before doing this, be sure to back up the original executables, just in case!

The manifest(s) can be embedded into the executable(s) as follows:

For perl.exe:

mt.exe -manifest perl.exe.manifest -outputresource:perl.exe;#1

For wperl.exe (optional, needed only if you use wperl.exe):

mt.exe -manifest wperl.exe.manifest -outputresource:wperl.exe;#1

If you don't already have the mt.exe executable, it is part of the Windows 10 SDK, which can be downloaded from the "Download Windows 10 SDK" page at developer.microsoft.com.

Rudimentary Testing and Usage

After making the above changes, UTF-8 command line args become super easy!

Take the following script, simple-test.pl:

use strict;
use warnings;
use v5.32; # Or whatever recent version of Perl you have

# Helper subroutine to provide simple hex table output formatting
use constant BYTES_PER_LINE => 16; # Output 16 hex pairs per line

sub hexdump
{
  my ($arg)=@_;

  for my $chr_idx (0 .. length($arg)-1)
  {
    # Start each group of 16 hex pairs on a new line, prefixed by its offset
    # (double quotes are needed so that \n is an actual newline)
    printf("\n  %02x: ", $chr_idx)
      if $chr_idx%BYTES_PER_LINE==0;
    printf('%02x ',ord(substr($arg,$chr_idx,1)));
  }
  say '';
}

# Test app code that makes no mention of Windows, ACPs, or UTF-8 outside
# of stuff that is printed. Other than the call out to chcp to get the
# active code page for informational purposes, it is not particularly tied
# to Windows, either, as long as whatever environment it is run on
# passes the script its arg as UTF-8, of course.
my $arg=shift @ARGV or die 'No argument present';

say "Argument: $arg";
say "Argument byte length: ${\length($arg)} bytes";
print 'Argument UTF-8 data bytes in hex:';
hexdump($arg);

Let's test our script, making sure that we are in the UTF-8 code page (65001):

v:\videos>chcp 65001 && perl.exe simple-test.pl "Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8"

Output (assuming your console font can handle the special chars):

Active code page: 65001
Argument: Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8
Argument byte length: 54 bytes
Argument UTF-8 data bytes in hex:
  00: d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20
  10: f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d
  20: c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67
  30: 20 55 54 46 2d 38
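
Note that with the manifest in place the argument still arrives as a string of UTF-8 bytes, which is why length() reports bytes above. A script that needs to work with characters would presumably still decode the argument itself (or be run with -CA). The following is only a minimal sketch of that step using the core Encode module; the raw bytes are simulated here rather than read from @ARGV, so it can be tried on any platform:

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Simulated raw argument: the UTF-8 bytes for 0, U+2019, s
# (30 e2 80 99 73), as perl.exe would receive them once its ACP is UTF-8.
my $raw = "0\xE2\x80\x99s";

# Decode the bytes into a character string.
my $text = decode('UTF-8', $raw);

my $byte_count = length($raw);    # counts bytes
my $char_count = length($text);   # counts characters

print "bytes: $byte_count, characters: $char_count\n";

# Encode again before writing to a byte-oriented handle such as STDOUT.
print encode('UTF-8', $text), "\n";
```

The first line printed is "bytes: 5, characters: 3", since U+2019 takes three bytes in UTF-8.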

I hope that my solution will help others that run into this issue.

Michael Goldshteyn
  • "92" doesn't correspond to a quote mark in any character set I can find. But you've definitely put U+2019 in your question. Weird. – Schwern Sep 13 '20 at 02:30
  • Windows is natively UTF-16. If Perl supports UTF-8 for command line arguments, environment variables, and console I/O, then it's by transcoding between UTF-16 and UTF-8. One exception is that the console output codepage works with UTF-8 (65001) in Windows 8+, but setting the input codepage to UTF-8 is limited to 7-bit ASCII; non-ASCII characters get read as null bytes. The only reliable way to support UTF-8 for Windows console input and output is to use the wide-character API (e.g. `ReadConsoleW`, `WriteConsoleW`) and transcode between UTF-16 and UTF-8. Python implements this. Does Perl? – Eryk Sun Sep 13 '20 at 02:50
  • @Schwern [Windows CP1252 table at Wikipedia](https://en.wikipedia.org/wiki/Windows-1252#Character_set) look at `\x{92}` - which is row `9_` column `_2` in the table. – Michael Goldshteyn Sep 13 '20 at 02:58
  • Wrote a C++ test app to further test what is happening. At least with an MBCS character set console app, I am seeing the same misbehavior (independent of Perl). Going to try to create a UNICODE character set version and see what results I get. – Michael Goldshteyn Sep 13 '20 at 03:40
  • Added a summary of the ideas presented at the microsoft link ikegami put in his answer. – Michael Goldshteyn Sep 14 '20 at 00:17

2 Answers

3

Every Windows system call that deals with strings comes in two varieties: An "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16le.[1] Perl uses the A version of all system calls. That includes the call to get the command line.

The ACP is hard-coded. (Or maybe Windows asks for the system language during setup and bases it on that? I can't remember.) For example, it's 1252 on my system, and there's nothing I can do to change that. Notably, chcp has no effect on the ACP.

At least, that was the case until recently. The May 2019 update to Windows added the ability to change the ACP on a per-application basis via its manifest. (The page indicates that it's possible to change the manifest of an existing application.)

chcp changes the console's CP, but not the encoding used by the A system calls. Setting it to a code page that contains ’ ensures that you can type in ’, and that Perl can print out a ’ (if properly encoded).[2] Since 65001 contains ’, you have no problems doing those two things.

The choice of console's CP (as set by chcp) has no effect on how Perl receives the command line. Because Perl uses the A version of the system calls, the command line will be encoded using the ACP regardless of the console's CP and the OEM CP.
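
To make the difference between the two encodings concrete, here is a minimal sketch using the core Encode module (the helper name hex_bytes is hypothetical, chosen only for this illustration). It prints the bytes that U+2019 becomes under CP1252, which is what the A calls deliver when the ACP is 1252, and under UTF-8:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, $_[0] }

my $quote = "\x{2019}";   # RIGHT SINGLE QUOTATION MARK

my $as_cp1252 = hex_bytes(encode('cp1252', $quote));   # a single byte
my $as_utf8   = hex_bytes(encode('UTF-8',  $quote));   # three bytes

print "cp1252: $as_cp1252\n";   # prints "cp1252: 92"
print "UTF-8:  $as_utf8\n";     # prints "UTF-8:  e2 80 99"
```

The single byte 92 matches what the script in the question reported for its argument.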


Based on the fact that ’ is encoded as 92, your system appears to use 1252 for its Active Code Page as well. As such, you could resolve your problem as follows:

use Encode qw( decode );

@ARGV = map { decode("cp1252", $_) } @ARGV;

See this post for a more generic and portable solution which also adds the appropriate encoding/decoding layer to STDIN, STDOUT and STDERR.
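
As a rough illustration of what a less hard-coded variant might look like, the sketch below asks Win32::GetACP() for the code page instead of assuming 1252. The helper names args_encoding and decode_args are hypothetical, and the fallback to UTF-8 on non-Windows systems is an assumption of this sketch, not something taken from the linked post:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: name the encoding the command line arrived in.
# On Windows this is derived from the ACP; elsewhere UTF-8 is assumed
# (an assumption of this sketch, not a guarantee made by the OS).
sub args_encoding {
    if ($^O eq 'MSWin32') {
        require Win32;
        my $acp = Win32::GetACP();
        return $acp == 65001 ? 'UTF-8' : "cp$acp";
    }
    return 'UTF-8';
}

# Hypothetical helper: decode byte strings into character strings.
sub decode_args {
    my ($encoding, @raw) = @_;
    return map { decode($encoding, $_) } @raw;
}

@ARGV = decode_args(args_encoding(), @ARGV);

# The byte string from the question, decoded as CP1252:
my ($demo) = decode_args('cp1252', "0\x92s");
printf "U+%04X\n", ord(substr($demo, 1, 1));   # prints "U+2019"
```

This only recovers characters that exist in the ACP; anything else has already been lost by the time Perl receives the command line.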


But what if you wanted to support arbitrary Unicode characters instead of being limited to those found in your system's ACP? As mentioned above, you could change perl's ACP. Changing it to 65001 (UTF-8) would give you access to the entire Unicode character set.

Short of doing that, you would need to get the command line from the OS using the W version of the system call and parse it.

While Perl uses the A version of system calls, this doesn't force modules to do the same. They may use W system calls.[3] So maybe there's a module that does what you need. If not, I've previously written code that does just that.


Many thanks to @Eryk Sun for the input they provided in the comments.


  • The ACP can be obtained using Win32::GetACP().
  • The OEM CP can be obtained using Win32::GetOEMCP().
  • The console's CP can be obtained using Win32::GetConsoleCP() / Win32::GetConsoleOutputCP().

  1. SetFileApisToOEM can be used to change the encoding used by some A system calls to the OEM CP.[2]
  2. The console's CP defaults to the system's OEM CP. This can be overridden by changing the CodePage value of the HKCU\Console\<window title> registry key, where <window title> is the initial window title of the console. Of course, it can also be overridden using chcp and the underlying system calls it makes.
  3. Notably, see Win32::LongPath.
ikegami
  • The process OEM codepage (i.e. `CP_OEMCP` and [`GetOEMCP`](https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp)) defaults to the system OEM codepage. In Windows 10, the ANSI (`CP_ACP`) and OEM codepages can be set to UTF-8 at the system level or in the application manifest via the ["activeCodePage" setting](https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page). – Eryk Sun Sep 13 '20 at 21:27
  • Most multibyte API functions use the [A]NSI codepage (e.g. `CreateProcessA`), but the filesystem API can be switched to OEM via [`SetFileApisToOEM`](https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setfileapistooem). The console's input or output codepage (respectively `GetConsoleCP` and `GetConsoleOutputCP`) default to the OEM codepage of the conhost.exe process, unless specified differently in the "CodePage" value that can be set in the registry key for the initial window title at "HKCU\Console\". – Eryk Sun Sep 13 '20 at 21:29
  • If OEM is set to UTF-8 at the system level, it's broken as the default codepage in the console. The console still does not support UTF-8 for the input codepage (i.e. `GetConsoleCP`) with multibyte `ReadFile` or `ReadConsoleA`, in which case it reads non-ASCII characters as null bytes. With system OEM set to UTF-8, the "CodePage" value has to be set for each console window in which one needs non-ASCII multibyte input. This doesn't affect applications that use the console's wide-character (UTF-16) API (e.g. `ReadConsoleW`), such as normal console I/O in Python and PowerShell. – Eryk Sun Sep 13 '20 at 21:31
  • @Eryk Sun, I don't know what it uses, but Perl is capable of reading UTF-8 from the console if it's set to 65001. – ikegami Sep 13 '20 at 21:49
  • @Eryk Sun, OMG! `SetFileApisToOEM` makes so much sense! But does it affect `GetCommandLineA`? I'm guessing it doesn't since `GetCommandLineA` doesn't technically return file names. Also, a script calling `SetFileApisToOEM` would surely do it too late to affect `@ARGV`. – ikegami Sep 13 '20 at 22:17
  • @Eryk Sun, OMG! Looks like you can change an existing executable's manifest. Changing Perl's to change its ACP to UTF-8 would make so much sense! – ikegami Sep 13 '20 at 22:18
  • @Eryk Sun, I have updated my answer to account for the information you posted. Many thanks! – ikegami Sep 13 '20 at 22:18
  • I will give you the checkmark and update my answer with the correct solution which is quite different from yours. However, Eryk Sun provided just the hint needed to fix this problem for all Perl scripts, at least on Windows 10! Please update your answer to refer to the bottom of the original question for the correct Windows 10 solution, so people are not confused... – Michael Goldshteyn Sep 13 '20 at 22:37
  • @Michael Goldshteyn, "Building a UNICODE character set version of this app" makes no sense. That would involve making major, backwards-incompatible changes to Perl. The actual solution and what Eryk Sun pointed out can be done is to change the app's ACP. This is already mentioned in my answer – ikegami Sep 13 '20 at 22:47
  • @ikegami, that is not what I intend to do at all, see my updated question with the far more plausible solution presented at the bottom. – Michael Goldshteyn Sep 13 '20 at 23:34
  • ok? Not sure why you want me to look at you doing exactly what I suggested you should do? Again, this is already in the answer since before you asked to have it added to the answer. ("*But what if you wanted to support arbitrary Unicode characters instead of being limited to those found in your system's ACP? As mentioned above, you could [change](https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page) `perl`'s ACP.*") – ikegami Sep 13 '20 at 23:37
  • Well, it's succinct and guides other developers step by step on how to enact the manifest based changes, so that it's easy, that's all. – Michael Goldshteyn Sep 13 '20 at 23:41
  • hmmm, I didn't ask why you paraphrased MS's doc into your answer. I'm wondering why you're telling *me* all of this. – ikegami Sep 13 '20 at 23:42
  • @ikegami, I didn't ask to have it added to your answer, I said I will add it to my question as an outline at the end with the steps - no reason to be rude. Have a good day. – Michael Goldshteyn Sep 13 '20 at 23:43
  • You said: "*Please update your answer to refer to the bottom of the original question for the correct Windows 10 solution, so people are not confused.*" Said solution was already in my answer. – ikegami Sep 13 '20 at 23:44
  • Maybe instead of using the word _correct_, I should have said _succinct and Perl specific_. Anyhow, the checkmark is yours. Thanks for the help. – Michael Goldshteyn Sep 13 '20 at 23:46
  • "*Please update your answer to refer to the bottom of the original question for the succinct and Perl-specific Windows 10 solution, so people are not confused.*" huh? What you posted is not succinct, and it's not Perl-specific. Now I'm confused. – ikegami Sep 13 '20 at 23:47
  • I don't know what you are reading, but I described the actual content of the manifest file for `perl.exe` and `wperl.exe`, along with a sample script that shows how much simpler UTF-8 param handling becomes once the manifest is added. – Michael Goldshteyn Sep 13 '20 at 23:49
0

use utf8 only tells Perl that the source code itself is encoded in UTF-8 (for example, in variable names, function names, and string literals). Everything else is untouched, including @ARGV. So my $arg=shift @ARGV is reading raw bytes.

Unicode in Perl is complicated. The simplest thing to do is to use utf8::all instead, which turns on UTF-8 for syntax, all filehandles, @ARGV and everything else.
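
One caveat is that any UTF-8 decoding of @ARGV only works if the bytes really are UTF-8. The following minimal sketch, using the core Encode module on the bytes reported in the question, shows why that assumption fails here: a lone 0x92 byte is not valid UTF-8, so a strict decode croaks, while decoding as CP1252 succeeds.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $raw = "0\x92s";   # the bytes Perl received in the question

# Strict UTF-8 decoding croaks on the lone 0x92 byte. A copy is passed
# because decode() may modify its input when a CHECK value is given.
my $copy        = $raw;
my $as_utf8     = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
my $utf8_failed = defined $as_utf8 ? 0 : 1;

# Decoding with the encoding the bytes are actually in succeeds.
my $as_cp1252 = decode('cp1252', $raw);

print "strict UTF-8 decode failed: ", ($utf8_failed ? 'yes' : 'no'), "\n";
printf "cp1252 decode gives U+%04X in the middle\n",
    ord(substr($as_cp1252, 1, 1));
```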

Schwern
  • I think you meant `use utf8 qw(:all);` and that did not change a thing. I think the perl command line switch `-CSDA` does exactly the same thing and I did try that with no change. – Michael Goldshteyn Sep 13 '20 at 02:57
  • @MichaelGoldshteyn No, I meant `utf8::all`. There's a link and everything. However, you're right that -CSDA should have done it. Give utf8::all a shot anyway. – Schwern Sep 13 '20 at 03:05
  • OK, installed utf8::all, got: `UTF-8 "\x92" does not map to Unicode at c:/perl/site/5.32.0/lib/utf8/all.pm line 231` Oh well... Back to square one. – Michael Goldshteyn Sep 13 '20 at 03:09
  • @MichaelGoldshteyn That indicates your input is CP1252 `\x{92}`, not UTF-8 `\x{2019}`. – Schwern Sep 13 '20 at 03:11
  • Figured out that much based on the _hackish_ code at the bottom of my question, which does a conversion from CP1252 -> UTF-8, and produces correct output. But, the big question is: Why?? Especially given that `chcp` correctly reports 65001 and my `ShowArgs` tool correctly shows UTF-8 data for the command line arg? – Michael Goldshteyn Sep 13 '20 at 03:13
  • @MichaelGoldshteyn What does `chcp` say if you run it from the command line? – Schwern Sep 13 '20 at 03:14
  • Thank you for mentioning `utf8::all`! And thank you for contributing to it, of course :) I didn't know about it -- it seems very handy and useful! – zdim Sep 13 '20 at 04:28
  • chcp says 65001 (just like it does when run from the script). – Michael Goldshteyn Sep 13 '20 at 06:19
  • `chcp` won't help here, and utf8::all isn't going to be very useful on a Windows system. See [my answer](https://stackoverflow.com/a/63868721/589924) for details. – ikegami Sep 13 '20 at 08:45