Perl newbie first experience with Unicode (in filename, -e operator, open operator, and cmd window)

Question

I have a Windows Perl (5.16.1 32 bit) program that opens a media file and (using ffmpeg) it extracts segments of audio - the purpose of which is to convert a single album music track (containing multiple songs) into multiple individual song files.

When the name of the media file to be processed is all ASCII characters, this all works rather well. I recently tried this program against a filename that includes Russian characters, and the program fails miserably in several areas.

This must have to do with Unicode, but I have never needed to do anything with Unicode before, so I am rather confused by the various failures I am seeing, and I do not know how to fix them.

I have distilled this down to the minimum to demonstrate the problems.

If I open a cmd window, and type 'chcp', the return value is 437.

If I do a 'dir' command, this is what is shown for me:

04/01/2019  11:46 AM        71,982,427 IC3PEAK альбом Сладкая.mkv
06/10/2020  10:42 PM               275 test.pl

(Note how in my cmd window, the Russian characters do display as Russian characters.)

My 'test.pl' Perl script is here:

use open ":std", ":encoding(UTF-8)";

$media = "IC3PEAK альбом Сладкая.mkv";

if (-e $media) {
   print "Media file does exist\n";
} else {
   print "Media file does NOT exist\n";
}

open(IN, $media) || die "Media file ($media) can not be opened!\n";

When this Perl script runs, using default chcp value of 437, I get this as output:

Media file does NOT exist
Media file (IC3PEAK ├É┬░├É┬╗├æ┬î├É┬▒├É┬╛├É┬╝ ├É┬í├É┬╗├É┬░├É┬┤├É┬║├É┬░├æ┬Å.mkv) can not be opened!

If I run 'chcp 1250' in my cmd window, and I re-run this Perl script, I get this as output:

Media file does NOT exist
Media file (IC3PEAK ĂÂ°ĂÂ»Ă‘ÂŚĂÂ±ĂÂľĂÂĽ ĂÂˇĂÂ»ĂÂ°ĂÂ´ĂÂşĂÂ°Ă‘ÂŹ.mkv) can not be opened!

Problem 1: I am told the media file does not exist.

Problem 2: When I print the media file name to STDOUT, notice how the displayed file name no longer matches how it looked when I did the 'dir' command?

Can anyone suggest how to fix these two problems?

PS - Noting, when I change the disk file name to pure ASCII 'IC3PEAK.mkv', and change the $media variable to also equal 'IC3PEAK.mkv', running the modified Perl script gives:

Media file does exist

That surely also needs `use utf8;`, to tell the interpreter that the source itself is UTF-8. So add that and let us know how it goes...? — zdim, Jun 11 '20 at 06:27
Tip: Avoid global vars. `open(IN, ...)` should be `open(my $IN, ...)` — ikegami, Jun 11 '20 at 21:25
Tip: Avoid 2-arg `open`. `open(my $IN, $media)` should be `open(my $IN, '<', $media)` — ikegami, Jun 11 '20 at 21:25
Tip: Include the error reason in your error messages. `open(...) or die("Can't open \"$media\": $!\n");` — ikegami, Jun 11 '20 at 21:26
1) Merely adding 'use utf8;' to the beginning of the script did not change the result of my originally reported problem 1 or problem 2. 2) Despite installing the Win32 module using PPM, adding 'use Win32;', 'use Encode qw( encode );' and 'encode("cp", Win32::GetAPC(), $media);' gave this error: 'Undefined subroutine &Win32::GetAPC called at C:\data\srcperl\SplitAlbumIntoSongs\test.pl line 8'. Has anyone modified my 'test.pl' script in such a way that it gets past problem 1 and problem 2? If so, what is the corrected script? And/or must I execute a different 'chcp' command? — user1232031, Jun 12 '20 at 05:00
Re "*merely adding 'use utf8;' [didn't solve either problem]*" Correct. While a required fix, it's not a complete fix. See my answer. — ikegami, Jun 12 '20 at 06:37
Re "*Undefined subroutine &Win32::GetAPC*", The sub is named `GetACP` (Get Active Code Page), not `GetAPC` — ikegami, Jun 12 '20 at 06:38

Polar Bear · Answer 1 · 2020-06-12T08:03:50.430


The following code was tested on Windows 10 1903 with Win32 strawberry-perl 5.30.2.1 #1 Tue Mar 17 03:21:32 2020 x64, where `perl -MWin32 -e"CORE::say Win32::GetACP()"` returns ACP 1252 (Win 10 North America).

The initial attempt to install the module with `cpan Win32::Unicode::File` failed with the message `t/04_print.t (Wstat: 768 Tests: 13 Failed: 3)`.

A quick Google search led to the following post on Perl Monks. It looks like the problem with installing Win32::Unicode::File has been known for some time.

NOTE: ikegami pointed out that the module can be force-installed and the failed test can be ignored. Please see his comment below.

The following test code confirms that a forced installation (`cpan -f -i Win32::Unicode::File`) produces the desired outcome.

use strict;
use warnings;
use feature 'say';

use utf8;

use Win32::Console;
use Win32::Unicode::File;

Win32::Console::OutputCP( 65001 );

binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

my $fname = 'Доброе утро Россия.mkv';
my $fh = Win32::Unicode::File->new;

open $fh, '<:encoding(UTF-8)', $fname 
    or die "Can't open $fname $!";

while( <$fh> ) {
    say;
}

close $fh;

The content of the input file Доброе утро Россия.mkv is

Доброе утро Россия

As suggested in the above-mentioned post, I also tried Win32::LongPath as an alternative. This module installed successfully.

use strict;
use warnings;
use feature 'say';

use utf8;

use Win32::Console;
use Win32::LongPath;

Win32::Console::OutputCP( 65001 );

binmode STDOUT, ':encoding(UTF-8)';
binmode STDERR, ':encoding(UTF-8)';

my $fname = 'IC3PEAK альбом Сладкая.mkv';
my $fh;

openL \$fh, '<:encoding(UTF-8)', $fname
    or die "Can't open $fname ($^E)";

while( <$fh> ) {
    # process input
    say;
}

close $fh;

Instead of the real file IC3PEAK альбом Сладкая.mkv, a text file with the same name was used in the test, with the following content:

Привет Москва

Note: on a real mkv file, use `openL \$fh, '<', $fname` to read the content of the file.

edited Jun 12 '20 at 08:03

answered Jun 12 '20 at 05:18


Re "*Initial attempt to install `cpan Win32::Unicode::File` failed*", As explained [here](https://github.com/xaicron/p5-win32-unicode/issues/7), it's a bad test, not a problem with the module itself. The failure can safely be ignored. – ikegami Jun 12 '20 at 06:36
The new script provided by #ikegami does run. Thanks! But I have 2 questions. 1) why the need for ' perl -MWin32 -e"CORE::say Win32::GetACP()" ' , as I see your value of 1252 is not referenced anywhere in the script. That one-liner also returned 1252 when ran on my pc. and 2) is there no solution to get the built-in file test operator '-e' (does file exist) to work when the filename contains NON-ANSI characters? – user1232031 Jun 12 '20 at 15:04
Code page 1252 does not include Cyrillic letters. This means that the perl build for your system cannot, without a special trick, use `stat/open/close/read` on such file names, as it uses Windows API functions with the A suffix (CreateFileA). To achieve the desired result we have to resort to an additional module which has access to the Windows API functions with the W suffix (CreateFileW). The command `perl -MWin32 -e"CORE::say Win32::GetACP()" ` gives us a clue about what system we are dealing with. For example, if it returns code page 1251, then we could use `stat/open/close/read` directly. – Polar Bear Jun 12 '20 at 16:01
If you installed a Russian version of Windows, you would have code page 1251 and your code would work as it is (`use strict; use warnings;` recommended, and possibly `use diagnostics;` in case of complicated errors). Your system uses code page 1252; for this reason you have to use `Win32::Unicode` or `Win32::LongPath`, which provide an interface to the Windows API functions with the W suffix (CreateFileW). – Polar Bear Jun 12 '20 at 16:07
See the documentation for [Win32::Unicode::File](https://metacpan.org/pod/Win32::Unicode::File): near the very end there is a section **file_type('attribute', $file_or_dir)** which demonstrates how to check whether a file exists or is a directory. See the documentation for [Win32::LongPath](https://metacpan.org/pod/Win32::LongPath), sections **testL** and **statL**, which demonstrate how to check whether a file exists or is a directory. I also recommend visiting the [Unicode issues in Perl](https://www.i-programmer.info/programming/other-languages/1973-unicode-issues-in-perl.html) web page. – Polar Bear Jun 12 '20 at 16:13
Note: I wrote new script, but ikegami indicated that `Win32::Unicode::File` can be installed forcefully ignoring failed test. I wrote the code to confirm that his statement holds the ground and works properly. Just for reference `Win32::Unicode::File` was last time updated in 2012 and the issue with failed test was ongoing for a long time -- [tests](http://matrix.cpantesters.org/?dist=Win32-Unicode+0.38). – Polar Bear Jun 12 '20 at 16:19

ikegami · Answer 2 · 2022-08-05T03:38:03.193

Three fixes are needed.

Non-ASCII source without use utf8;

Your source contains non-ASCII characters.

$media = "IC3PEAK альбом Сладкая.mkv";

Perl expects source code to be encoded using ASCII, unless you use `use utf8;`. Encode your source using UTF-8 and add `use utf8;`.

use utf8;

# String of decoded text (aka string of Unicode Code Points).
# Length = 26
my $media = "IC3PEAK альбом Сладкая.mkv";

Assuming your file was encoded using UTF-8, what you had was equivalent to the following:

use utf8;
use Encode qw( encode );

# String of text encoded using UTF-8 (aka string of bytes).
# Length = 39
my $media = encode("UTF-8", "IC3PEAK альбом Сладкая.mkv");
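
The two lengths quoted above can be checked directly. This is a minimal sketch, assuming the script itself is saved as UTF-8; the variable names `$decoded` and `$encoded` are only illustrative.

```perl
use strict;
use warnings;
use utf8;                       # the source below is UTF-8
use Encode qw( encode );

# Decoded text: one element per Unicode code point.
my $decoded = "IC3PEAK альбом Сладкая.mkv";

# The same text as UTF-8 bytes: each Cyrillic letter takes two bytes.
my $encoded = encode("UTF-8", $decoded);

printf "decoded length: %d\n", length($decoded);   # 26
printf "encoded length: %d\n", length($encoded);   # 39
```

The 13 ASCII characters take one byte each and the 13 Cyrillic letters take two, which is where 39 comes from.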

Incorrect output encoding

Your code contains

use open ":std", ":encoding(UTF-8)";

This tells Perl the following:

Decode bytes received from STDIN using UTF-8.
Encode characters sent to STDOUT and STDERR using UTF-8.
Do the same for file handles opened in the current lexical scope.

The problem is that your terminal isn't expecting UTF-8. It's expecting cp437 (before chcp 1250) or cp1250 (after chcp 1250).
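
The garbage shown in the question can be reproduced step by step, which may help to see how the first two problems combine. This is only a sketch of what presumably happened, assuming the original script was saved as UTF-8 and the terminal was using cp437; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode decode );

my $name = "IC3PEAK альбом Сладкая.mkv";

# Step 1: without "use utf8;", Perl sees the literal as its UTF-8 bytes,
# one "character" per byte.
my $as_bytes = encode("UTF-8", $name);

# Step 2: the ":encoding(UTF-8)" output layer encodes those "characters" again.
my $sent_to_terminal = encode("UTF-8", $as_bytes);

# Step 3: a cp437 terminal shows each byte it receives as one cp437 character.
my $shown = decode("cp437", $sent_to_terminal);

# Print the result as UTF-8 so it can be inspected in a UTF-8 viewer.
binmode STDOUT, ":encoding(UTF-8)";
print "$shown\n";
```

For instance, the first Cyrillic letter (U+0430, bytes D0 B0) becomes the bytes C3 90 C2 B0 after the second encoding, which cp437 displays as the four characters seen right after "IC3PEAK " in the question's output.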

Solution 1:

Adjust the encoding specified in the `use open` line. This can be done without hardcoding the encoding.
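
One way to avoid hardcoding is to ask Windows for the console's code pages at run time. The following is a minimal sketch rather than a definitive implementation: it sets the layers with `binmode` on the standard handles instead of through the `use open` pragma, and the UTF-8 fallback for non-Windows systems is an assumption added so the script runs anywhere.

```perl
use strict;
use warnings;
use utf8;

# Pick the encodings the console actually expects instead of assuming UTF-8.
my $enc_in  = "UTF-8";          # fallback for non-Windows systems (assumption)
my $enc_out = "UTF-8";
if ($^O eq 'MSWin32') {
    require Win32;
    $enc_in  = "cp" . Win32::GetConsoleCP();        # e.g. cp437
    $enc_out = "cp" . Win32::GetConsoleOutputCP();
}

binmode STDIN,  ":encoding($enc_in)";
binmode STDOUT, ":encoding($enc_out)";
binmode STDERR, ":encoding($enc_out)";

print "Output encoding: $enc_out\n";
```

Unlike `use open`, this does not set default layers for file handles opened later, so those would still need an explicit layer in their `open` call.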

Of course, you'll only be able to print the Cyrillic characters if the terminal's OEM code page (as set using chcp) supports the characters. This brings us to a second solution.

Solution 2:

Adjust the terminal to provide/expect UTF-8. This can be done using the following:

chcp 65001

Limitation of builtin functions that accept file names

Windows provides two versions of each function that accepts strings:

The "UNICODE" version (suffixed with "W" for "wide") accepts/returns strings encoded using UTF-16le. This version supports all Unicode characters.
The "ANSI" version (suffixed with "A") accepts/returns strings encoded using the Active Code Page (ACP). The "A" version only supports a small subset of the Unicode characters.

You can obtain the ACP for your system using the following:
```
perl -MWin32 -e"CORE::say Win32::GetACP()"
```

Unfortunately, Perl functions (named operators) use the "A" version of system calls and expect/return text encoded using the ACP. This severely limits which file names can be passed to them.

For example, my system's ACP is 1252, so the "A" version of system calls would not support Cyrillic characters. This means there is nothing I can do to make open, -e, etc work with file names containing Cyrillic characters. Ouch.
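
Whether a given file name can be expressed in a particular code page can be checked with the Encode module alone. This is an illustrative sketch; `representable_in` is a hypothetical helper written for this example, not part of any module.

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode );

# Returns 1 if every character of $name exists in the given code page,
# i.e. if the name could be handed to the "A" system calls under that ACP.
sub representable_in {
    my ($code_page, $name) = @_;
    my $copy = $name;   # encode() may modify its argument when a CHECK is given
    return eval { encode($code_page, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $media = "IC3PEAK альбом Сладкая.mkv";
for my $cp (qw( cp1252 cp1251 )) {
    printf "%s: %s\n", $cp,
        representable_in($cp, $media) ? "representable" : "not representable";
}
```

With cp1252 the encoding fails on the first Cyrillic letter, while cp1251 (the Cyrillic code page) can hold the whole name.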

[Upd: I now recommend Win32::LongPath instead.] The Win32-Unicode distribution can help with this. For example, -e is just a call to stat, and Win32::Unicode::File provides statW, a version of stat that accepts file names as decoded text. Similarly, it provides a replacement for open.
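
As a rough sketch of how the question's test script might look with Win32::LongPath, the following uses `testL` in place of `-e` and `openL` in place of `open`. It assumes the module is installed and that its functions can be called fully qualified after `require`; the non-Windows branch exists only so the script runs everywhere.

```perl
use strict;
use warnings;
use utf8;

my $media = "IC3PEAK альбом Сладкая.mkv";

if ($^O eq 'MSWin32') {
    require Win32::LongPath;

    # testL('e', ...) plays the role of the builtin -e.
    if (Win32::LongPath::testL('e', $media)) {
        print "Media file does exist\n";

        # openL plays the role of the builtin open.
        my $fh;
        Win32::LongPath::openL(\$fh, '<', $media)
            or die "Media file can not be opened: $^E\n";
        binmode $fh;            # an mkv file is binary data
        close $fh;
    } else {
        print "Media file does NOT exist\n";
    }
} else {
    print "Win32::LongPath is only available on Windows\n";
}
```

Printing the file name itself would additionally need the output encoding fix described earlier.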

Could you clarify a little bit more on [Code Page](https://learn.microsoft.com/en-us/windows/win32/intl/code-pages#:~:text=Windows%20code%20pages%20are%20also,the%20currently%20active%20code%20page.) in Windows? Your statement is that if the system ACP is 1252 then perl cannot manipulate (open/read/close/stat) files whose names contain Cyrillic characters encoded in UTF8. Is my understanding correct that if the Windows ACP is [1251](https://en.wikipedia.org/wiki/Windows-1251) then Cyrillic filenames could be used in `open/read/close/stat`? — Polar Bear, Jun 12 '20 at 00:15
If the Windows system is DBCS based (Double Byte Character Set), is perl able to work with filenames encoded in UTF8? Please expand a little bit on the last two paragraphs of your answer. It would be nice to hear if there is any workaround for this limitation. — Polar Bear, Jun 12 '20 at 00:15
Info for reference: [Windows code page](https://en.wikipedia.org/wiki/Windows_code_page#Windows-125x_series) — Polar Bear, Jun 12 '20 at 00:17
@Polar Bear, Re "*Is my understanding that if Windows ACP 1251 then Cyrillic filenames could be used in open/read/close/stat correct?*", Yes. (They would need to be encoded using cp1251) — ikegami, Jun 12 '20 at 02:48
Re "*If the Windows system is DBCS based (Double Byte Character Set)*", I don't have experience with those. Perl uses `CreateFileA` (etc), not `CreateFileW`. Anything `CreateFileA` accepts, Perl does too. If one's ACP is a DBCS (950?), then any character represented by ACP could be provided `open` (etc). — ikegami, Jun 12 '20 at 02:52
Re "*is perl able to work with filenames encoded in UTF8*", I don't think any ACP covers the entire Unicode character set. — ikegami, Jun 12 '20 at 02:53
Re "*Would be nice to hear if there is any work around of this limitation*", My answer already provides that: Use a module that calls `CreateFileW` (etc) instead of `CreateFileA`, such as Win32::Unicode::File. — ikegami, Jun 12 '20 at 02:55
