5

I'm running the latest Perl under German Windows 7 and want to use UTF-8 everywhere in my Perl programs (for the script, file contents, file names, mail texts, etc.).

Everything works fine, except when I try to process files with special characters in their names. Even system calls do not work properly. So can I tell Perl to use UTF-8 everywhere, and if so, how?

I experimented with encode and decode for a while, but it is unclear to me why it behaves the way it does. I also need encode('cp850', TEXT) for text to display correctly in the command prompt window.

Examples:

When I need to copy a file, it only works when I use File::Copy's copy(encode("iso-8859-1", $filename), ...), and when I want to work with PDF file contents, the command that succeeds is system(encode('cp850', sprintf('pdftk.exe %s...', decode('utf8', $file))));

Why is that (especially the decode in the system call), and is there an easier way? Maybe something with use open ':encoding...', but I have had no luck so far.

toshniba
  • 5
    You can't use UTF-8 for the file names when the filing system itself uses something else. – Borodin Sep 11 '18 at 14:57
  • 1
    See [this answer](https://stackoverflow.com/a/6163129) for one suggestion. – mob Sep 11 '18 at 15:34
  • @brian d foy, [The linked answer](https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default) doesn't even address the OP's example. Reopening. – ikegami Sep 12 '18 at 01:19
  • 2
    1) Perl's support for Unicode on Windows is abysmal. For dealing with files, Perl builtins use the "(A)NSI" interface, but you need to use the "(W)ide" interface to use arbitrary Unicode characters. To do that, you want [Win32-Unicode](https://metacpan.org/release/Win32-Unicode). – ikegami Sep 12 '18 at 01:20
  • 2) I can't think of anything exposing the Wide interface of `CreateProcess`, so you might have to use Win32::API to do so for a Unicode version of `system`. – ikegami Sep 12 '18 at 01:20
  • 3) For STDIN/STDOUT/STDERR, you want to use `chcp 65001` from the console, and use `use open ':std', ':encoding(UTF-8)';` as on a UTF-8 Unix system – ikegami Sep 12 '18 at 01:20
  • thanks for your contributions. Please see my answer post below. – toshniba Sep 12 '18 at 07:17

3 Answers

3

Here's the real, concrete and definite answer by someone who just recently went through this exact problem:

You cannot, on Windows, have Perl 5.28.0 or below use UTF-8 for everything.

Here is why: as of Perl 5.28.0, the Perl core file handling functions are fatally broken for this. Windows stores file names as (simply put) UTF-16, and the Windows API wide character functions return file names as wide chars, similar to what Perl already operates with internally. However, when getting these from the file system, the Perl core converts them into bytes in the encoding of the local system, and does the reverse when writing file names. So, morally, you have this kind of flow, paraphrased as Perl:

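use utf8;
use Encode qw(encode decode);

# Stand-in for the system ANSI code page; cp1252 is only an example value.
use constant CP_ACP => 'cp1252';

sub readdir_perl {
    my $dir = shift;
    my $fn = readdir $dir;         # conceptually: wide chars from the OS
    $fn = encode(CP_ACP, $fn);     # forced down to code page bytes
    return $fn;
}

sub open_perl {
    my $fn = shift;
    $fn = decode(CP_ACP, $fn);     # code page bytes back to wide chars
    open my $FH, '<', $fn;
    return $FH;
}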

Two important notes:

  • All of the stuff above is paraphrased. It's roughly how the perl core implements these functions in C, and you cannot usefully change them, nor CP_ACP, for the duration of a program.
  • The conversion from wide chars to CP_ACP is forced through. It doesn't bail on errors. If there are wide chars that cannot be represented usefully, it converts them to a ? character, leaving you with a handful of garbage.
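To make the second note concrete, here is a minimal sketch using the core Encode module. It is only an illustration, not what the Perl core actually runs: cp1252 is an assumption standing in for the system's ANSI code page, and the file name is made up. Called without a CHECK argument, encode() substitutes characters it cannot map instead of failing, which resembles the lossy conversion described above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical file name: "Täst", a space, one CJK character, ".pdf".
my $name = "T\x{E4}st \x{4E2D}.pdf";

# cp1252 is an assumed stand-in for the system ANSI code page.
# Without a CHECK argument, encode() replaces unmappable characters
# with the encoding's substitution character rather than dying.
my $bytes = encode('cp1252', $name);

# Show the resulting bytes with non-ASCII ones written as \xNN.
(my $shown = $bytes) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
print "$shown\n";
```

The ä survives because cp1252 has a slot for it; the CJK character does not, and once it has been replaced there is no way to get the original name back from the bytes.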

That said, what can you do?

  1. Use Win32::LongPath. It handles most of what you need internally, for files. Be aware that it only works reliably on volumes with short paths enabled, which is usually C: and nothing else. Use system as normal, but ensure you treat everything as bytes and decode/encode appropriately. Some example code exists. You'll also need to implement ALL file handling manually, and you can't usefully monkeypatch other code to use the LongPath functions.
  2. Wait until the Perl core is fixed. As far as I know, there are currently no plans to do this anytime soon, as any kind of simple fix is likely to break legacy scripts that rely on the UTF-16 to system code page conversion to usefully munge Unicode umlauts into äöü on German systems, etc.
  3. Use a different language. Maybe PowerShell.
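For option 1, a rough sketch of what "implement all file handling manually" can look like. The helper name open_unicode_path is hypothetical, and the sketch assumes Win32::LongPath is installed on the Windows machine and that its openL() accepts a decoded Perl string as the path. The branch for other systems exists only so the sketch runs anywhere; it assumes a file system that takes UTF-8 encoded names.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: open a file whose name is a decoded Perl string.
# On Windows it goes through Win32::LongPath's openL(); elsewhere it
# falls back to the built-in open with a UTF-8 encoded name (assumption).
sub open_unicode_path {
    my ($path, $mode) = @_;
    $mode = '<' unless defined $mode;
    my $fh;
    if ($^O eq 'MSWin32') {
        require Win32::LongPath;
        Win32::LongPath::openL(\$fh, $mode, $path)
            or die "cannot open '$path': $^E\n";
    }
    else {
        open($fh, $mode, encode('UTF-8', $path))
            or die "cannot open '$path': $!\n";
    }
    return $fh;
}
```

Every place that would normally call open has to call such a wrapper instead, which is why third-party modules that open files by name themselves are not helped by it.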
Mithaldu
  • 2
    This question and answer have spurred an effort to improve the situation in Perl 5.30 – khw Sep 19 '18 at 21:28
1

First set the codepage of your command prompt to 65001

chcp 65001

This will allow you to use and display UTF-8 characters in the command prompt. File names depend on the file system being used. NTFS stores file names using the UTF-16LE encoding. See this question on how to create and access files with Unicode file names on Windows.

system() commands need to be encoded in the same code page as the command prompt, so after doing a chcp 65001 you can encode the system() command in UTF-8.

Community
JGNI
0

As there is no suitable answer so far, I'll try to write down a working sample here. Hopefully it will eventually be free of errors. Until then, please post your suggestions/solutions; I'll test them and update the code when they work.

Currently unsolved problems:

  • opening the pdf file by open
  • opening the pdf file by CAM::PDF->new
  • processing the pdf file by system call

test.pl:

$| = 1;
use strict;
use warnings;
use utf8;
use CAM::PDF;
use open ':std', ':encoding(UTF-8)';
BEGIN {
  if ($^O eq "MSWin32") {
    require Win32::Unicode::File;
    Win32::Unicode::File->import();
  }
}

my $file = 'Täst.pdf';
print "FILENAME: $file\n";

unlink("file2.pdf");
copyW($file, "file2.pdf") or print "cannot copy file: $!\n";

if (!open(FH, $file)) {
  print "cannot open file by open '$file': $!\n";
}
else {close FH}

my $pdf = CAM::PDF->new($file) or print "cannot open file by CAM::PDF: $!\n";
print "\n";

# Note: system() returns 0 on success, so this "or" branch runs on exit status 0, not on failure.
system("pdftk.exe $file cat 2 4 output out.pdf") or print "cannot run command: $!\n";
print "\n";

test.cmd:

Requires Font "Lucida Console" to be set for the commandline window.

@echo off
chcp 65001 >nul
call perl.exe test.pl
chcp 850 >nul
pause

Output under Windows:

FILENAME: Täst.pdf

cannot open file by open 'Täst.pdf': No such file or directory

cannot open file by CAM::PDF: No such file or directory

Error: Unable to find file.
Error: Failed to open PDF file:
   Täst.pdf
Drücken Sie eine beliebige Taste . . .
toshniba
  • "Error: Failed to open PDF file:" is caused, I think, by the file name being encoded in UTF-8 while the file system expects UTF-16LE. – JGNI Sep 12 '18 at 08:07
  • `open` expects cp1252 or whatever your system's ANSI code page is. You can use `CreateFileW` from Win32API::File instead. The name is a bit confusing. This is the system call for creating *file handles* (i.e. opening files), not creating files. – ikegami Sep 12 '18 at 09:15
  • 1
    CAM::PDF's constructor accepts the contents of a file (`$self = CAM::PDF->new(content | filename | '-')`). You should read the file yourself (with `:raw`!) and pass what you read to CAM::PDF instead of the file name – ikegami Sep 12 '18 at 09:17
  • As previously mentioned, you'll need to use `CreateProcessW` to pass the file name to `pdftk` (and hope that `pdftk` uses Wide system calls!) – ikegami Sep 12 '18 at 09:18
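The advice above about reading the file yourself and handing the content to CAM::PDF could be sketched roughly as follows. The helper name slurp_raw is hypothetical. It takes an already opened filehandle, on the assumption that the handle came from a Wide-aware open (for example via Win32::LongPath or Win32API::File), since the built-in open is the part that fails for such names.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from an already opened
# filehandle as raw bytes, with no encoding or CRLF translation.
sub slurp_raw {
    my ($fh) = @_;
    binmode($fh, ':raw') or die "cannot switch handle to :raw: $!\n";
    local $/;                     # slurp mode: no record separator
    my $content = <$fh>;
    return defined $content ? $content : '';
}

# Usage sketch (not executed here; needs CAM::PDF and a handle from a
# Wide-aware open):
#   my $pdf = CAM::PDF->new(slurp_raw($fh));
```

The explicit binmode matters when `use open ':std', ':encoding(UTF-8)'` is in effect, as in test.pl, because handles would otherwise get a decoding layer that is unsuitable for PDF data.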