Note below how ã changes to a. Note 2: Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below, which shows a similar problem using File::Find.

The particular problem I'm trying to fix involves working with image files stored on a local drive, and manipulating the file names which may contain foreign characters. The two experiments shown below are intermediate debugging steps.

The ã character is common in Latin languages, e.g. http://pt.wikipedia.org/wiki/Cão

Experiment 1

Look closely: cão becomes cao. [screenshot of console output]

Experiment 2

Here I tried using File::Find instead of piped input, in case the issue was with the Windows implementation of the | shell operator. The issue actually gets worse, as the ã becomes Pi. [screenshot of console output]


Debugging update:

I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html, e.g. use utf8, use feature 'unicode_strings', etc, to no avail.


Environment and Version Info

The OS is Windows 7, 64-bit.

The Perl is:

This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
(with 8 registered patches, see perl -V for more detail)

Copyright 1987-2010, Larry Wall

Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
Built Sep  6 2010 22:53:42
Alex R
  • Have you made any attempt to understand UNICODE and make use of it? – David Heffernan Dec 24 '10 at 16:16
  • @David, no, why should I have to become an expert in UNICODE? Admit it, the Perl developers got this one wrong. In any other programming language, an ã in is an ã out. – Alex R Dec 24 '10 at 16:30
  • @Alex R well, why don't you just carry on revelling in your ignorance! – David Heffernan Dec 24 '10 at 16:32
  • The same one-liner works as expected at the Linux command terminal (without our explicitly making use of UNICODE) but drops the accent in Windows cmd.exe. I can't explain why. – d5e5 Dec 24 '10 at 16:32
  • @David, if you know the answer, please post it. Otherwise you're just trying to defend crappy software by blaming it on the user. – Alex R Dec 24 '10 at 16:49
  • I think cmd.exe is likely to be problematic. It is very hard to get UNICODE out of the Python interpreter to cmd.exe. One could always try to re-direct to a file and remove cmd.exe from the picture. – David Heffernan Dec 24 '10 at 16:50
  • @David, I thought about that, but an experiment demonstrates that the problem is likely not CMD.EXE - see my update using File::Find. – Alex R Dec 24 '10 at 16:53
  • @Alex R off the top of my head I don't know the solution to your problem, whatever it is, but it's your problem not mine! If it was my problem I'd solve it and first of all I'd do some web searches on the topic of UNICODE, Perl and cmd.exe. Have you done this? What have you learnt? – David Heffernan Dec 24 '10 at 16:55
  • @Alex R By the way, Perl is absolutely not crappy software. I'd love to know what you have on your resume that is better than Perl. – David Heffernan Dec 24 '10 at 16:58
  • @David, Perl is actually my favorite programming language, although I get paid for writing Java. Now that sucks doesn't it :-) – Alex R Dec 24 '10 at 16:59
  • Maybe a less provocative question title would encourage more constructive answers. Perl is unicode-aware and your example works fine for me on OS X; another correspondent has noted that it works fine in Linux too. It doesn't appear that Perl is the problem here. – Simon Whitaker Dec 24 '10 at 17:00

2 Answers

Perl, as with many other scripting languages, is built on the C runtime.

On Windows, the standard MS C runtime for narrow (byte) characters uses an encoding which defaults to the Windows system encoding (‘ANSI code page’) for IO activities such as opening files or writing to the console.

The ANSI code page is always a locale-specific encoding: usually single-byte, but multi-byte in some locales (e.g. Chinese, Japanese). It is never UTF-8 or anything else capable of reproducing the whole of Unicode; which characters Perl IO can cope with depends on the Windows locale (the “language for non-Unicode programs” setting).

Whilst console apps can be given UTF-8 using the chcp 65001 command, there are a number of serious inconsistencies which come up with doing this. This causes difficulty for a lot of tools on Windows and is something Microsoft really needs to fix, but so far their attitude is that Unicode Equals UTF-16; everyone who wants Unicode to work must use the widechar interfaces.
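To make the code-page point concrete, here is a minimal sketch of what happens when bytes written in one code page are rendered by a console that is using another. The helper name `misread` is made up for illustration, and the two code pages (cp1252 as the ANSI code page, cp437 as the console code page) are assumptions; the question does not say which ones were active. Under those assumptions ã turns into the Greek letter pi, which would be consistent with the "Pi" reported in Experiment 2.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: return the character a console would show when
# $char is written as bytes in $written_as but rendered using $shown_as.
sub misread {
    my ($char, $written_as, $shown_as) = @_;
    my $bytes = encode($written_as, $char);   # bytes handed to the console
    return decode($shown_as, $bytes);         # how the console interprets them
}

# U+00E3 is ã. cp1252 and cp437 are assumed code pages, not taken from the question.
my $seen = misread("\x{E3}", 'cp1252', 'cp437');
printf "U+%04X\n", ord $seen;   # prints U+03C0 (GREEK SMALL LETTER PI)
```

When both sides use the same code page, the character survives the round trip, so keeping the console and ANSI code pages consistent is enough for the characters that code page contains.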

So you won't currently be able to deal with files that use non-ASCII filenames reliably in Perl on Windows. Sorry.

You could try Python (which added special Windows-only filename handling to get around this problem in version 2.3 onwards; see PEP 277), or one of the Unicode-aware Windows Scripting Host languages. Either way, getting Unicode out to the console on Windows still has more pitfalls.

bobince
  • Thanks for the reference to "chcp"... that has led me to some possible light at the end of the tunnel. I'm now studying this: http://stackoverflow.com/questions/1427796/batch-file-encoding which appears to be closely related to my problem. – Alex R Dec 24 '10 at 22:58
  • thanks for the tip. My problem has been solved! Executing "chcp 1252" in CMD.EXE prior to executing perl causes both Experiment 1 and Experiment 2 to succeed. Adding { system 'chcp 1252'; } to the script itself also works. That enabled me to continue debugging my actual script (not shown here) and actually got it working. – Alex R Dec 24 '10 at 23:23
  • OK, that'll at least ensure the code pages are consistent so it'll work for cp1252 (Western European) characters. It'll not help for Unicode characters outside of that small selection, though. – bobince Dec 24 '10 at 23:28
  • yeah, Western European (latin languages) is good enough for me. No need to support Chinese, Arabic, etc. Thanks again – Alex R Dec 24 '10 at 23:37

The following three-liner works as expected on my newly minted ActivePerl 5.12.2:

use utf8;
open(my $file, '>:encoding(UTF-8)', "output.txt") or die $!;
print $file "さっちゃん";

I think the culprit is cmd.exe.
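If the goal is to take cmd.exe out of the picture for the File::Find case as well, something along these lines might work. It is only a sketch: `dump_names` is a hypothetical helper, and it assumes that the names File::Find returns are byte strings in the ANSI code page (cp1252 in the usage note below), which should be checked against the machine's actual setting.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Find;

# Hypothetical helper: collect the names of plain files under $dir and
# write them to $out_path as UTF-8, so the console is never involved.
# $ansi is the code page the file name bytes are ASSUMED to be in.
sub dump_names {
    my ($dir, $out_path, $ansi) = @_;
    my @names;
    find(sub {
        return unless -f $_;
        my $bytes = $_;                      # copy so $_ is left untouched
        push @names, decode($ansi, $bytes);  # bytes -> Perl characters
    }, $dir);
    open(my $out, '>:encoding(UTF-8)', $out_path) or die "open $out_path: $!";
    print {$out} "$_\n" for sort @names;
    close($out) or die "close $out_path: $!";
    return scalar @names;                    # number of names written
}
```

Calling `dump_names('.', 'names.txt', 'cp1252')` and opening names.txt in a UTF-8 aware editor shows whether the names arrive intact before any console output is attempted. Characters outside the ANSI code page will still not survive, as the other answer explains.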

David Heffernan
  • Thanks. I guess what I need is the equivalent of '>:encoding(UTF-8)' in File::Find. – Alex R Dec 24 '10 at 17:21
  • @Alex R if you had any class you'd give me an up-vote and retract your criticism of the Perl developers!! ;-) – David Heffernan Dec 24 '10 at 17:34
  • @Alex R. No, not really. You still haven't said exactly what you are trying to do. Do you really need to print the file names on the command line? – Dan Dec 24 '10 at 21:20
  • @Dan, my script needs to rename files based on a regex based algorithm. Printing on the console is just an intermediate debugging step. – Alex R Dec 24 '10 at 22:39