
I'm not asking about reading file content in UTF-8 or some other encoding. This is about file names. I usually save my Perl scripts in the system default encoding (GB2312 in my case), and opening files works without problems. For processing purposes, though, I now have some Perl scripts saved in UTF-8. The problem is that these scripts cannot open files whose names consist of characters encoded in GB2312, and I would rather not rename my files.

Does anyone have experience dealing with this kind of situation? Thanks, as always, for any guidance.

Edit

Here's the minimized code to demonstrate my problem:

# I'm running ActivePerl 5.10.1 on Windows XP (Simplified Chinese version)
# The file system is NTFS

#!perl -w
use autodie;

my $file = "./测试.txt";    # the file name consists of two Chinese characters
open my $in, '<', $file;

while (<$in>) {
    print;
}

This test script runs fine if saved in "ANSI" encoding (I assume ANSI encoding is the same as GB2312, which is used to display Chinese characters). But it does not work if saved as "UTF-8", and the error message is as follows:

Can't open './娴嬭瘯.txt' for reading: 'No such file or directory'.

In this error message, "娴嬭瘯" is garbled text, not the name I typed.

Update

I tried first encoding the file name as GB2312 but it does not seem to work :( Here's what I tried:

#!perl -w
use autodie;
use Encode;

my $file = "./测试.txt";
encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";

while (<$in>){
print;
}

My current thinking is: the file name in my OS is 测试.txt but it is encoded as GB2312. In the Perl script the file name looks the same to human eyes, still 测试.txt. But to Perl, they are different because they have different internal representations. But I don't understand why the problem persists when I already converted my file name in Perl to GB2312 as shown in the above code.
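To make the "different internal representations" concrete, here is a minimal sketch that dumps the bytes of the same two characters in both encodings. It assumes the script itself is saved as UTF-8 and that the Windows "ANSI" code page on this system is cp936 (GBK, a superset of GB2312); neither assumption is confirmed beyond what is described above.

```perl
use strict;
use warnings;
use utf8;                        # assumption: this file is saved as UTF-8
use Encode qw(encode decode);

my $name = "测试";               # two characters once `use utf8` is in effect

# The bytes a UTF-8 script holds when the literal is not decoded.
my $utf8_bytes = encode("UTF-8", $name);

# The bytes the Simplified Chinese ANSI code page (assumed cp936) uses.
my $gbk_bytes = encode("cp936", $name);

# Reading the UTF-8 bytes as if they were cp936 gives a different string,
# which should match the garbled name in the error message.
my $garbled = decode("cp936", $utf8_bytes);

binmode(STDOUT, ":encoding(UTF-8)");
printf "UTF-8 bytes: %s\n", unpack("H*", $utf8_bytes);
printf "cp936 bytes: %s\n", unpack("H*", $gbk_bytes);
print  "UTF-8 bytes read as cp936: $garbled\n";
```

The two byte strings have different lengths (three bytes per character in UTF-8, two in cp936), so a name built from one cannot match a name stored as the other.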

Update

I made it, finally made it :)

@brian's suggestion is right. I made a mistake in the above code: I called encode but didn't assign the encoded file name back to $file, so $file still held the original UTF-8 bytes.

Here's the solution:

#!perl -w
use autodie;
use Encode;

my $file = "./测试.txt";
$file = encode("gb2312", decode("utf-8", $file));
open my $in, '<', $file;

while (<$in>) {
    print;
}
brian d foy
Mike
  • What OS and filesystem are you using? – JB. Nov 16 '09 at 14:29
  • Can you post the code for opening the files? This may be quite helpful in understanding the issue. – Jack M. Nov 16 '09 at 14:58
  • @JB, I'm running Windows XP (Simplified Chinese version) and the filesystem is NTFS. – Mike Nov 17 '09 at 01:42
  • @Jack M. Okay, I'm updating my question. – Mike Nov 17 '09 at 01:43
  • I think you should `use utf8;` at the top, and then skip the `decode` step. The utf8 pragma tells Perl that your source code (including string literals) is already UTF-8. – cjm Nov 25 '09 at 22:51
  • @cjm, when we use UTF-8, Perl sees the Chinese characters in the source code as their UTF-8 byte values, but my Windows system treats those bytes as GB2312 and decodes them accordingly, which is wrong. Skipping the decode step won't solve the problem. – Mike Nov 26 '09 at 01:45

1 Answer


If you

 use utf8;

in your Perl script, that merely tells perl that the source is in UTF-8. It doesn't affect how perl deals with the outside world. Are you turning on any other Perl Unicode features?
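As a small illustration of what the pragma does and does not do, this sketch compares the same literal with and without it. It assumes the file is saved as UTF-8; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumption: this file is saved as UTF-8.
# Without the pragma, the literal stays a string of raw UTF-8 bytes.
my $as_bytes = do { no utf8; "测试" };

# With the pragma, perl decodes the literal into two characters.
my $as_chars = do { use utf8; "测试" };

printf "without utf8: length %d\n", length $as_bytes;
printf "with utf8:    length %d\n", length $as_chars;
```

Either way, the pragma only changes how the literal is read from the source file. The string handed to open still has to be converted to whatever byte encoding the file system interface expects.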

Are you having problems with every filename, or just some of them? Can you give us some examples, or a small demonstration script? I don't have a filesystem that encodes names as GB2312, but have you tried encoding your filenames as GB2312 before you call open?

If you want specific strings encoded with a specific encoding, you can use the Encode module. Try that with your filenames that you give to open.
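A minimal sketch of that suggestion, assuming the source is saved as UTF-8 and that the system expects file names in cp936 (the Simplified Chinese Windows code page, a superset of GB2312). The helper name `fs_name` and the default encoding are hypothetical choices for illustration, not part of Encode.

```perl
use strict;
use warnings;
use utf8;                    # assumption: source saved as UTF-8
use Encode qw(encode);

# Hypothetical helper: convert a character string into the bytes the
# file system interface is assumed to expect (cp936 by default).
sub fs_name {
    my ($name, $encoding) = @_;
    $encoding = "cp936" unless defined $encoding;
    # FB_CROAK makes encode die if a character has no mapping in the
    # target encoding, instead of silently substituting something else.
    return encode($encoding, $name, Encode::FB_CROAK);
}

my $path = fs_name("./测试.txt");

if (open my $in, '<', $path) {
    print while <$in>;
    close $in;
}
else {
    warn "Could not open file: $!\n";
}
```

With `use utf8` in effect the literal is already a character string, so only the encode step is needed. Without the pragma, the literal would be UTF-8 bytes and would need a decode("utf-8", ...) first.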

brian d foy
  • @brian, thanks for the answer. Can I have Perl first convert the GB2312-encoded file name to UTF-8 so that it can recognize the file name? I know how to convert non-UTF-8 file content to UTF-8, but I didn't think of doing it with the file name. – Mike Nov 17 '09 at 02:19
  • @brian, thanks! I finally solved the problem. You're completely right! The solution is exactly as you foresaw: encode the filenames as GB2312 before calling open. – Mike Nov 17 '09 at 03:07