Perl Japanese to English filename replacement

Question

I put together a Perl script that works to replace Japanese file names with English file names. But there are still a couple of things that I don’t quite understand.

I have the following configuration.

Client:

Windows XP Japan

Notepad++ installed

Server:

Red Hat Enterprise Linux Server release 6.2

Perl v5.10.1

VIM : VIM version 7.2.411

Xterm : ASTEC-X version 6.0

CSH: tcsh 6.17.00 (Astron)

The source files are Japanese .csv files generated on Windows. I saw posts about using utf8 and encoding conversion in Perl, and I hope to understand better why I didn’t need anything mentioned in those other threads.

Here is the script that worked. My questions are below.

#!/usr/bin/perl
my $work_dir = "/nas1_home4/fsomeguy/someplace";
opendir(DIR, $work_dir) or die "Cannot open directory";
my @files = readdir(DIR);
foreach (@files) 
{
    my $original_file = $_; 
    s/機/-machine_/; # replace 機 with -machine_
    my $new_file = $_;
    if ($new_file ne $original_file)
    {
        print "Rename " . $original_file . " to " . $new_file . "\n";
        rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
    }
}

Questions:

1) Why isn’t use utf8 required in this sample? In what kind of examples would I need it? use utf8 was discussed in the thread "use utf8 gives me 'Wide character in print'". But if I add use utf8, this script stops working.

2) Why isn’t encoding manipulation required in this sample?
I actually wrote the script on Windows using Notepad++ (pasting the Japanese characters from Windows XP Japan’s Explorer into my script). In xterm and Vim, the characters show up as garbage. But I didn’t have to deal with the encoding manipulation discussed in "How can I convert japanese characters to unicode in Perl?" either.

Thanks.

UPDATES 1

Testing a simple localization sample in Perl for filename and file text replacement in Japanese

In Windows XP, copy the 南 character from within a .csv data file to the clipboard, then use it as both the file name (i.e. 南.txt) and the file content (南). In Notepad++, reading the file under encoding UTF-8 shows x93xEC; reading it under SHIFT_JIS displays 南.

Script:

Use the following Perl script, south.pl, which will be run on a Linux server with Perl 5.10:

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declare the function prototypes
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# both subs work on the file-scoped @files and $work_dir; no arguments are passed
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "\n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "\n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
    open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";   

    while (<IN1>)
    {
        print $_ . "\n";
        chomp;

        s/南/south/g;


        print OUT1 "$_\n";
    }

    close IN1;
    close OUT1; 
}

Result:

(BAD) Uncomment Options 1 and 3 (comment out Options 2 and 4)
Setup: readdir encoding SHIFT_JIS; file open encoding SHIFT_JIS
Result: file name replacement failed.
Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68. \x93

(BAD) Uncomment Options 2 and 4 (comment out Options 1 and 3)
Setup: readdir encoding utf8; file open encoding utf8
Result: file name replacement worked, south.txt generated. But south1.txt file content replacement failed; it has the content \x93 ().
Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25. ... -Ao?= (Bx{fffd}.txt

(GOOD) Uncomment Options 2 and 3 (comment out Options 1 and 4)
Setup: readdir encoding utf8; file open encoding SHIFT_JIS
Result: file name replacement worked, south.txt generated. south1.txt file content replacement worked; it has the content south.

Conclusion:

I had to use different encoding schemes for this example to work properly: utf8 for readdir, and SHIFT_JIS for file processing, since the content of the csv file was SHIFT_JIS encoded.
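
The working combination above can be sketched without touching the file system. This is only a minimal illustration, not the script itself: the sample byte strings are hypothetical stand-ins for what readdir and the .csv file delivered on this setup (UTF-8 file names, SHIFT_JIS content), and the variable names are made up for the example. Everything is written with escapes so the sketch does not depend on the encoding the script file is saved in.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data, written as byte escapes (assumptions for this sketch).
my $name_bytes    = "\xE5\x8D\x97.txt";  # 南.txt as UTF-8 bytes, as readdir returned it here
my $content_bytes = "\x93\xEC\n";        # 南 as SHIFT_JIS bytes, as stored in the Windows file

# 南 as a single character; this is what a pasted 南 literal becomes under "use utf8".
my $south = "\x{5357}";

# Decode each input with the encoding it was actually written in.
my $name    = decode('UTF-8',    $name_bytes);
my $content = decode('shiftjis', $content_bytes);

# Now both sides of each substitution are character strings, so the match works.
(my $new_name    = $name)    =~ s/$south/south/;
(my $new_content = $content) =~ s/$south/south/g;

# Encode again before handing the strings back to rename() or an output file.
my $new_name_bytes    = encode('UTF-8',    $new_name);
my $new_content_bytes = encode('shiftjis', $new_content);

print "$new_name_bytes\n";
print $new_content_bytes;
```

Swapping the two decode calls reproduces the kind of failures listed under the BAD cases, because \x93\xEC is not valid UTF-8 and \xE5\x8D\x97 is not the SHIFT_JIS encoding of 南.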

I think the better question is, why __wouldn't__ it work? If Notepad++ supports unicode, but your version of gvim and xterm don't, then why wouldn't it display the characters? [Discussed here](http://superuser.com/questions/21135/how-can-i-edit-unicode-text-in-notepad). — CornSmith, Jun 11 '13 at 09:05
Built-in `readdir`/`open` etc. are broken w/r/t filesystem encodings. See section *Unicode in Filenames* in [perltodo](http://p3rl.org/todo). To work around the problem, use [Win32-Unicode](http://search.cpan.org/dist/Win32-Unicode) or [Path::Class::Unicode](http://p3rl.org/Path::Class::Unicode) or [PerlIO::fse](http://p3rl.org/PerlIO::fse) or (low-level) [Win32::FileOp](http://p3rl.org/Win32::FileOp)/[Encode::Locale](http://p3rl.org/Encode::Locale). — daxim, Jun 11 '13 at 09:26
CornSmith, I had no problem trying to get xterm to display the Japanese characters, but I also found that isn't a requirement to make this script work. My related question regarding LANG setting is posted here, http://stackoverflow.com/questions/17039705/linux-xterm-and-vim-localization-behavior-of-setenv-lang-vs-setenv-lc-ctype. — frank, Jun 12 '13 at 01:00
daxim, I saw the Path::Class::Unicode example. Like the other Unicode file-processing examples I've seen, it uses a hex representation such as "\x{55ed}.txt". What I really want is to paste the strings from Windows Explorer as-is into my Perl script (so I don't have to look up the Unicode code point online for each replacement, which can be tedious), and then run the Perl script on the Linux server. It seems like my approach here would be the easiest for what I want to accomplish? — frank, Jun 12 '13 at 01:07
@daxim, I have added UPDATES1 section to show a simplified example of what I am trying to do, and the different encoding combinations that I had to use. — frank, Jun 13 '13 at 04:45

score 2 · Accepted Answer · answered Jun 11 '13 at 09:06

Your script is totally Unicode-unaware: it treats all strings as sequences of bytes. Fortunately, the bytes encoding the file names are identical to the bytes encoding the Japanese characters used in the source. If you tell Perl to use utf8, it would interpret the Japanese characters in your script, but not the ones coming from the file system, so there will be no match.
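
A minimal sketch of that difference, using escapes instead of a pasted 機 so it behaves the same regardless of how the file is saved. The file name "機.csv" and the variable names are hypothetical; the assumption is that both the script and the file system store 機 as the UTF-8 bytes e6 a9 9f.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed: readdir returns the name as raw UTF-8 bytes (hypothetical file name).
my $name_bytes    = "\xE6\xA9\x9F.csv";
# What the literal 機 in the script is WITHOUT "use utf8": three separate bytes.
my $literal_bytes = "\xE6\xA9\x9F";
# What the same literal becomes WITH "use utf8": one character, U+6A5F.
my $literal_char  = "\x{6A5F}";

# Bytes against bytes: the original script's situation, which matches.
my $bytes_match = ($name_bytes =~ /\Q$literal_bytes\E/) ? 1 : 0;

# Character against undecoded bytes: the "use utf8 added" situation, no match.
my $mixed_match = ($name_bytes =~ /\Q$literal_char\E/) ? 1 : 0;

# Character against decoded name: Unicode semantics on both ends, matches again.
my $name_chars  = decode('UTF-8', $name_bytes);
my $chars_match = ($name_chars =~ /\Q$literal_char\E/) ? 1 : 0;

printf "bytes vs bytes:           %d\n", $bytes_match;
printf "character vs bytes:       %d\n", $mixed_match;
printf "character vs characters:  %d\n", $chars_match;
```

Either of the two consistent setups works; only the mixed one fails, which is what happens when use utf8 is added without decoding what readdir returns.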

choroba


Do you mind clarifying what you mean by "interpret" in the statement "If you tell Perl to use utf8, it would interpret the Japanese characters in your script"? Would you be able to give a quick example of what would go on? Thanks. – frank Jun 12 '13 at 00:29
@user2412730: Without `use utf8`, what Perl sees both in your script and the file system is the three bytes 230, 169, 159 (e6 a9 9f in hex). After adding `use utf8`, Perl would interpret the character in the script as character 27231 (hex 6a5f). – choroba Jun 12 '13 at 00:56
Also, treating the strings as sequences of bytes seems fairly straightforward if I paste the replacement strings into my script from the file system. It would save a bunch of Unicode lookups and hex strings in the script; see my comments to daxim above. If I don't treat all strings as sequences of bytes, would I still be able to paste human-readable Japanese characters into the script as I've done above, or would I need to use Unicode hex codes for my search strings, as I've seen in quite a few Unicode encoding examples online? – frank Jun 12 '13 at 01:09
@user2412730: If you enable unicode semantics on both ends, everything should work as well. The only problem is when you mix bytes and unicode. – choroba Jun 12 '13 at 01:12
Thanks for the clarifications. Looks like I do want to enable utf8, because the second part of the script does the file processing; see my related post http://stackoverflow.com/questions/17061507/perl-file-processing-on-shift-jis-encoded-japanese-files. You mention I should enable Unicode semantics on both ends, so the other end is readdir/open, which, according to daxim's comment above, is broken. I am writing the script for a Linux server, and some of the references in daxim's comments are Win32-related. What class would you recommend? – frank Jun 13 '13 at 00:59
I have added UPDATES1 section to show a simplified example of what I am trying to do, and the different encoding combinations that I had to use. – frank Jun 13 '13 at 04:43
