Perl file processing on SHIFT_JIS encoded Japanese files

Question

I have a set of SHIFT_JIS-encoded (Japanese) CSV files from Windows, which I am trying to process on a Linux server running Perl v5.10.1, using regular expressions to make string replacements.

Here is my requirement: I want the regular expressions in the Perl script to be human readable (at least to a Japanese person), i.e. like this: s/北/0/g; rather than littered with hex codes like this: s/\x{5317}/0/g;

Right now, I am editing the Perl script in Notepad++ on Windows, and pasting the strings I need to search for from the CSV data file into the Perl script.

I have the following working test script:

use strict;
use warnings;
use utf8;

my $work_dir = ".";   # assumption: the directory that holds tmp00.csv

open (IN1,  "<:encoding(shift_jis)", "${work_dir}/tmp00.csv") or die "Error: tmp00.csv\n";
open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/tmp01.csv") or die "Error: tmp01.csv\n";

while (<IN1>)
{
    print $_ . "\n";
    chomp;
    s/北/0/g;
    s/10:00/9:00/g;     
    print OUT1 "$_\n";
}    

close IN1;
close OUT1;

This successfully replaces 10:00 with 9:00 in the CSV file, but the issue is that I was unable to replace 北 (i.e. North) with 0 unless use utf8 is also included at the top.

Questions:

1) In the open documentation, http://perldoc.perl.org/functions/open.html, I didn't see use utf8 listed as a requirement. Is it implicit?

a) If I have use utf8 only, then the first print statement in the loop prints garbage characters to my xterm screen.

b) If I call open with :encoding(shift_jis) only, then the first print statement in the loop prints Japanese characters to my xterm screen, but the replacement does not happen. There is no warning that use utf8 was not specified.

c) If I use both a) and b), then this example works.

How does "use utf8" modify the behavior of calling open with :encoding(shift_jis) in this Perl script?

2) I also tried opening the file without any encoding specified. Wouldn't Perl then treat the file's strings as raw bytes, and be able to perform regular expression matches that way, as long as the strings I pasted into the script are in the same encoding as the text in the original data file? I was able to do file name replacement this way earlier without specifying any encoding whatsoever (please refer to my related post here: Perl Japanese to English filename replacement).

Thanks.

UPDATES 1

Testing a simple localization sample in Perl for filename and file text replacement in Japanese

In Windows XP, copy the 南 character from within a .csv data file to the clipboard, then use it as both the file name (i.e. 南.txt) and the file content (南). In Notepad++, reading the file as UTF-8 shows x93xEC, while reading it as SHIFT_JIS displays 南.

Script:

Use the following Perl script, south.pl, which will be run on a Linux server with Perl 5.10:

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declare the function prototypes
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note: filename could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# rename the files listed in @files, then process the file content
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "\n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "\n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
    open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";   

    while (<IN1>)
    {
        print $_ . "\n";
        chomp;

        s/南/south/g;


        print OUT1 "$_\n";
    }

    close IN1;
    close OUT1; 
}

Result:

(BAD) Uncomment Options 1 and 3 (comment out Options 2 and 4)
Setup: readdir encoding SHIFT_JIS; file open encoding SHIFT_JIS
Result: file name replacement failed.
Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68. \x93

(BAD) Uncomment Options 2 and 4 (comment out Options 1 and 3)
Setup: readdir encoding utf8; file open encoding utf8
Result: file name replacement worked, south.txt generated. But south1.txt file content replacement failed, it has the content \x93 ().
Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25. ... -Ao?= (Bx{fffd}.txt

(GOOD) Uncomment Options 2 and 3 (comment out Options 1 and 4)
Setup: readdir encoding utf8; file open encoding SHIFT_JIS
Result: file name replacement worked, south.txt generated. south1.txt file content replacement worked, it has the content south.

Conclusion:

I had to use different encoding schemes for this example to work properly: utf8 for readdir, and SHIFT_JIS for file processing, as the content of the CSV file was SHIFT_JIS encoded.

score 1 · Accepted Answer · answered Jun 12 '13 at 16:01

A good place to start would be to read the documentation for the utf8 module, which says:

The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope.

If you don't have use utf8 in your code, then the Perl compiler assumes that your source code is in your system's native single-byte encoding. And the character '北' will make little sense. Adding the pragma tells Perl that your code includes Unicode characters and everything starts to work.
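The effect described above can be sketched without depending on how the script file itself is saved. This is a minimal illustration, not the asker's script: it assumes the editor stores the literal 北 (U+5317) as the three UTF-8 bytes E5 8C 97, and it uses Encode::decode to stand in for what the parser does to a literal under use utf8. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the three bytes a UTF-8 editor writes for the literal 北 (U+5317).
my $as_bytes = "\xE5\x8C\x97";

# Without "use utf8", Perl treats the literal as three separate one-byte characters.
printf "without use utf8: %d characters\n", length $as_bytes;

# With "use utf8", the parser decodes the literal; decode() is used here as a stand-in.
my $as_chars = decode('UTF-8', $as_bytes);
printf "with use utf8:    %d character, U+%04X\n", length($as_chars), ord $as_chars;
```

The first line reports 3 characters and the second reports 1 character, U+5317. A regular expression built from the three-character version cannot match a line that has been decoded into real characters, which is consistent with the replacement silently doing nothing in case b) of the question.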

Dave Cross


Thanks Dave, would you be able to explain the difference between UTF-8 and SHIFT_JIS? UTF-8 is an encoding scheme, so my confusion is this: if SHIFT_JIS is also a (different) encoding scheme, then when I use the utf8 pragma, my source script (which contains SHIFT_JIS characters) would no longer be interpreted as single bytes but as utf8 characters. I then call open with :encoding(SHIFT_JIS), so the characters from the file would be SHIFT_JIS characters. Unless the two encoding schemes are equal, I can't see how that would work. I couldn't tell if that is the case from the SHIFT_JIS Wikipedia page. – frank Jun 13 '13 at 00:26
I have updated the question with an example that illustrates the different encoding schemes that I had to use in order to make filename replacement and file content replacement work. – frank Jun 13 '13 at 04:48
The encoding of filenames is determined by the operating system. You can't change that. The encoding of file contents is determined by the person who writes the file. – Dave Cross Jun 13 '13 at 08:27
UTF-8 and SHIFT-JIS are completely different encoding schemes. The `use utf8` controls how Perl interprets the characters in your source code. It has no effect on your input or output files. Your source code contains the Unicode character '北'. When you decode your data on the way into your program, your data is converted to Perl's internal character encoding and can therefore be used successfully with literal strings in your source code (which have already been passed through the same conversion). – Dave Cross Jun 13 '13 at 08:32
Thanks, it is starting to make sense, but one thing still gets me. My file's content is encoded in SHIFT-JIS, and I added "use utf8", so according to what you are saying, Perl would start to interpret the literal strings in my source code as utf8. Since UTF-8 and SHIFT-JIS are completely different encoding schemes, wouldn't the search and replace on the SHIFT-JIS file content fail? Wouldn't it only work if, instead of "use utf8", I did a "use shift_jis" (not sure if there is such a thing)? My UPDATES 1 script does work, but I'd like to understand this better. – frank Jun 14 '13 at 00:15
Because (as I wrote in my last comment) "when you decode your data on the way into your program, your data is converted to Perl's internal character encoding and can therefore be used successfully with literal strings in your source code (which have already been passed through the same conversion)". If you write your program correctly (which you have now done) then the data that Perl works on isn't encoded as SHIFT-JIS or encoded as UTF-8 - it's just Perl character strings. – Dave Cross Jun 14 '13 at 06:36
Thanks. As for the literal strings in my source code, you mentioned: "(which have already been passed through the same conversion)". The literal strings are pasted from the CSV file content (which is SHIFT_JIS encoded). In the source code, I added use utf8, so wouldn't that interpret the strings in my source as utf8 instead of SHIFT_JIS? Or is Perl somehow smart enough to know the characters in my source code are actually SHIFT_JIS encoded and convert them to Perl's internal character encoding (which seems to be the case)? – frank Jun 15 '13 at 12:32
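The point made in the comments, that both the file data and the source literal end up as the same Perl character string once each is decoded from its own encoding, can be sketched as follows. This is an illustration under stated assumptions rather than a diagnosis of the asker's setup: it assumes the script file is saved as UTF-8 (so the pasted 北 is stored as E5 8C 97, whatever encoding it was copied from) while the CSV data is Shift_JIS (where 北 is 96 6B). The sample line and variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data: the line "北,10:00" as stored in a Shift_JIS file.
my $file_bytes   = "\x96\x6B,10:00";
# Assumption: the literal 北 as stored in a script saved as UTF-8.
my $source_bytes = "\xE5\x8C\x97";

my $line    = decode('shift_jis', $file_bytes);   # roughly what <:encoding(shift_jis) does on read
my $literal = decode('UTF-8', $source_bytes);     # roughly what use utf8 does to the literal

# Both sides are now character strings, so the substitution matches.
(my $changed = $line) =~ s/\Q$literal\E/0/g;
my $out_bytes = encode('shift_jis', $changed);    # roughly what >:encoding(shift_jis) does on write

print "characters: $changed\n";
print "raw bytes:  ", ($file_bytes =~ /\Q$source_bytes\E/ ? "match" : "no match"), "\n";
```

The character-level substitution yields 0,10:00, while the raw byte comparison reports no match because the two byte sequences for 北 differ. Under these assumptions Perl does not need to guess the encoding of the pasted text: the text editor writes the literal in the script's own encoding, and use utf8 only has to be told that this encoding is UTF-8.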
