File encoding for English and Chinese text

Question

I am building a dynamic SQL file that can contain English characters as well as Chinese, Russian, Vietnamese, and so on. Each text excerpt is in its own file and is encoded properly. I need to read in each of these files and output a single file that contains all of these characters. I am using Perl to read the input files and write the output file.

My question has two parts.

1. What file encoding supports both English and non-English text?
2. Using Perl, can I automatically convert the input files into that encoding?

For part 2, I believe I need to read each file in its proper encoding in order to convert it. I have searched and found Encode::Guess, but I am not sure whether it works or exactly how to use it.

I found a related SO question; its first answer explains a lot, but not how to do it.

I don't think you need to guess here. Familiarize yourself with your tool-set and learn to find out which encoding any of your files have. If you don't know for sure with what encoding you start, you don't even need to start. — innaM, Jul 23 '13 at 14:48
Well, running Encode::Guess returns UTF-16BE for most of the Asian text. I receive the files from a vendor and need to automate the import into the database. This is why I don't want to determine the file encoding manually but would rather let the script handle it for me. — Ishikawa91, Jul 23 '13 at 14:53
Please don't guess unless your definition of fun is very, very strange. If your vendor distributes those files, I'm sure he'll be able to tell you which encoding they have. — innaM, Jul 23 '13 at 15:14
OK (I am new to Perl), so let's say my vendor tells me it's in encoding X. How do I get from that to, let's say, UTF-8 (which I believe supports multiple languages)? — Ishikawa91, Jul 23 '13 at 15:23

score 2 · Answer 1 · edited May 23 '17 at 10:26

piconv -f UTF-16BE         -t UTF-8 < input-file > output-file
piconv -f $source_encoding -t UTF-8 < input-file > output-file

piconv, an iconv work-alike, is part of Encode and ships with Perl.
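
If the conversion has to happen inside a Perl script rather than on the command line, the same decode-then-encode step can be sketched with the core Encode module. This is only a minimal illustration: the helper name to_utf8 is hypothetical, and the sample bytes are the two characters "中文" written out as UTF-16BE by hand.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a byte string from $source_encoding to UTF-8.
sub to_utf8 {
    my ( $bytes, $source_encoding ) = @_;

    # FB_CROAK makes malformed input die instead of being silently replaced.
    my $text = decode( $source_encoding, $bytes, Encode::FB_CROAK );

    return encode( 'UTF-8', $text );
}

# U+4E2D U+6587 as UTF-16BE bytes (sample data for illustration only).
my $utf16be = "\x4E\x2D\x65\x87";
my $utf8    = to_utf8( $utf16be, 'UTF-16BE' );
```

Here $utf8 holds raw UTF-8 bytes, so it should be written through a filehandle without an encoding layer (for example one opened with '>:raw') to avoid encoding it twice.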

To detect the source encoding, use better modules than Encode::Guess. See "How can I guess the encoding of a string in Perl?"
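
Short of real detection, a script can at least verify that a file decodes cleanly in the encoding the vendor claims before importing it. The sketch below is a sanity check, not a detector, and is not one of the modules the linked question recommends; the helper name decodes_cleanly is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes is well-formed in $encoding.
sub decodes_cleanly {
    my ( $bytes, $encoding ) = @_;

    # FB_CROAK dies on malformed input; LEAVE_SRC keeps $bytes unmodified.
    return eval {
        decode( $encoding, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
        1;
    } ? 1 : 0;
}

# U+4E2D U+6587 as UTF-16BE bytes (sample data for illustration only).
my $sample = "\x4E\x2D\x65\x87";

my $ok_as_utf16be = decodes_cleanly( $sample, 'UTF-16BE' );
my $ok_as_utf8    = decodes_cleanly( $sample, 'UTF-8' );
```

A clean decode does not prove the claimed encoding is the right one: ISO-8859-1, for instance, accepts every possible byte sequence. A failed decode, however, is a reliable sign that the claim is wrong.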

answered Jul 23 '13 at 15:46

daxim

score 1 · Accepted Answer · answered Jul 23 '13 at 15:46

Answering the question in your last comment, here's how to convert from one encoding to another encoding:

#!/usr/bin/perl
use strict;
use warnings;

sub read_encoded {
    my $file_name = shift;
    my $encoding  = shift;

    my $content;
    if ( open my $fh, "<:encoding($encoding)", $file_name ) {
        $content = do {
            local $/;
            <$fh>;
        };
    }
    else {
        die "Could not open $file_name: $!";
    }

    return $content;
}

sub write_file {
    my $file_name = shift;
    my $content   = shift;

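    if ( open my $fh, '>:encoding(UTF-8)', $file_name ) {
        print {$fh} $content;

        # Buffered write errors only surface at close, so check it.
        close $fh or die "Could not write $file_name: $!";
    }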
    else {
        die "Could not open $file_name: $!";
    }
}

my $content1 = read_encoded( 'file1.txt', 'latin-1' );
my $content2 = read_encoded( 'file2.txt', 'UTF-16BE' );

write_file( 'output', $content1 . $content2 );

Assuming you have two files, file1.txt and file2.txt, encoded in latin-1 and UTF-16BE respectively, this little script will read both files and write their combined content to a UTF-8-encoded file named output.
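
For more than two input files, the same idea can be driven by a list of file name and encoding pairs. The following is a sketch under that assumption, using core Perl only; the helper name combine_files and the file names in the usage note are hypothetical, and the encodings are expected to come from the vendor rather than from guessing.

```perl
use strict;
use warnings;

# Hypothetical helper: append each input file, decoded from its own
# encoding, to a single UTF-8-encoded output file.
sub combine_files {
    my ( $output_name, @inputs ) = @_;    # @inputs: [ file name, encoding ] pairs

    open my $out, '>:encoding(UTF-8)', $output_name
        or die "Could not open $output_name: $!";

    for my $input (@inputs) {
        my ( $file_name, $encoding ) = @$input;

        open my $in, "<:encoding($encoding)", $file_name
            or die "Could not open $file_name: $!";

        # Copy line by line so large files are not held in memory at once.
        while ( my $line = <$in> ) {
            print {$out} $line;
        }

        close $in;
    }

    # Buffered write errors only surface at close, so check it.
    close $out or die "Could not write $output_name: $!";
}
```

It could be called as combine_files( 'output.sql', [ 'english.txt', 'iso-8859-1' ], [ 'chinese.txt', 'UTF-16BE' ] ), with one pair per excerpt file.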

All that code is, like, 2 lines when using [File::Slurp](http://p3rl.org/File::Slurp). — daxim, Jul 23 '13 at 15:50
