I am building a dynamic SQL file that can contain English text as well as Chinese, Russian, Vietnamese, and other non-English text. Each text excerpt is in its own file and encoded properly. I need to read in each of these files and output a single file that contains all of these characters. I am using Perl to read the input and write the output.

My question has two parts.

  1. What file encoding supports both English and non-English text?

  2. Using Perl, can I automatically convert the input files into that encoding?

For part 2, I believe I need to read each file in its proper encoding in order to convert it. I searched and found Encode::Guess, but I am not sure whether it works or exactly how to use it.

I found this SO question; the first answer explains a lot, but not how to actually do the conversion.

Ishikawa91
  • I don't think you need to guess here. Familiarize yourself with your tool-set and learn to find out which encoding any of your files have. If you don't know for sure with what encoding you start, you don't even need to start. – innaM Jul 23 '13 at 14:48
  • Well, running Encode::Guess returns UTF-16BE for most of the Asian text. I receive the files from a vendor and need to automate the import into the database. This is why I don't want to determine the file encoding manually but would rather let the script handle it for me. – Ishikawa91 Jul 23 '13 at 14:53
  • Please don't guess unless your definition of fun is very, very strange. If your vendor distributes those files, I'm sure he'll be able to tell you which encoding they have. – innaM Jul 23 '13 at 15:14
  • OK (I am new to Perl), so let's say my vendor tells me it's in encoding X. How do I get from that to, let's say, UTF-8 (which I believe supports multiple languages)? – Ishikawa91 Jul 23 '13 at 15:23

2 Answers

piconv -f UTF-16BE         -t UTF-8 < input-file > output-file
piconv -f $source_encoding -t UTF-8 < input-file > output-file

piconv, an iconv work-alike, is part of Encode and ships with Perl.

To detect the source encoding, use better modules than Encode::Guess. See How can I guess the encoding of a string in Perl?
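Whichever way a candidate encoding is obtained (from the vendor or from a detection module), a strict decode can at least reject a claim that is clearly wrong. The following is only a minimal sketch: `try_decode` is a hypothetical helper name, not part of Encode, and the sample bytes are made up for illustration. Passing a strict decode does not prove the encoding is right, because many byte sequences are well-formed in more than one encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): return the decoded text, or
# undef if $bytes is not well-formed in $encoding.
sub try_decode {
    my ( $bytes, $encoding ) = @_;

    # FB_CROAK makes decode() die on malformed input instead of silently
    # substituting replacement characters; LEAVE_SRC keeps $bytes intact.
    my $text = eval {
        decode( $encoding, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $text;
}

# The two bytes 8A 9E are U+8A9E in UTF-16BE but are not valid UTF-8.
my $bytes = "\x8A\x9E";
for my $candidate ( 'UTF-8', 'UTF-16BE' ) {
    my $ok = defined try_decode( $bytes, $candidate );
    print "$candidate: ", ( $ok ? 'decodes cleanly' : 'rejected' ), "\n";
}
```

A check like this is best treated as a sanity test on the vendor's stated encoding rather than as a detector.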

daxim

Answering the question in your last comment, here's how to convert from one encoding to another encoding:

#!/usr/bin/perl
use strict;
use warnings;

sub read_encoded {
    my $file_name = shift;
    my $encoding  = shift;

    my $content;
    if ( open my $fh, "<:encoding($encoding)", $file_name ) {
        $content = do {
            local $/;
            <$fh>;
        };
    }
    else {
        die "Could not open $file_name: $!";
    }

    return $content;
}

sub write_file {
    my $file_name = shift;
    my $content   = shift;

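    if ( open my $fh, '>:encoding(UTF-8)', $file_name ) {
        print {$fh} $content;

        # Buffered write errors (e.g. a full disk) only surface on close.
        close $fh or die "Could not close $file_name: $!";
    }
    else {
        die "Could not open $file_name: $!";
    }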
}

my $content1 = read_encoded( 'file1.txt', 'latin-1' );
my $content2 = read_encoded( 'file2.txt', 'UTF-16BE' );

write_file( 'output', $content1 . $content2 );

Assuming you have two files, file1.txt and file2.txt, encoded in latin-1 and UTF-16BE, respectively, this little script will read both files and write the output to a UTF-8-encoded file named output.
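Since the question involves many excerpt files rather than two, the same idea can be extended to a list of (file name, encoding) pairs. This is only a sketch under assumptions: `combine_to_utf8` is a hypothetical helper name, and the encoding of every input file is assumed to be known in advance (for example, from the vendor). It copies line by line, so large inputs are not held in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: append every input file, decoded from its stated
# encoding, to a single UTF-8 output file.
# @inputs is a list of [ file name, encoding ] pairs.
sub combine_to_utf8 {
    my ( $output_name, @inputs ) = @_;

    open my $out, '>:encoding(UTF-8)', $output_name
        or die "Could not open $output_name: $!";

    for my $input (@inputs) {
        my ( $file_name, $encoding ) = @{$input};

        open my $in, "<:encoding($encoding)", $file_name
            or die "Could not open $file_name: $!";

        # The input layer decodes to characters; the output layer
        # re-encodes them as UTF-8.
        while ( my $line = <$in> ) {
            print {$out} $line;
        }

        close $in or die "Could not close $file_name: $!";
    }

    close $out or die "Could not close $output_name: $!";
    return;
}

# Example call (file names are placeholders):
# combine_to_utf8( 'output',
#     [ 'file1.txt', 'latin-1' ],
#     [ 'file2.txt', 'UTF-16BE' ],
# );
```

Keeping the encoding next to each file name makes it easy to drive the call from a small manifest supplied alongside the files.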

innaM