I sometimes have to read text files from external sources, which can use various character encodings, usually UTF-8, Latin-1, or Windows CP-1252.
Is there a way to conveniently read these files, autodetecting the encoding like editors such as Vim do?
I'm hoping for something as simple as:
open(my $f, '<:encoding(autodetect)', 'foo.txt') or die "Oops: $!";
Note that Encode::Guess does not do the trick: it only works if an encoding can be unambiguously detected; otherwise it croaks. Most UTF-8 data is also nominally valid Latin-1 data, so the guess is ambiguous and it fails on UTF-8 files.
Example:
#!/usr/bin/env perl
use 5.020;
use warnings;
use Encode;
use Encode::Guess qw(utf-8 cp1252);
binmode STDOUT, ':encoding(UTF-8)';
my $utf8 = "H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!"; # "Héllo, Wørld!" in UTF-8
my $latin = "H\x{E9}llo, W\x{F8}rld!"; # "Héllo, Wørld!" in CP-1252
# Version 1
my $enc1 = Encode::Guess->guess($latin);
if (ref($enc1)) {
say $enc1->name, ': ', $enc1->decode($latin);
}
else {
say "Oops: $enc1";
}
my $enc2 = Encode::Guess->guess($utf8);
if (ref($enc2)) {
say $enc2->name, ': ', $enc2->decode($utf8);
}
else {
say "Oops: $enc2";
}
# Version 2
say decode("Guess", $latin);
say decode("Guess", $utf8);
Output:
cp1252: Héllo, Wørld!
Oops: utf-8-strict or utf8 or cp1252
Héllo, Wørld!
cp1252 or utf-8-strict or utf8 at ./guesstest line 32.
The version under "Update" in Borodin's answer works only on UTF-8 data, but croaks on Latin-1 data.
Encode::Guess simply cannot be used if you need to handle both UTF-8 and Latin-1 files.
This isn't intended to be the same question as this one: I'm looking for a way to autodetect when opening a file.
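For illustration, here is a minimal sketch of the behaviour I'm after, not a real `:encoding(autodetect)` layer. It assumes that only UTF-8 and CP-1252 need to be told apart, and it treats any input that is valid strict UTF-8 as UTF-8. The helper names `decode_utf8_or_cp1252` and `read_text_file` are made up for this example. It is only a heuristic: CP-1252 text that happens to form valid UTF-8 byte sequences would be misdetected.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode as strict UTF-8 if the bytes are valid UTF-8,
# otherwise fall back to CP-1252 (assumption: these are the only candidates).
sub decode_utf8_or_cp1252 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $bytes unmodified so it can be reused for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : decode('cp1252', $bytes);
}

# Hypothetical helper: slurp a file as raw bytes, then decode it.
sub read_text_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    return decode_utf8_or_cp1252($bytes);
}
```

With the two test strings from the example above, both `decode_utf8_or_cp1252($utf8)` and `decode_utf8_or_cp1252($latin)` should yield the same character string. What I'd like is something that does this (or better) at `open` time.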