
I sometimes have to read text files from external sources, which can use various character encodings: usually UTF-8, Latin-1 or Windows CP-1252.

Is there a way to conveniently read these files, autodetecting the encoding like editors such as Vim do?

I'm hoping for something as simple as:

open(my $f, '<:encoding(autodetect)', 'foo.txt') or die "Oops: $!";

Note that Encode::Guess does not do the trick: it only works if an encoding can be unambiguously detected, otherwise it croaks. Most UTF-8 data is nominally valid latin-1 data, so it fails on UTF-8 files.

Example:

#!/usr/bin/env perl

use 5.020;
use warnings;

use Encode;
use Encode::Guess qw(utf-8 cp1252);

binmode STDOUT, ':encoding(UTF-8)';

my $utf8 = "H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!"; # "Héllo, Wørld!" in UTF-8
my $latin = "H\x{E9}llo, W\x{F8}rld!";            # "Héllo, Wørld!" in CP-1252

# Version 1
my $enc1 = Encode::Guess->guess($latin);
if (ref($enc1)) {
    say $enc1->name, ': ', $enc1->decode($latin);
}
else {
    say "Oops: $enc1";
}
my $enc2 = Encode::Guess->guess($utf8);
if (ref($enc2)) {
    say $enc2->name, ': ', $enc2->decode($utf8);
}
else {
    say "Oops: $enc2";
}

# Version 2
say decode("Guess", $latin);
say decode("Guess", $utf8);

Output:

cp1252: Héllo, Wørld!
Oops: utf-8-strict or utf8 or cp1252
Héllo, Wørld!
cp1252 or utf-8-strict or utf8 at ./guesstest line 32.

The version under "Update" in Borodin's answer works only on UTF-8 data, and croaks on Latin-1 data. Encode::Guess simply cannot be used if you need to handle both UTF-8 and Latin-1 files.

This isn't intended to be the same question as this one: I'm looking for a way to autodetect when opening a file.

mscha
  • There is no way to reliably detect the character set for plain text. – Mat Jul 06 '18 at 11:08
  • Not 100% reliably, no. But it should certainly be possible to do a “good enough” detection of the major charsets. Even with a 1% failure rate it's still better than just assuming UTF-8 or Latin-1 and falling over or showing weird characters all the time. (And vim, for instance, does a great job: I can't remember it ever misdetecting a file.) – mscha Jul 06 '18 at 11:39
  • Note that Windows 1252 (CP-1252) is ***not*** the same as ISO-8859-1. The former has printable characters in the first thirty-two characters of the upper register, `"\x80"` to `"\x9F"`, where the latter leaves the same range unassigned to make space for the C1 control character set. – Borodin Jul 06 '18 at 12:20
  • Windows-1252 (cp1252) is a superset of iso-8859-1, though. (HTML5, for instance, uses windows-1252 as a default to deal with incorrectly specified Windows content.) – mscha Jul 06 '18 at 12:27
  • @mscha: I commented only because you wrote `latin-1/windows-1252` in your original post. Latin-1 and ISO-8859-1 are two names for the same encoding, but CP-1252 is different. When guessing an encoding it is best to assume that, if the sample data uses character codes `"\x80"` to `"\x9F"`, then it is using CP-1252, where the characters are printable. In Latin-1, the *Latin-1 Supplement* assigns control characters to this range, but I have never seen them used, and I would have thought the existing 32 ASCII control characters were sufficient for any purpose. – Borodin Jul 06 '18 at 12:42
  • @TomB: Okay, I over-reached with my claim for ASCII. But UTF-7 is rare and even Microsoft are dropping UTF-16, even so `Encode::Guess` still identifies them both. – Borodin Jul 06 '18 at 14:43

2 Answers


Here's my current workaround. Does the trick, at least for UTF-8 and Latin-1 (or Windows-1252) files.

use 5.024;
use experimental 'signatures';
use Encode qw(decode);

sub slurp($file)
{
    # Read the raw bytes
    local $/;
    open(my $fh, '<:raw', $file) or return;
    my $raw = <$fh>;
    close($fh);

    my $content;

    # Try to interpret the content as UTF-8.
    # LEAVE_SRC keeps decode() from modifying $raw in place.
    eval { $content = decode('utf-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };

    # If this failed, interpret as windows-1252 (a superset of iso-8859-1 and ascii).
    # Test with defined() so that an empty file or the string "0" counts as success.
    if (!defined $content) {
        eval { $content = decode('windows-1252', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    }

    # If this failed, give up and use the raw bytes
    if (!defined $content) {
        $content = $raw;
    }

    return $content;
}
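
A variant of the same idea, closer to the `open` call asked for in the question, is sketched below. It is only an illustration under the assumption that UTF-8 and Windows-1252 are the only candidates; `open_guessed` is a made-up name, not part of any module, and the file is read twice (once to test it, once to open it with the chosen layer).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: open $file for reading with a guessed encoding layer.
# Assumes the file is either UTF-8 or Windows-1252.
sub open_guessed {
    my ($file) = @_;

    # Slurp the raw bytes once, just to test them
    open(my $raw_fh, '<:raw', $file) or return;
    my $raw = do { local $/; <$raw_fh> };
    close($raw_fh);
    $raw = '' unless defined $raw;

    # Strict UTF-8 decoding dies on the first malformed sequence
    my $is_utf8 = eval {
        decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };

    # Anything that is not well-formed UTF-8 is treated as Windows-1252
    my $layer = $is_utf8 ? ':encoding(UTF-8)' : ':encoding(cp1252)';
    open(my $fh, "<$layer", $file) or return;
    return wantarray ? ($fh, $layer) : $fh;
}
```

Called as `my $fh = open_guessed('foo.txt') or die "Oops: $!";`, it returns a handle that yields decoded characters. Pure ASCII files are reported as UTF-8, which is harmless since ASCII decodes identically under both encodings.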
mscha

It depends what possible encodings you may be dealing with. Take a look at the Encode::Guess module

In general, it's very easy to tell when a file isn't ASCII: ASCII code points fit in seven bits, so any byte over 127 means it's not ASCII. It's also possible to tell reliably when a file isn't UTF-8, as multi-byte characters follow a specific pattern in their most-significant bits. Anything else is less reliable, but possible
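
As a rough sketch of those two checks (the helper names `is_ascii` and `looks_like_utf8` are invented for this illustration), the ASCII test is a single regex over the bytes, and the UTF-8 test can lean on Encode's strict decoder rather than inspecting the bit patterns by hand:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if every byte is in the 7-bit ASCII range
sub is_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# FB_CROAK makes decode() die on the first malformed sequence;
# LEAVE_SRC stops it from modifying its argument.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}
```

Note that a positive result from `looks_like_utf8` only says the bytes are valid UTF-8, not that the author intended them as such; in practice, though, Latin-1 or CP-1252 text containing accented characters is rarely valid UTF-8 by accident.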

I don't know what encodings you might be using, but this is the general idea. Encode::Guess is part of the core Encode module, and so shouldn't need installing

use feature 'say';
use Encode::Guess;

# $data holds the raw bytes to examine
my $enc = guess_encoding($data, qw/ ascii cp1252 iso-8859-1 utf-8 /);
say ref($enc) ? $enc->name : $enc;

or you can perform a best-guess decoding without checking what the module has chosen

  use Encode;
  use Encode::Guess qw/ ascii cp1252 iso-8859-1 utf-8 /;

  my $chars = decode("Guess", $data);

Remember that the fewer the possible encodings that you offer, the more likely the guess is to be accurate. You should read the module documentation carefully


Update

Here's the OP's one-liner demonstration that Encode::Guess "doesn't do the trick" written as a proper program

Note that, as the documentation says, guess_encoding may sometimes return a string like "utf-8 or iso-8859-1", in which case there is ambiguity that the programmer must handle. That isn't the case in the OP's example: the data is identified as being UTF-8-encoded, and both guess_encoding and decode('guess', ...) return the correct results

You can use this code to test Encode::Guess with any byte string that you choose: just modify the contents of $raw

use strict;
use warnings 'all';
use feature 'say';
use open qw/ :std encoding(UTF-8) /;

use Encode;
use Encode::Guess;
use Data::Dump;

my $raw = qq/H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!/;

my $enc = guess_encoding($raw);

if ( my $class = ref $enc ) {
    printf qq{Guessed encoding \$enc is an %s object "%s"\n}, $class, $enc->name
}
else {
    printf qq{Guessed encoding \$enc is a scalar "%s"\n}, $enc;
}

my $chars = decode('guess', $raw);

printf "Decoded characters: %s\n", $chars;
dd $chars;

output

Guessed encoding $enc is an Encode::utf8 object "utf8"
Decoded characters: Héllo, Wørld!
"H\xE9llo, W\xF8rld!"
Borodin