Autodetect the character encoding when reading a file

Question

I sometimes have to read text files from external sources, which can use various character encodings; usually UTF-8, Latin-1 or Windows CP-1252.

Is there a way to conveniently read these files, autodetecting the encoding like editors such as Vim do?

I'm hoping for something as simple as:

open(my $f, '<:encoding(autodetect)', 'foo.txt') or die "Oops: $!";

Note that Encode::Guess does not do the trick: it only works if an encoding can be unambiguously detected, otherwise it croaks. Most UTF-8 data is nominally valid latin-1 data, so it fails on UTF-8 files.

Example:

#!/usr/bin/env perl

use 5.020;
use warnings;

use Encode;
use Encode::Guess qw(utf-8 cp1252);

binmode STDOUT, ':encoding(UTF-8)';

my $utf8 = "H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!"; # "Héllo, Wørld!" in UTF-8
my $latin = "H\x{E9}llo, W\x{F8}rld!";            # "Héllo, Wørld!" in CP-1252

# Version 1
my $enc1 = Encode::Guess->guess($latin);
if (ref($enc1)) {
    say $enc1->name, ': ', $enc1->decode($latin);
}
else {
    say "Oops: $enc1";
}
my $enc2 = Encode::Guess->guess($utf8);
if (ref($enc2)) {
    say $enc2->name, ': ', $enc2->decode($utf8);
}
else {
    say "Oops: $enc2";
}

# Version 2
say decode("Guess", $latin);
say decode("Guess", $utf8);

Output:

cp1252: Héllo, Wørld!
Oops: utf-8-strict or utf8 or cp1252
Héllo, Wørld!
cp1252 or utf-8-strict or utf8 at ./guesstest line 32.

The version under "Update" in Borodin's answer works on UTF-8 data but croaks on Latin-1 data. Encode::Guess simply cannot be used if you need to handle both UTF-8 and Latin-1 files.

This isn't intended to be the same question as this one: I'm looking for a way to autodetect when opening a file.

There is no way to reliably detect the character set for plain text. – Mat, Jul 06 '18 at 11:08
Not 100% reliably, no. But it should certainly be possible to do a “good enough” detection of the major charsets. Even with a 1% failure rate it's still better than just assuming UTF-8 or Latin-1 and falling over or showing weird characters all the time. (And vim, for instance, does a great job: I can't remember it ever misdetecting a file.) – mscha, Jul 06 '18 at 11:39
Note that Windows 1252 (CP-1252) is ***not*** the same as ISO-8859-1. The former has printable characters in most of the first 32 positions of the upper half, `"\x80"` to `"\x9F"`, where the latter has the same range unassigned to leave space for the C1 control character set. – Borodin, Jul 06 '18 at 12:20
Windows-1252 (cp1252) is a superset of iso-8859-1, though. (HTML5, for instance, uses windows-1252 as a default to deal with incorrectly specified Windows content.) – mscha, Jul 06 '18 at 12:27
@mscha: I commented only because you wrote `latin-1/windows-1252` in your original post. Latin-1 and ISO-8859-1 are two names for the same encoding, but CP-1252 is different. When guessing an encoding it is best to assume that, if the sample data uses character codes `"\x80"` to `"\x9F"`, then it is using CP-1252, where the characters are printable. In Latin-1, the *Latin-1 Supplement* assigns control characters to this range, but I have never seen them used, and I would have thought the existing 32 ASCII control characters were sufficient for any purpose. – Borodin, Jul 06 '18 at 12:42
@TomB: Okay, I over-reached with my claim for ASCII. But UTF-7 is rare and even Microsoft are dropping UTF-16; even so, `Encode::Guess` still identifies them both. – Borodin, Jul 06 '18 at 14:43
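
The CP-1252 versus ISO-8859-1 distinction discussed in these comments can be sketched as a simple byte test. This is only an illustration, not code from the thread: the helper name `guess_single_byte` is made up here, and it assumes the bytes are already known not to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: choose between the two single-byte candidates.
# Assumes the bytes are already known not to be valid UTF-8.
sub guess_single_byte {
    my ($bytes) = @_;

    # 0x80-0x9F are C1 controls in ISO-8859-1 but mostly printable in CP-1252.
    return $bytes =~ /[\x80-\x9F]/ ? 'cp1252' : 'iso-8859-1';
}

my $price = "\x80 5";                        # 0x80 is the euro sign in CP-1252
my $text  = decode(guess_single_byte($price), $price);
printf "U+%04X\n", ord $text;                # prints U+20AC
```

Data without any byte in that range decodes identically under both encodings, so the choice only matters when such a byte is present.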

mscha · Answer 1 · 2018-07-06T12:51:09.167

5

Here's my current workaround. Does the trick, at least for UTF-8 and Latin-1 (or Windows-1252) files.

use 5.024;
use experimental 'signatures';
use Encode qw(decode);

sub slurp($file)
{
    # Read the raw bytes
    local $/;
    open (my $fh, '<:raw', $file) or return undef();
    my $raw = <$fh>;
    close($fh);

    my $content;

    # Try to interpret the content as UTF-8.
    # LEAVE_SRC stops decode() from modifying $raw, which the fallbacks still need.
    eval { $content = decode('utf-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };

    # If this failed, interpret as windows-1252 (a superset of iso-8859-1 and ascii)
    # Test with defined(): an empty file or the content "0" is false but still valid.
    if (!defined $content) {
        eval { $content = decode('windows-1252', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    }

    # If this failed, give up and use the raw bytes
    if (!defined $content) {
        $content = $raw;
    }

    return $content;
}
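
The same try-UTF-8-first idea can also return a file handle, which is closer to the `open` call the question asks for. The following is a minimal sketch under stated assumptions, not part of the original answer: `open_autodetect` is a hypothetical name, only UTF-8 and CP-1252 are considered, and the file is read twice (once raw to test it, once through the chosen layer).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns a read handle whose :encoding layer matches
# the detected encoding (UTF-8 if the bytes decode strictly, else CP-1252).
sub open_autodetect {
    my ($file) = @_;

    # Slurp the raw bytes once so they can be tested.
    open(my $raw_fh, '<:raw', $file) or return;
    my $raw = do { local $/; <$raw_fh> };
    close($raw_fh);
    $raw = '' unless defined $raw;

    # Strict UTF-8 decoding croaks on malformed input; LEAVE_SRC keeps $raw intact.
    my $is_utf8 = eval { decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };
    my $encoding = $is_utf8 ? 'UTF-8' : 'cp1252';

    # Reopen the file with the matching decoding layer.
    open(my $fh, "<:encoding($encoding)", $file) or return;
    return wantarray ? ($fh, $encoding) : $fh;
}
```

A caller would then write something like `my $fh = open_autodetect('foo.txt') or die "Oops: $!";` and read lines from `$fh` as decoded characters.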

edited Jul 06 '18 at 12:51

answered Jul 06 '18 at 11:47

mscha


Perhaps some words about how this works? – Borodin Jul 06 '18 at 12:05
Added some comments to the code. – mscha Jul 06 '18 at 12:30
Comments only obfuscate code. A description in a parallel text document is far more useful. – Borodin Jul 06 '18 at 12:51
It works for me. It reads the file as raw bytes and converts it to the given character set. – balakrishnan Jul 18 '19 at 13:38

Borodin · Answer 2 · 2018-07-06T13:37:28.380

It depends on which encodings you may be dealing with. Take a look at the Encode::Guess module.

In general, it's easy to tell that a file isn't ASCII: ASCII code points fit in seven bits, so any byte over 127 rules it out. It's also possible to tell reliably that a file isn't UTF-8, because multi-byte characters follow a specific pattern in their most-significant bits. Anything else is less reliable, but possible.
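
Those two reliable tests can be sketched as follows. This is an illustration only: `classify_bytes` is a hypothetical helper name, and it uses Encode's strict UTF-8 decoder to do the bit-pattern check instead of inspecting the bits by hand.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: classify a byte string using the two reliable tests.
sub classify_bytes {
    my ($bytes) = @_;

    # ASCII uses only seven bits, so any byte above 0x7F rules it out.
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # Strict UTF-8 decoding croaks on malformed multi-byte sequences.
    my $ok = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC); 1 };
    return $ok ? 'utf-8' : 'other';
}

print classify_bytes("Hello"), "\n";           # ascii
print classify_bytes("H\xC3\xA9llo"), "\n";    # utf-8
print classify_bytes("H\xE9llo"), "\n";        # other
```

The 'other' result only says the data is some non-UTF-8 8-bit encoding; telling the single-byte encodings apart is the less reliable part.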

I don't know what encodings you might be using, but this is the general idea. Encode::Guess ships with the core Encode distribution, and so shouldn't need installing.

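use feature 'say';
use Encode::Guess;

# $data is assumed to hold the raw bytes read from the file
my $enc = guess_encoding($data, qw/ ascii cp1252 iso-8859-1 utf-8 /);

# On success $enc is an encoding object; otherwise it is an error message string
say ref($enc) ? $enc->name : $enc;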

or you can perform a best-guess decoding without checking what the module has chosen

use Encode;    # provides decode()
use Encode::Guess qw/ ascii cp1252 iso-8859-1 utf-8 /;

my $chars = decode("Guess", $data);

Remember that the fewer possible encodings you offer, the more likely the guess is to be accurate. You should read the module documentation carefully.

Update

Here's the OP's one-liner demonstration that Encode::Guess "doesn't do the trick", written as a proper program.

Note that, as the documentation says, guess_encoding may sometimes return a string like "utf-8 or iso-8859-1", in which case there is ambiguity that the programmer must handle. That isn't the case in the OP's example: the data is identified as being UTF-8-encoded, and both guess_encoding and decode('guess', ...) return the correct results.

You can use this code to test Encode::Guess with any byte string that you choose: just modify the contents of $raw.

use strict;
use warnings 'all';
use feature 'say';
use open qw/ :std encoding(UTF-8) /;

use Encode;
use Encode::Guess;
use Data::Dump;

my $raw = qq/H\x{C3}\x{A9}llo, W\x{C3}\x{B8}rld!/;

my $enc = guess_encoding($raw);

if ( my $class = ref $enc ) {
    printf qq{Guessed encoding \$enc is an %s object "%s"\n}, $class, $enc->name
}
else {
    printf qq{Guessed encoding \$enc is a scalar "%s"\n}, $enc;
}

my $chars = decode('guess', $raw);

printf "Decoded characters: %s\n", $chars;
dd $chars;

output

Guessed encoding $enc is an Encode::utf8 object "utf8"
Decoded characters: Héllo, Wørld!
"H\xE9llo, W\xF8rld!"
