How to determine whether utf-8 or cp1252 encoding?

Question

Is there a way in Perl to determine whether a string is encoded as utf-8 or cp1252?

If you have at least a few bytes in the range 0x80-0xFF, they would quickly produce invalid UTF-8 if they are in CP1252. See the proposed duplicate for how to detect that. — tripleee, Dec 12 '17 at 05:04
I have filenames that people are creating from Windows or Mac, which means they are either `cp1252` or `utf8`, and I need a way to automatically determine which it is. — CJ7, Dec 12 '17 at 06:07
Let me put it this way: UTF-8 is *extremely* picky about what constitutes a valid sequence. Every random string is valid cp1252 but very few random strings are valid UTF-8. Of course, if the file name doesn't contain any characters in the range 0x80-0xFF, it is valid UTF-8 *and* CP1252, and semantically identical in both scenarios. — tripleee, Dec 12 '17 at 06:11
Of course, the real way is to have the person giving it to you not leave out that essential information. — Tom Blodget, Dec 12 '17 at 17:57
I don't see that this is really a duplicate (of what was marked); while that is a useful page it's just one way to go at this. I've added a link from ikegami's answer. Finding more would be helpful. — zdim, Dec 12 '17 at 19:38
There are three to five possible results for this analysis (depending if you consider Windows-1252 to have 256 codepoints or fewer (248, 251)): valid and the same text in both; valid in both but different text; valid in UTF-8 but not in Windows-1252; valid in Windows-1252 but not in UTF-8; invalid in both. [That's the general idea for any two candidate encodings; perhaps valid in UTF-8 but not in Windows-1252 isn't actually possible.] And, then, after considering validity, you can get into likelihood (informed guessing), if you know something about the text expected. — Tom Blodget, Dec 13 '17 at 18:12

zdim · Answer 1 · 2017-12-12T05:45:55.163

The core module Encode::Guess should be up to the task for this^†

use Encode::Guess;

my $enc = guess_encoding($data, qw(cp1252));  # utf8 among defaults

and then

ref($enc) or die "Can't guess: $enc"; # trap error this way
my $utf8 = $enc->decode($data);  # decoded (character) string

(from docs).

To avoid also using the defaults ("ascii, utf8 and UTF-16/32 with BOM"), set the suspects first

Encode::Guess->set_suspects(qw(utf8 cp1252));

and then get the encoding

my $enc = guess_encoding($data);

Or, copied from docs

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

See documentation for details.

^† There are plenty of differences; see comment by tripleee and for example this post
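As a self-contained sketch of the Encode::Guess approach above: the helper name `guess_name` and the sample byte strings are made up for illustration. It also shows a case worth checking for. When more than one suspect can decode the bytes (as with valid UTF-8 that contains non-ASCII characters, which cp1252 can usually decode too), `guess_encoding` returns an error string instead of an encoding object, and the `ref()` test is what catches that.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: returns (encoding name, undef) on success,
# or (undef, error message) when the guess fails or is ambiguous.
sub guess_name {
    my ($bytes) = @_;
    my $enc = guess_encoding($bytes, qw(cp1252));   # cp1252 added to the default suspects
    return ref($enc) ? ($enc->name, undef) : (undef, $enc);
}

# Sample inputs (assumed for illustration): "café" as cp1252 bytes,
# "café" as UTF-8 bytes, and a plain ASCII name.
for my $bytes ("caf\xE9", "caf\xC3\xA9", "cafe") {
    my ($name, $err) = guess_name($bytes);
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    print "$hex: ", (defined $name ? $name : "no guess ($err)"), "\n";
}
```

If the guess comes back ambiguous, one option is to prefer UTF-8, since bytes that happen to form valid UTF-8 are rarely intended as cp1252.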

ikegami · Answer 2 · 2017-12-12T15:24:32.750

my $could_be_utf8 = utf8::decode( my $tmp = $string );

my $could_be_cp1252 = $string !~ /[\x81\x8D\x8F\x90\x9D]/;

If you need to handle a string that contains a mix of both, see Fixing a file consisting of both UTF-8 and Windows-1252.
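The two tests above can be combined into a small classifier covering the outcomes listed in the comments. This is only a sketch: the names `classify_bytes` and `decode_filename` are hypothetical, and preferring UTF-8 when both encodings are possible is an assumption based on the observation that few random byte strings are valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a byte string with the two tests above.
sub classify_bytes {
    my ($bytes) = @_;
    return 'ascii' if $bytes !~ /[\x80-\xFF]/;     # same text in both encodings

    # utf8::decode works in place, so test a copy of the bytes.
    my $could_be_utf8   = utf8::decode( my $tmp = $bytes );
    # These five byte values are unassigned in cp1252.
    my $could_be_cp1252 = $bytes !~ /[\x81\x8D\x8F\x90\x9D]/;

    return 'utf8 or cp1252' if $could_be_utf8 && $could_be_cp1252;
    return 'utf8'           if $could_be_utf8;
    return 'cp1252'         if $could_be_cp1252;
    return 'neither';
}

# Hypothetical helper: decode to a character string, preferring UTF-8
# when both are possible (an assumption, not a certainty).
sub decode_filename {
    my ($bytes) = @_;
    my $kind = classify_bytes($bytes);
    return Encode::decode('UTF-8',  $bytes) if $kind eq 'ascii' || $kind =~ /^utf8/;
    return Encode::decode('cp1252', $bytes) if $kind eq 'cp1252';
    die "Bytes are neither valid UTF-8 nor valid cp1252\n";
}
```

For example, `"\xC3\x81"` (Á in UTF-8) is classified as `utf8` only, because 0x81 is unassigned in cp1252, while a lone `"\xE9"` is classified as `cp1252`.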

edited Dec 12 '17 at 15:24

answered Dec 12 '17 at 15:16

ikegami
