Is there a way in Perl to determine which of utf-8 or cp1252 the encoding of a string is?
Viewed 2,884 times
1
-
If you have at least a few bytes in the range 0x80-0xFF, they would quickly produce invalid UTF-8 if they are in CP1252. See the proposed duplicate for how to detect that. – tripleee Dec 12 '17 at 05:04
-
I have filenames that people are creating from Windows or Mac, which means it is either `cp1252` or `utf8` and I need a way to automatically determine which it is. – CJ7 Dec 12 '17 at 06:07
-
Let me put it this way: UTF-8 is *extremely* picky about what constitutes a valid sequence. Every random string is valid cp1252 but very few random strings are valid UTF-8. Of course, if the file name doesn't contain any characters in the range 0x80-0xFF, it is valid UTF-8 *and* CP1252, and semantically identical in both scenarios. – tripleee Dec 12 '17 at 06:11
-
Of course, the real way is to have the person giving it to you not leave out that essential information. – Tom Blodget Dec 12 '17 at 17:57
-
I don't see that this is really a duplicate (of what was marked); while that is a useful page it's just one way to go at this. I've added a link from ikegami's answer. Finding more would be helpful. – zdim Dec 12 '17 at 19:38
-
There are three to five possible results for this analysis (depending if you consider Windows-1252 to have 256 codepoints or fewer (248, 251)): valid and the same text in both; valid in both but different text; valid in UTF-8 but not in Windows-1252; valid in Windows-1252 but not in UTF-8; invalid in both. [That's the general idea for any two candidate encodings; perhaps valid in UTF-8 but not in Windows-1252 isn't actually possible.] And, then, after considering validity, you can get into likelihood (informed guessing), if you know something about the text expected. – Tom Blodget Dec 13 '17 at 18:12
2 Answers
1
The core module Encode::Guess should be up to the task for this†
use Encode::Guess;
my $enc = guess_encoding($data, qw(cp1252)); # utf8 among defaults
and then
ref($enc) or die "Can't guess: $enc"; # trap error this way
my $utf8 = $enc->decode($data);
(from docs).
To avoid also using the default suspects ("ascii, utf8 and UTF-16/32 with BOM"), set the suspects first
Encode::Guess->set_suspects(qw(utf8 cp1252));
and then get the encoding
my $enc = guess_encoding($data);
Or, copied from docs
my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);
See documentation for details.
† There are plenty of differences; see comment by tripleee and for example this post
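
One practical wrinkle: when a byte string decodes cleanly under more than one suspect (for example, non-ASCII text that is valid as both UTF-8 and cp1252), Encode::Guess may return an error string instead of an encoding object. The sketch below is not from the answer; it is one possible way to handle that case, following the reasoning in the comments that accidental valid UTF-8 is rare. The helper name `guess_name` is hypothetical.

```perl
use strict;
use warnings;
use Encode::Guess;    # default suspects include ascii and utf8

# Hypothetical helper: return the name of the guessed encoding, or undef.
# Assumption: if Encode::Guess cannot settle on one suspect but the bytes
# are well-formed UTF-8, treat them as UTF-8.
sub guess_name {
    my ($bytes) = @_;

    my $enc = guess_encoding( $bytes, 'cp1252' );
    return $enc->name if ref $enc;    # exactly one suspect fit

    # $enc is an error string here (ambiguous, or nothing matched).
    # utf8::decode works on a copy so the caller's bytes are untouched.
    return 'utf8' if utf8::decode( my $tmp = $bytes );

    return undef;                     # no usable guess
}

for my $name ( "caf\xE9", "caf\xC3\xA9" ) {
    my $guess = guess_name($name);
    printf "%vd => %s\n", $name, defined $guess ? $guess : 'unknown';
}
```

Preferring UTF-8 in the ambiguous case is a heuristic, not a guarantee: a cp1252 name such as "Ã©" is also well-formed UTF-8.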

zdim
1
my $could_be_utf8 = utf8::decode( my $tmp = $string );
my $could_be_cp1252 = $string !~ /[\x81\x8D\x8F\x90\x9D]/;
If you need to handle a string that contains a mix of both, see Fixing a file consisting of both UTF-8 and Windows-1252.
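
As a rough illustration, the two checks above can be combined into a single classifier that covers the outcomes listed in the comments (same in both, UTF-8, cp1252, invalid in both). This is only a sketch: the function name `classify_bytes` and its return labels are made up for this example, and the choice to prefer UTF-8 whenever the bytes are well-formed UTF-8 is an assumption based on how rarely that happens by accident.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a byte string (not a decoded character string).
sub classify_bytes {
    my ($bytes) = @_;

    # Same two tests as above; utf8::decode runs on a copy.
    my $could_be_utf8   = utf8::decode( my $tmp = $bytes );
    my $could_be_cp1252 = $bytes !~ /[\x81\x8D\x8F\x90\x9D]/;

    # Only bytes 0x00-0x7F: identical text in both encodings.
    return 'ascii'  if $bytes !~ /[^\x00-\x7F]/;

    # Assumption: well-formed UTF-8 with high bytes is treated as UTF-8.
    return 'utf8'   if $could_be_utf8;

    return 'cp1252' if $could_be_cp1252;

    return 'unknown';    # invalid in both
}

for my $name ( "abc", "caf\xC3\xA9", "caf\xE9", "\x81" ) {
    printf "%vd => %s\n", $name, classify_bytes($name);
}
```

Once a name is classified, it can be decoded with `Encode::decode` using the matching encoding name.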

ikegami