Is there a way in Perl to determine whether a string is encoded as UTF-8 or cp1252?

ikegami
CJ7
  • If you have at least a few bytes in the range 0x80-0xFF, they would quickly produce invalid UTF-8 if they are in CP1252. See the proposed duplicate for how to detect that. – tripleee Dec 12 '17 at 05:04
  • I have filenames that people are creating from Windows or Mac, which means it is either `cp1252` or `utf8` and I need a way to automatically determine which it is. – CJ7 Dec 12 '17 at 06:07
  • Let me put it this way: UTF-8 is *extremely* picky about what constitutes a valid sequence. Every random string is valid cp1252 but very few random strings are valid UTF-8. Of course, if the file name doesn't contain any characters in the range 0x80-0xFF, it is valid UTF-8 *and* CP1252, and semantically identical in both scenarios. – tripleee Dec 12 '17 at 06:11
  • Of course, the real way is to have the person giving it to you not leave out that essential information. – Tom Blodget Dec 12 '17 at 17:57
  • I don't see that this is really a duplicate (of what was marked); while that is a useful page it's just one way to go at this. I've added a link from ikegami's answer. Finding more would be helpful. – zdim Dec 12 '17 at 19:38
  • There are three to five possible results for this analysis (depending if you consider Windows-1252 to have 256 codepoints or fewer (248, 251)): valid and the same text in both; valid in both but different text; valid in UTF-8 but not in Windows-1252; valid in Windows-1252 but not in UTF-8; invalid in both. [That's the general idea for any two candidate encodings; perhaps valid in UTF-8 but not in Windows-1252 isn't actually possible.] And, then, after considering validity, you can get into likelihood (informed guessing), if you know something about the text expected. – Tom Blodget Dec 13 '17 at 18:12

2 Answers

1

The core module Encode::Guess should be up to the task:

use Encode::Guess;

my $enc = guess_encoding($data, qw(cp1252));  # utf8 among defaults

and then

ref($enc) or die "Can't guess: $enc";  # on failure $enc is an error string, not an object
my $utf8 = $enc->decode($data);        # decoded text

(from docs).

To avoid also using the default suspects ("ascii, utf8 and UTF-16/32 with BOM"), set the suspects first

Encode::Guess->set_suspects(qw(utf8 cp1252));

and then get the encoding

my $enc = guess_encoding($data);

Or, copied from docs

my $decoder = Encode::Guess->guess($data);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);

See documentation for details.
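
One thing to watch for: bytes that are valid UTF-8 are usually also valid cp1252, so with both as suspects the guess may come back as an error string (too ambiguous) rather than an encoding object. Below is a minimal sketch of one way to handle that. The helper name `guess_utf8_or_cp1252` is hypothetical, and preferring UTF-8 when the bytes decode strictly is an assumption, not something Encode::Guess does for you.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;

# Hypothetical helper (not part of Encode::Guess).
# Returns (encoding name, decoded text), or dies if nothing fits.
sub guess_utf8_or_cp1252 {
    my ($bytes) = @_;

    # ascii and utf8 are default suspects; cp1252 is added here.
    my $enc = guess_encoding($bytes, qw(cp1252));
    return ($enc->name, $enc->decode($bytes)) if ref $enc;

    # $enc holds an error message at this point.
    # Assumption: if the bytes are strictly valid UTF-8, treat them as UTF-8,
    # since cp1252 text with high bytes is rarely valid UTF-8 by accident.
    my $copy = $bytes;    # decode() with FB_CROAK may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ('UTF-8', $text) if defined $text;

    die "Can't guess: $enc";
}

my ($name, $text) = guess_utf8_or_cp1252("caf\xE9");    # not valid UTF-8
print "$name\n";
```

The input must be a byte string (for example a filename as returned by readdir), not text that has already been decoded.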


The two encodings differ in many ways; see the comment by tripleee and, for example, this post

zdim
1
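# True if the bytes are well-formed UTF-8 (only the copy in $tmp is decoded).
my $could_be_utf8 = utf8::decode( my $tmp = $string );

# True if the string contains none of the five bytes that cp1252 leaves undefined.
my $could_be_cp1252 = $string !~ /[\x81\x8D\x8F\x90\x9D]/;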
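
The two checks above can be combined into a single decision. This is only a sketch: the function name `classify_bytes` and its return labels are made up for illustration, and treating anything that is valid UTF-8 as UTF-8 is an assumption based on how rarely cp1252 text happens to be valid UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a byte string as 'ascii', 'utf8',
# 'cp1252' or 'neither'. Expects undecoded bytes.
sub classify_bytes {
    my ($bytes) = @_;

    # No byte above 0x7F: identical text in both encodings.
    return 'ascii' if $bytes !~ /[^\x00-\x7F]/;

    my $could_be_utf8   = utf8::decode( my $tmp = $bytes );
    my $could_be_cp1252 = $bytes !~ /[\x81\x8D\x8F\x90\x9D]/;

    return 'utf8'   if $could_be_utf8;      # assumption: valid UTF-8 wins
    return 'cp1252' if $could_be_cp1252;
    return 'neither';
}

for my $name ("report.txt", "caf\xC3\xA9.txt", "caf\xE9.txt") {
    printf "%s\n", classify_bytes($name);
}
```

Once the label is known, `Encode::decode` with the matching encoding name turns the bytes into text.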

If you need to handle a string that contains a mix of both, see Fixing a file consisting of both UTF-8 and Windows-1252.

ikegami