
I have a problem with detecting CP1250 using mb_detect_encoding(); in my case I want to detect 3 encodings:

mb_detect_encoding($string, 'UTF-8,ISO-8859-2,Windows-1250')

But Windows-1250 isn't among the supported encodings. Any solution?

sk8terboi87 ツ
Piotr Olaszewski

2 Answers


mb_detect_encoding always "detects" single-byte encodings. You can read about this in the documentation for mb_detect_order:

mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.

UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP

For ISO-8859-X, mbstring always detects as ISO-8859-X.

For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will always fail.

Conclusions:

  1. It's meaningless to ask for detection of ISO-8859-2; it will always tell you "yes, that's it" (unless of course it detects UTF-8 first); see the sketch after this list.
  2. Windows-1250 is not supported, but even if it were it would work exactly like ISO-8859-2.
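
To make conclusion 1 concrete, here is a small sketch, assuming PHP with the mbstring extension; exact behaviour can vary a little between PHP versions. Windows-1250 cannot even be passed as a candidate, and any string that is not valid UTF-8 is reported as ISO-8859-2, whatever its real encoding:

```php
<?php
// Conclusion 1 in practice, using the candidate list from the question minus
// Windows-1250 (mbstring rejects it as an unsupported encoding). Anything that
// is not valid UTF-8 comes back as ISO-8859-2, whatever the bytes really are.
$cp1250Bytes = "g\xB9szcz";     // "gąszcz" encoded in Windows-1250
$latin1Bytes = "d\xE9j\xE0 vu"; // "déjà vu" encoded in ISO-8859-1

var_dump(mb_detect_encoding($cp1250Bytes, ['UTF-8', 'ISO-8859-2'], true)); // string(10) "ISO-8859-2"
var_dump(mb_detect_encoding($latin1Bytes, ['UTF-8', 'ISO-8859-2'], true)); // string(10) "ISO-8859-2"
```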

In general, it is impossible to detect single-byte encodings with accuracy. If you find yourself needing to do that in PHP, you will need to do it manually; don't expect very good results.
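
If you do go the manual route for the three encodings from the question, about the best you can do is verify UTF-8 (which is reliable) and fall back to a guess for everything else. A minimal sketch, assuming those are the only encodings you expect; detect_encoding_naive() is a hypothetical helper, not part of mbstring:

```php
<?php
// A minimal manual detector for the three encodings from the question. Only
// UTF-8 can be verified reliably; every byte string is equally "valid"
// ISO-8859-2 and Windows-1250, so anything that is not UTF-8 just gets a
// fallback that you choose yourself.
// detect_encoding_naive() is a hypothetical helper, not part of mbstring.
function detect_encoding_naive(string $bytes, string $fallback = 'Windows-1250'): string
{
    // mb_check_encoding() rejects byte sequences that are not well-formed UTF-8.
    return mb_check_encoding($bytes, 'UTF-8') ? 'UTF-8' : $fallback;
}

// Usage: normalise incoming text to UTF-8; iconv does accept Windows-1250.
$input = "za\xBF\xF3\xB3\xE6"; // "zażółć" in Windows-1250 (the same bytes in ISO-8859-2, as it happens)
$utf8  = iconv(detect_encoding_naive($input), 'UTF-8', $input);
```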

Jon

It is not feasible to distinguish ISO-8859-2 from Windows-1250, or any other single-byte encoding from any other encoding for that matter. mb_detect_encoding simply gives you the first encoding which is valid for the given string, and both are equally valid. "Detecting" encodings is by definition not possible with any amount of accuracy.
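
To illustrate that point, here is a sketch; the return values shown assume the mbstring behaviour described in the answer above and may differ slightly on other PHP versions. The same bytes validate against whichever single-byte encoding you ask about, so the "detected" encoding is purely a matter of which candidate gets checked:

```php
<?php
// The same bytes pass validation for every single-byte encoding, so the
// "detected" encoding only reflects which candidate was asked about.
$bytes = "\xB1\xB6\xB9"; // not valid UTF-8, but "valid" in any 8-bit encoding

var_dump(mb_detect_encoding($bytes, 'ISO-8859-2', true)); // string(10) "ISO-8859-2"
var_dump(mb_detect_encoding($bytes, 'ISO-8859-1', true)); // string(10) "ISO-8859-1"
var_dump(mb_detect_encoding($bytes, 'UTF-8', true));      // bool(false)
```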

deceze
  • detecting the encoding is important when you need to be sure that you are working with UTF-8, e.g. `iconv('Windows-1250', 'UTF-8', $str)` will give you different results than `iconv('ISO-8859-2', 'UTF-8', $html)` – gondo Sep 26 '13 at 13:39
  • Yes of course it will give you different results, just as getting wrong whether you're converting from a JPEG or PNG would. That still doesn't make any argument for why detection is possible. Or am I missing your point? – deceze Sep 26 '13 at 14:37
  • well, if it's possible to determine that an encoding conversion (in my example, iconv from Windows-1250 to UTF-8) returns a wrong result, then based on that it should be possible to determine the proper encoding simply by trying different conversions and comparing the results. – gondo Sep 28 '13 at 09:33
  • And how do you determine that the result is "wrong"? If it works technically without error (which it does from any 8-bit encoding, always), then the conversion is "correct". Only you as a human can figure out that it is apparently not. Or you'd have to apply really good heuristics, but that's still *guessing*. – deceze Sep 28 '13 at 09:50
  • correct. however, "guessing" is "detection with `some` amount of accuracy". `enconv` does a pretty good job of guessing; in fact, it's the best thing I could find. – gondo Sep 28 '13 at 10:08
  • You can make a very good guess based on the country the string comes from. There are byte values that, in one encoding, are letters heavily used in some countries (like Poland or the Czech Republic), while in the other encoding the same byte values are symbols that are rarely used or not used at all. You can base the guess on checking for the presence of those bytes, or on statistics (counting; see the sketch below). For example, if Polish-language text contains many 0xB1 and 0xB6 bytes, it is probably ISO-8859-2: those are "ą" and "ś" in that encoding, both very common in Polish, whereas in Windows-1250 the same bytes are "±" and "¶", which hardly ever appear. – Zbyszek Jul 01 '15 at 17:23
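
A rough sketch of the counting idea from the last comment, assuming the text is already known to be Polish and is either ISO-8859-2 or Windows-1250. The helper name and the chosen hint bytes are illustrative, not a library API, and this is still guessing, as discussed above:

```php
<?php
// A rough byte-counting guess between ISO-8859-2 and Windows-1250 for text
// that is already known to be Polish.
function guess_polish_8bit_encoding(string $bytes): string
{
    // Bytes that are the Polish letters ą/ś/ź (and capitals) in Windows-1250
    // but rare symbols or control codes in ISO-8859-2 ...
    $cp1250Hints = ["\xB9", "\x9C", "\x9F", "\xA5", "\x8C", "\x8F"];
    // ... and the bytes that are those letters in ISO-8859-2 but rare
    // punctuation (such as ± and ¶) in Windows-1250.
    $iso2Hints = ["\xB1", "\xB6", "\xBC", "\xA1", "\xA6", "\xAC"];

    $score = static function (array $hints) use ($bytes): int {
        $n = 0;
        foreach ($hints as $hint) {
            $n += substr_count($bytes, $hint);
        }
        return $n;
    };

    // Ties (e.g. plain ASCII input) default to Windows-1250 here; pick what suits you.
    return $score($cp1250Hints) >= $score($iso2Hints) ? 'Windows-1250' : 'ISO-8859-2';
}

echo guess_polish_8bit_encoding("g\xB9szcz \x9Cwie\xBFo\x9Cci"); // Windows-1250 ("gąszcz świeżości")
```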