
I've read Wikipedia's article on Windows-1252 character encoding. For characters whose byte value is < 128, it should be the same as ASCII/UTF-8.

This makes sense:

php -r "var_export(mb_detect_encoding(\"\x92\", 'windows-1252', true));"
'Windows-1252'

A right curly apostrophe (byte 0x92 in Windows-1252) is detected properly.

php -r "var_export(mb_detect_encoding(\"a\", 'windows-1252', true));"
false

Huh? The letter "a" isn't Windows-1252?

My terminal, where I'm running this, is set to UTF-8, so the letter 'a' should be the same byte sequence as in ASCII. To minimize the variables, I can specify the Windows-1252 byte explicitly:

php -r "var_export(mb_detect_encoding(\"\x61\", 'windows-1252', true));"
false
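
As a sanity check that the last two commands really hand PHP the same input, here is a minimal sketch using only core string functions (no mbstring involved). It assumes the script file itself is saved as UTF-8 or ASCII:

```php
<?php
// The literal "a" and the escape "\x61" should be the same single byte.
var_export("a" === "\x61");   // true
echo PHP_EOL;

// Hex dump of the literal, to confirm the byte value.
echo bin2hex("a"), PHP_EOL;   // 61
```

So the two failing calls are equivalent, and the terminal encoding is not the variable that matters here.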

Changing the "strict" parameter (which has pretty useless documentation) does nothing in these cases.

Hut8

1 Answer


Encoding detection is not supported for windows-1252. According to the mb_detect_order documentation:

mbstring currently implements the following encoding detection filters. If there is an invalid byte sequence for the following encodings, encoding detection will fail.

UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS, ISO-2022-JP

For ISO-8859-*, mbstring always detects as ISO-8859-*.

For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail always.
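
If all you need is a rough "could this be Windows-1252?" test, one possible workaround is to check the bytes yourself. The sketch below is an assumption on my part, not something mbstring provides: `looksLikeWindows1252` is a hypothetical helper name, and it relies only on the fact that Windows-1252 leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) unassigned.

```php
<?php
// Hypothetical helper (not part of mbstring): returns true when $bytes
// contains none of the byte values that Windows-1252 leaves undefined.
function looksLikeWindows1252(string $bytes): bool
{
    // 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no character assigned in
    // Windows-1252. The pattern is matched byte-wise (no /u modifier).
    return preg_match('/[\x81\x8D\x8F\x90\x9D]/', $bytes) === 0;
}

var_export(looksLikeWindows1252("a"));     // true
echo PHP_EOL;
var_export(looksLikeWindows1252("\x92"));  // true  (right curly apostrophe)
echo PHP_EOL;
var_export(looksLikeWindows1252("\x81"));  // false (unassigned byte)
echo PHP_EOL;
```

Keep in mind that this only rules strings out. Almost any byte string passes, including valid UTF-8 or ISO-8859-1 text, so it cannot tell you that a string actually is Windows-1252.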

borrible
  • Heh, and like an *idiot* I had expected that information to be in the `mb_detect_encoding` documentation! – Hut8 Mar 02 '14 at 22:59
  • Thanks! This is probably also why `mb_convert_encoding` falls back to `ISO-8859-1` if it is specified as a fallback for `windows-1252`, even though the characters in the string are invalid in `ISO-8859-1`. PHP making sense as usual. – Ivo Smits Aug 19 '20 at 11:09