I'm trying to convert a string with an unknown charset to UTF-8. I have tried all kinds of solutions, but everything fails. I am using the code from the answer to this question: PHP: Convert any string to UTF-8 without knowing the original character set, or at least try. It works like a charm on my local Vagrant installation, but it fails on my production server.

The conversion code:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

The string to convert: De Krön 2

The error:

iconv(): Detected an illegal character
iconv('', 'UTF-8', 'De Kr\xC3\xB6n 2')

As you can see, the ö is encoded as \xC3\xB6. I have read that this might be caused by copy-pasting from MS Word. However, this is outside my control: I receive a CSV file and need to import it into a database.
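
For reference, the byte pair \xC3\xB6 is the standard UTF-8 encoding of ö (U+00F6), so the sample string can be checked for UTF-8 validity directly. A minimal sketch, assuming the mbstring extension is loaded; the variable names are only illustrative:

```php
<?php
// The sample string from the error message, written with explicit byte escapes.
$text = "De Kr\xC3\xB6n 2";

// true when the bytes form well-formed UTF-8
$isUtf8 = mb_check_encoding($text, 'UTF-8');

// Hex dump of the raw bytes, handy for comparing the input on both servers.
$hex = bin2hex($text);

var_dump($isUtf8, $hex);
```

Running the same check on both machines would show whether the input bytes differ or only the detection settings do.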

As I said, it works locally on my Vagrant (Homestead) installation, but not on my production server. What could cause this issue?

Timo002
  • I'd say the issue is `mb_detect_order()`. The fact that you didn't consider its value to be relevant to the question suggests you are not fully aware of how `mb_detect_encoding()` works. – Álvaro González Jan 11 '16 at 12:12
  • BTW, I know that code comes from the linked question but I'm not sure that the iconv and multi-byte extensions share the same exact encoding names. You should try mb_convert_encoding(). – Álvaro González Jan 11 '16 at 12:14
  • @ÁlvaroGonzález, `mb_convert_encoding()` seems to do the job, but only when I force `ASCII` as the `from encoding`. For this specific string, `mb_detect_encoding()` returns false, so it cannot detect the encoding. I'm not sure whether hard-coding `ASCII` is the right thing to do. And indeed, I'm not fully aware of how `mb_detect_encoding()` works; I only know what it does. At the moment, I only use this code: `mb_convert_encoding($text, 'UTF-8', 'ASCII')`, which seems to be working fine on the production server. – Timo002 Jan 11 '16 at 12:27
  • Comment by wutz under accepted answer explains it pretty well: *«The way I understand it, mb_detect_encoding goes through the list of supplied encodings, and accepts the first one which has no invalid byte sequences in the string ... For encodings which have no invalid byte sequences such as ISO-8859-1, it's always true. No "smart" heuristics, and results vary greatly with the list (and order) of encodings you pass.»* – Álvaro González Jan 11 '16 at 12:33
  • @ÁlvaroGonzález, OK, so `mb_detect_encoding()` returns false because all supplied encodings fail. You should post an answer saying that `mb_convert_encoding()` must be used instead of `iconv()` and explain why. – Timo002 Jan 11 '16 at 12:48
  • The mb_convert_encoding/iconv choice is a minor issue. To answer your precise question we need data we don't have yet: a binary dump of sample data and output of `mb_detect_order()` come to my mind. – Álvaro González Jan 11 '16 at 13:11
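
The comments above note that `mb_detect_encoding()` can return false and that its result depends on the list of candidate encodings. The following is a rough sketch of guarding against that case, not a definitive fix. The function name `convertToUtf8` and the candidate list are hypothetical; passing the list explicitly simply avoids relying on whatever `mb_detect_order()` returns on a given server.

```php
<?php
/**
 * Hypothetical helper: convert $text to UTF-8 using the first candidate
 * encoding that strict detection accepts, or return null if none matches.
 */
function convertToUtf8(string $text, array $candidates): ?string
{
    // Strict mode (third argument true) rejects encodings for which
    // the string contains invalid byte sequences.
    $detected = mb_detect_encoding($text, $candidates, true);

    if ($detected === false) {
        // No candidate matched. Passing false on to iconv() is what
        // produces the empty source charset seen in the error message.
        return null;
    }

    return mb_convert_encoding($text, 'UTF-8', $detected);
}

// Assumed candidate list for illustration only; adjust to the real data.
$candidates = ['UTF-8', 'ISO-8859-1'];

$result = convertToUtf8("De Kr\xC3\xB6n 2", $candidates);
var_dump($result === null ? 'no encoding detected' : bin2hex($result));
```

Returning null makes the "could not detect" case explicit, so the import can log or skip the row instead of failing inside the conversion call.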

0 Answers