
I know there are plenty of related topics about this issue, but I haven't been able to fix the problem with any of them.

I have a MySQL table with words, some of which contain Scandinavian letters such as å, ä and ö. When I output them with echo or print_r(), those letters always appear as �. I have tried utf8_encode(), which produces a different invalid result. Using mb_detect_encoding(), I have found that the words containing these letters are already detected as UTF-8.

Example words:

A = the word (and expected output)
B = echo word
C = echo utf8_encode(word)
D = mb_detect_encoding(word)
E = mb_detect_encoding(utf8_encode(word))

+-------+-------+-------+-------+-------+
|   A   |   B   |   C   |   D   |   E   |
+-------+-------+-------+-------+-------+
| word  | word  | word  | ASCII | ASCII |
|  työ  |  ty�  | tyã¶  | UTF-8 | UTF-8 |
|  ylä  |  yl�  | ylã¤  | UTF-8 | UTF-8 |
+-------+-------+-------+-------+-------+

The collation of all my MySQL tables is utf8_swedish_ci (character set utf8), and when initializing PDO I have

$dbh = new PDO("mysql:host=xxxx;dbname=yyyy;charset=utf8", "zzzz", "****");
$dbh->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES 'utf8'");

Also, all of my files are encoded as UTF-8 without BOM, and before outputting anything I send header("Content-Type: text/html; charset=UTF-8");

Using ini_set('default_charset', 'UTF-8'); at the beginning of the PHP file does nothing.

So the question is: how can I actually output the words correctly? I'd also like to know why utf8_encode() changes the output from wrong (UTF-8) to a different wrong (still UTF-8), so that I actually learn something about this mess called encoding.

Prisoner
bloodleh
  • Is your file AS UTF-8 and not as ANSI? – Funk Forty Niner Sep 09 '14 at 20:31
  • Yes, all of my files are set as UTF-8 without BOM. – bloodleh Sep 09 '14 at 20:32
  • Did you go through this yet http://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Funk Forty Niner Sep 09 '14 at 20:33
  • I did now. Tried changing collation and charset at db init to utf8mb4 but it didn't change anything. Also checked [this](http://www.sebastianviereck.de/en/php-mysql-special-characters-umlauts-utf8-iso/) - there must be a better way than using `str_replace()`. – bloodleh Sep 09 '14 at 20:45
  • low tech solution: convert them with `htmlentities` before you store them in the db http://www.danshort.com/HTMLentities/index.php?w=latin – Joe T Sep 09 '14 at 21:00
  • What does `echo rawurlencode($word)` say? – georg Sep 09 '14 at 21:10
  • Also your data looks like `strtolower` has been applied to utf8 strings. Did you do that? – georg Sep 09 '14 at 21:13
  • @georg You are right, I did use `strtolower`, straight after pulling the data from DB. Now after investigating [I found out it was a terrible idea](http://stackoverflow.com/questions/2516448/problems-with-strtolower-function). Damn PHP5 and multi-byte characters! Thank you for pointing me at the correct answer! – bloodleh Sep 09 '14 at 21:29
  • when handling multibyte encoding (like UTF-8) the way to go is using mb_* variants like 'mb_strtolower'. For a reference on using unicode: http://www.joelonsoftware.com/articles/Unicode.html and http://php.net/manual/en/mbstring.overload.php – Blablaenzo Feb 10 '15 at 23:06
  • Probably you should answer your question. It's still on "unanswered" list :) – Paweł Tomkiel Feb 26 '15 at 20:47
  • @Paul Tomkiel Thanks for reminding! I thought it would have been silly to answer my own question (especially when someone else provided the answer!) but as it has been so long, it's best to get it on the answered list. – bloodleh Feb 27 '15 at 14:21

1 Answer


The problem was caused by using strtolower on the strings.

Apparently PHP 5's regular string functions work on single bytes and are not UTF-8 aware, so strtolower can alter individual bytes inside a multi-byte character and leave an invalid sequence behind.
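
To see what probably happened to the bytes, here is a minimal sketch. It does not call strtolower itself, because its behaviour depends on the PHP version and locale. Instead it assumes a Latin-1 style locale in which the byte 0xC3 ('Ã') gets lowercased to 0xE3 ('ã'), and emulates that with str_replace. That assumption fits column C of the table in the question, where utf8_encode() printed 'ã'. mb_convert_encoding from ISO-8859-1 stands in for utf8_encode(), which does the same conversion; the mbstring extension is assumed to be loaded.

```php
<?php
// "työ" as UTF-8: 't', 'y', then the two-byte sequence C3 B6 for 'ö'.
$word = "ty\xC3\xB6";

// Assumption: emulate a byte-wise, locale-aware strtolower() that turns
// the lead byte 0xC3 ('Ã' in Latin-1) into 0xE3 ('ã' in Latin-1).
$broken = str_replace("\xC3", "\xE3", $word);

echo bin2hex($word), "\n";   // 7479c3b6
echo bin2hex($broken), "\n"; // 7479e3b6

// E3 announces a three-byte character, but only one byte follows,
// so the string is no longer valid UTF-8 and browsers show U+FFFD.
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// Same conversion utf8_encode() performs: treat every byte as ISO-8859-1.
// E3 becomes 'ã' (C3 A3) and B6 becomes '¶' (C2 B6), i.e. "tyã¶".
$reencoded = mb_convert_encoding($broken, 'UTF-8', 'ISO-8859-1');
echo bin2hex($reencoded), "\n"; // 7479c3a3c2b6
```

If that assumption holds, it would also explain the two kinds of wrong output: the broken string is invalid UTF-8, while the utf8_encode() result is valid UTF-8 that spells the wrong characters.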

The solution was to use mb_strtolower (documentation) instead, with the encoding set to UTF-8.
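
A short sketch of the replacement call, assuming the mbstring extension is loaded. The variable names and the sample word are only for illustration; in the real code the string would come from the database.

```php
<?php
// "TYÖ" as UTF-8; the uppercase 'Ö' is the two-byte sequence C3 96.
$word = "TY\xC3\x96";

// Passing the encoding explicitly avoids depending on mb_internal_encoding().
$lower = mb_strtolower($word, 'UTF-8');

echo $lower, "\n";                            // työ
var_dump($lower === "ty\xC3\xB6");            // bool(true)
var_dump(mb_check_encoding($lower, 'UTF-8')); // bool(true)
```

The other mb_* functions (mb_strtoupper, mb_substr, mb_strlen and so on) take the same optional encoding argument.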

More info: Function Overloading Feature (provided by Blablaenzo)

Thanks georg for the answer!

Community
bloodleh