5

I am having trouble with PHP regarding encoding.

I have a JavaScript/jQuery HTML5 page interact with my PHP script using $.post. However, PHP is facing a weird problem, probably related to encoding.

When I write

htmlentities("í")

I expect PHP to output í. However, instead it outputs í At the beginning, I thought that I was making some mistake with the encodings, however

htmlentities("í")=="í"?"Good":"Fail";

is outputing "Fail", where

htmlentities("í")=="í"?"Good":"Fail";

But htmlentities($search, null, "utf-8") works as expected.

I want to have PHP communicate with a MySQL server, but it has encoding problems too, even if I use utf8_encode. What should I do?

EDIT: On the SQL command, writing

SELECT id,uid,type,value FROM users,profile
WHERE uid=id AND type='name' AND value='XXX';

where XXX contains no í chars, works as expected, but it does not if there is any 'í' char.

SET NAMES 'utf8';
SET CHARACTER SET 'utf8';
SELECT id,uid,type,value FROM users,profile
WHERE uid=id AND type='name' AND value='XXX';

Not only fails for í chars, but it ALSO fails for strings without any 'special' characters. Removing the ' chars from SET NAMES and SET CHARACTER SET doesn't seem to change anything.

I am connecting to the MySQL database using PDO.

EDIT 2: I am using MySQL version 5.1.30 of XAMPP for Linux.

EDIT 3: Running SHOW VARIABLES LIKE '%character%' from PhpMyAdmin outputs

character_set_client    utf8
character_set_connection    utf8
character_set_database  latin1
character_set_filesystem    binary
character_set_results   utf8
character_set_server    latin1
character_set_system    utf8
character_sets_dir  /opt/lampp/share/mysql/charsets/

Running the same query from my PHP script(with print_r) outputs:

Array
(
    [0] => Array
        (
            [Variable_name] => character_set_client
            [0] => character_set_client
            [Value] => latin1
            [1] => latin1
        )

    [1] => Array
        (
            [Variable_name] => character_set_connection
            [0] => character_set_connection
            [Value] => latin1
            [1] => latin1
        )

    [2] => Array
        (
            [Variable_name] => character_set_database
            [0] => character_set_database
            [Value] => latin1
            [1] => latin1
        )

    [3] => Array
        (
            [Variable_name] => character_set_filesystem
            [0] => character_set_filesystem
            [Value] => binary
            [1] => binary
        )

    [4] => Array
        (
            [Variable_name] => character_set_results
            [0] => character_set_results
            [Value] => latin1
            [1] => latin1
        )

    [5] => Array
        (
            [Variable_name] => character_set_server
            [0] => character_set_server
            [Value] => latin1
            [1] => latin1
        )

    [6] => Array
        (
            [Variable_name] => character_set_system
            [0] => character_set_system
            [Value] => utf8
            [1] => utf8
        )

    [7] => Array
        (
            [Variable_name] => character_sets_dir
            [0] => character_sets_dir
            [Value] => /opt/lampp/share/mysql/charsets/
            [1] => /opt/lampp/share/mysql/charsets/
        )

)

Running

SET NAMES 'utf8';
SET CHARACTER SET 'utf8';
SHOW VARIABLES LIKE '%character%'

outputs an empty array.

luiscubal
  • 24,773
  • 9
  • 57
  • 83

3 Answers3

18

It's very important to specify the encoding of htmlentities to match that of the input, as you did in your final example but omitted in the first three.

htmlentities($text,ENT_COMPAT,'utf-8');

Regarding communications with MySQL, you need to make sure the connection collation and character set matches the data you are transmitting. You can either set this in the configuration file, or at runtime using the following queries:

SET NAMES utf8;
SET CHARACTER SET utf8;

Make sure the table, database and server character sets match as well. There is one setting you can't change at run-time, and that's the server's character set. You need to modify it in the configuration file:

[mysqld]
character-set-server = utf8
default-character-set = utf8 
skip-character-set-client-handshake

Read more on characters sets and collations in MySQL in the manual.

Eran Galperin
  • 86,251
  • 24
  • 115
  • 132
  • PhpMyAdmin says that the field I am trying to get is encoded using utf8_bin, and I thought that'd be enough. I'll try your solution, though. – luiscubal Jan 01 '09 at 23:51
  • The field is encoded in UTF, but you need to make sure the connection is using the same encoding (for some reason the default is ISO-8859) – Eran Galperin Jan 01 '09 at 23:54
  • Thank you. But it's still not working. I've updated my question and added further details. – luiscubal Jan 02 '09 at 00:02
  • What version of MySQL are you using? – Eran Galperin Jan 02 '09 at 00:18
  • I'm using 5.1.30 (XAMPP for Linux) /opt/lampp/bin/mysql --version /opt/lampp/bin/mysql Ver 14.14 Distrib 5.1.30, for pc-linux-gnu (i686) using EditLine wrapper I'll add this information to the post. – luiscubal Jan 02 '09 at 00:24
  • Argh! Comments do not support line breaks! – luiscubal Jan 02 '09 at 00:27
  • It is very weird you cannot change the collation from the command line. Do you have superadmin privileges? – Eran Galperin Jan 02 '09 at 01:16
  • In this particular case, I'm not logged in as root. However, PhpMyAdmin does not list any 'SET' privilege. And I've set database-specific privileges for this user. Which particular one should I be looking at? – luiscubal Jan 02 '09 at 14:44
  • Forget what I said about privileges, it should work regardless. Are you sure the collation queries are failing? try to run "SHOW VARIABLES LIKE '%character%';" and see what collation is set for the current connection – Eran Galperin Jan 02 '09 at 15:57
  • The file "etc/my.cnf" doesn't list those settings. Should I add them manually? Also, how can PhpMyAdmin have done it right if it requires manually modifying the configuration? – luiscubal Jan 02 '09 at 18:05
  • you can add those manually. and what do you mean done it right? what did it do right? – Eran Galperin Jan 02 '09 at 18:12
  • When I manually run the queries on PhpMyAdmin, it usually outputs correctly. This bug happens *only* on MY PHP script. SHOW VARIABLES LIKE, for example, outputs differently using PhpMyAdmin. – luiscubal Jan 02 '09 at 19:28
  • Sorry, I missed that the results you posted were different. Yes, the connection collation is definitely not set correctly in your PHP script. Are you sure you are running the queries I told you in the same script? – Eran Galperin Jan 02 '09 at 19:51
  • In phpMyAdmin, running SET NAMES utf8; SET CHARACTER SET utf8; SHOW VARIABLES LIKE '%character%'; works perfectly. In my script print_r(callPdoQuery(/*Same query as before, omitted due for comment length reasons*/)->fetchAll()); returns an empty array, even if I run as root. Could it be PDO-related? – luiscubal Jan 02 '09 at 21:15
  • That's probably it. Instead of callPdoQuery, use exec() on a PDO object to run the queries. – Eran Galperin Jan 02 '09 at 21:22
  • However, PDO::exec does not return the result(of SELECT statements, in this case), so I can't use it. prepare(), then execute() has no visible differences. – luiscubal Jan 02 '09 at 21:44
  • I meant using exec() only for setting the character sets/names. You can then use a regular fetchAll() to see if it the environment variables are set correctly. – Eran Galperin Jan 02 '09 at 21:49
  • Changing my.cnf has no visible effects. I think that "SET" commands are intercepting the result. Perhaps, when I run "SET NAMES", it's that value that's returned to fetchAll(), instead of the SELECT result! This shouldn't be a PDO problem because when I use mysql_*, it gives errors anyway. – luiscubal Jan 02 '09 at 21:52
  • Listen, I don't know what to tell you. This the standard practice for enforcing connection encoding, and I've used it on dozens of systems successfully. – Eran Galperin Jan 02 '09 at 21:59
  • OK... :( This is only a development server so I guess I'll have to hope that the final server doesn't have this problem... – luiscubal Jan 02 '09 at 22:22
7

Late revival. But for further reference here are some extra tips:

  1. Use mysql_set_charset instead of SET xxx
  2. Make sure you are saving the file with UTF-8 encoding (this is often overlooked)
  3. Set headers:
    <?php header("Content-type: text/html; charset=utf-8"); ?>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

  4. If your Apache server configuration contains a AddDefaultCharset directive with a different encoding go yell at your host administrator.
Anthony Accioly
  • 21,918
  • 9
  • 70
  • 118
  • Sorry, but this was a development server, which I no longer have. So I am now unable to reproduce this problem. – luiscubal Aug 24 '11 at 12:56
1

I just ran into this issue. I have a whole website's content in Spanish, with all the special characters you can expect (áéíóúñ) and their capital letter versions.

In my case it was an inconsistency with the server charset/collation. Everything else was set to utf8, but the server charset, which had latin1. This caused all utf8 data entered in the database to display in its raw encoded form, likeL í would equal an A with tilde ~ ...

I am using mysqli, and to fix it, I made use of the method explained above by Anthony Accioly (using mysql_set_charset). Said method has a mysqli version and that is what I used.

After that, I was puzzled. I still had a mess when viewing my website. Of course, I didn't know that by changing that latin1 to utf8 I would also mess up the character encode/decode of the whole thing. So I used the help of an online string encoder/decoder to fix my table data.

I made various exports of all my content data (you can set them up to get update queries and that will be faster for your update process) and ran the sql output through the afore mentioned online encoder/decoder, then copy pasted the fixed queries on phpmyadmin sql panel... thus fixing my encoding errors. Everything is now how it should be, AND I am able to process lossy searches again: Maria, maria, maría, mariá will all match maría, maria, Maria, etc. All acute characters evaluate to their base vowel character. Epic Win.

EffectiX
  • 316
  • 2
  • 6