2

I'm working against a database that is using UTF8 encoding and has many user names that contain special characters, such as "Ғђ ▫ Sony"

When querying the user table, Lumen responds with incorrect data. I've tried querying the same table using mysqli and PDO and I receive the expected results. I set up a sample route to test it:

$app->get("charset", function() {
    $mysqli = new mysqli("localhost", "user", "password", "database");
    $res = $mysqli->query("select name from users where id = 1");

    $dbh = new PDO('mysql:host=localhost;dbname=database', "user", "password");
    $stmt = $dbh->query("select name from users where id = 1");

    $lumen = DB::select("select name from users where id = 1");

    return response()->json([
        "mysqli" => $res->fetch_assoc(),
        "pdo" => $stmt->fetchAll(PDO::FETCH_ASSOC),
        "framework" => $lumen
    ]);
});

When accessing the route, I get the following response:

{
  "mysqli": {
    "name": "Ғђ ▫ Sony"
  },
  "pdo": [
    {
      "name": "Ғђ ▫ Sony"
    }
  ],
  "framework": [
    {
      "name": "Ò’Ñ’ â–« Sony"
    }
  ]
}

Here's a screenshot of the response in case the text above does not display correctly: broken UTF8 response

As far as I can tell, Lumen's MySQL config defaults to UTF8 and is unchangeable - I found the following in vendor/laravel/lumen-framework/config/database:

'mysql' => [
    'driver'    => 'mysql',
    'host'      => env('DB_HOST', 'localhost'),
    'database'  => env('DB_DATABASE', 'forge'),
    'username'  => env('DB_USERNAME', 'forge'),
    'password'  => env('DB_PASSWORD', ''),
    'charset'   => 'utf8',
    'collation' => 'utf8_unicode_ci',
    'prefix'    => env('DB_PREFIX', ''),
    'timezone'  => env('DB_TIMEZONE','+00:00'),
    'strict'    => false,
],

I'm at a loss as to what could be causing this. What else can I do to try to track down this discrepancy?

Sean Walsh
  • 8,266
  • 3
  • 30
  • 38
  • Is the data saving correctly in the database? Can you see the special characters rendered correctly, say using MYSQL Workbench or a similar tool you use? – Noman Ur Rehman May 14 '15 at 07:26
  • When I query it via MySQL Workbench, it returns "Ò’Ñ’ â–« Sony" as the name. I tried adding `SET NAMES 'utf8'`, but it appears to have no effect. – Sean Walsh May 14 '15 at 13:56
  • encoding issues are always alot of fun, i had some myself aswell.... if you think the DB settings are fine you might wanna double check if the rest is also working.... before you print out your results will this make changes..... header('Content-Type: text/html; charset=utf-8'); //(place it before you start printing our stuff) – lauw May 17 '15 at 08:15
  • @lauw As you can see in the screenshot, the UTF-8 characters render fine when queried via mysqli or a PDO connection that I create. It's only when querying with Lumen's DB library that the characters do not render properly. – Sean Walsh May 17 '15 at 08:18
  • but it might be that your frameworks handles the printing out the data a little different???? i never used laravel but this topic looks very simular + you showed that the DB connections settings looks fine so i would say it must be something with handling whatever comes after it http://stackoverflow.com/questions/26343876/laravel-utf-8-to-database – lauw May 17 '15 at 08:41
  • @lauw Thanks for linking that question. It gave me a little more to dig in to, but I ultimately didn't find anything new. I'm not using Eloquent ORM, so the result is just being returned as a plain array (via PDO::FETCH_ASSOC). – Sean Walsh May 17 '15 at 17:20
  • 1
    Are you sure the data is correctly encoded in utf8 in the database? Seems like you are only using UTF8 for the lumen DB connection (if that is the default), but you do not use UTF8 with your mysqli or PDO connection. Do you get the same result if you set the mysqli charset using `$mysqli->set_charset("utf8");` and PDO charset using `new PDO('mysql:host=localhost;dbname=database;charset=utf8', "user", "password");` ? – Haprog May 18 '15 at 10:43
  • Have you tried using `mb_internal_encoding("UTF-8");`? If you are viewing the output in a web browser make sure the output page charset is defined correctly (e.g. using a header). – Haprog May 18 '15 at 10:45
  • 1
    The MySQL connection charset defines what encoding is used for communication between the MySQL client (PHP) and server. It does not matter what encoding is used in the actual MySQL tables. The MySQL server will automatically convert the data between the table encoding and connection encoding. So the connection encoding basically defines the format in which you expect to get the data from MySQL and also in which format you are inserting data to MySQL. It would seem that here you are correctly getting data in UTF8 from lumen DB connection but the output is not displayed as UTF8. – Haprog May 18 '15 at 10:58
  • 1
    This also explains why the mysqli and PDO output is shown correctly because they are not returning the data in UTF8 (because you have not set their connection charset to UTF8) but by default they seem to match whatever encoding it is you're displaying the output in. – Haprog May 18 '15 at 10:59
  • @Haprog The data is being inserted by another system that I assumed (based on the nature of this system) was connecting to MySQL with the UTF8 charset. Thanks to your comment, I investigated further and discovered that it's actually using latin1. By changing my Lumen connector to use latin1, it pulls the data in the format I'd expect. If you'd like to submit your comment as an answer, I'd be happy to accept it and award the bounty to you. Thank you! – Sean Walsh May 18 '15 at 15:11
  • I posted a combination of my comments as an answer. – Haprog May 19 '15 at 12:40

1 Answers1

1

This answer is based on my previous comments above.

The MySQL connection charset defines what encoding is used for communication between the MySQL client (PHP) and server. It does not matter what encoding is used as the internal encoding in the actual MySQL tables. The MySQL server will automatically convert the data between the table encoding and connection encoding. So the connection encoding basically defines the format in which you expect to get the data from MySQL and also in which format you are inserting data to MySQL.

Are you sure the data is correctly encoded in utf8 in the database?

Seems like you are only using UTF8 for the lumen DB connection (if that is the default), but you do not use UTF8 with your mysqli or PDO connection examples. Do you get the same result if you set the mysqli charset using $mysqli->set_charset("utf8"); and PDO charset using new PDO('mysql:host=localhost;dbname=database;charset=utf8', "user", "password");?

Based on your code and output example it would seem that you are correctly getting data in UTF8 from lumen DB connection but the output is not displayed as UTF8.

This also explains why the mysqli and PDO output is shown correctly because they are not returning the data in UTF8 (because you have not set their connection charset to UTF8) but by default they seem to match whatever encoding it is you're displaying the output in (apparently "latin1" or compatible).

If you are viewing the output in a web browser make sure the output page charset is defined correctly (e.g. using a header).

Edit:

Btw it should not matter what connection encoding is used in the other system that inserts the data as long as the connection encoding matches encoding of the data that is sent through the connection.

Setting connection encoding to latin1 means you will get string data as latin1 when you do a SELECT. So it would seem your output is handled as latin1 instead of UTF-8. It would probably be better if you fixed your script output to correctly display "as UTF-8" instead if your output environment (e.g. web browser) supports it. Because otherwise you will have problems if you need to handle characters that can not be shown in latin1. Though if you output to CLI terminal/console instead then of course you should use the same encoding as your default terminal encoding (which can be UTF-8 or something else). I prefer to have my Linux terminals also configured as UTF-8.

Haprog
  • 793
  • 1
  • 9
  • 21
  • The confusing thing in all of this is that my page's output charset is UTF-8 - I'm 1000% sure of this because I've double checked it a dozen times. Either way, your answer helped me get it working, so that's good enough for me. Thanks again! – Sean Walsh May 19 '15 at 14:23