28

I have read that mysql >= 5.5.3 fully supports every possible character if you USE the encoding utf8mb4 for a certain table/column http://mathiasbynens.be/notes/mysql-utf8mb4

looks nice. Only I noticed that the mb_functions in php does not! I cannot find it anywhere in the list: http://php.net/manual/en/mbstring.supported-encodings.php

Not only have I read things but I also made a test.

I have added data to a mysql utf8mb4 table using a php script where the internal encoding was set to UTF-8: mb_internal_encoding("UTF-8");

and, as expected, the characters looks messy once in the db.

Any idea how I can make php and mysql talk the same encoding (possibly a 4 bytes one) and still have FULL support to any world language?

Also why is utf8mb4 different from utf32?

nourdine
  • 7,407
  • 10
  • 46
  • 58
  • 1
    possible duplicate of [Manipulating utf8mb4 data from MySQL with PHP](http://stackoverflow.com/questions/13028537/manipulating-utf8mb4-data-from-mysql-with-php) – Nick Andriopoulos Jun 03 '13 at 08:54
  • How to convert variable (text/string) to utf8mb4 in php https://stackoverflow.com/questions/46873897/how-to-convert-variable-text-string-to-utf8mb4-in-php – Mukhyyar Oct 22 '17 at 12:15

4 Answers4

41

MySQL's utf8 encoding is not actual UTF-8. It's an encoding that is kinda like UTF-8, but only supports a subset of what UTF-8 supports. utf8mb4 is actual UTF-8. This difference is an internal implementation detail of MySQL. Both look like UTF-8 on the PHP side. Whether you use utf8 or utf8mb4, PHP will get valid UTF-8 in both cases.

What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4. If it's set to utf8, MySQL will not support all characters. You set this connection encoding using mysql_set_charset(), the PDO charset DSN connection parameter or whatever other method is appropriate for your database API of choice.


mb_internal_encoding just sets the default value for the $encoding parameter all mb_* functions have. It has nothing to do with MySQL.

UTF-8 and UTF-32 differ in how they encode characters. UTF-8 uses a minimum of 1 byte for a character and a maximum of 4. UTF-32 always uses 4 bytes for every character. UTF-16 uses a minimum of 2 bytes and a maximum of 4.
Due to its variable length, UTF-8 has a little bit of overhead. A character which can be encoded in 2 bytes in UTF-16 may take 3 or 4 in UTF-8; on the other hand, UTF-16 never uses less than 2 bytes. If you're storing lots of Asian text, UTF-16 may use less storage. If most of your text is English/ASCII, UTF-8 uses less storage. UTF-32 always uses the most storage.

deceze
  • 510,633
  • 85
  • 743
  • 889
  • "What you need to make sure is that the connection encoding between PHP and MySQL is set to utf8mb4". Right ... I have not done that! That must be the reason. – nourdine Jun 03 '13 at 10:34
  • Is it up to four or six? Joel Spolsky's article on this says 6: http://www.joelonsoftware.com/articles/Unicode.html – Tommy Aug 20 '15 at 19:49
  • @Tommy That's old and deprecated information. 6 isn't used in practice and has been discontinued. – deceze Aug 20 '15 at 19:52
  • so UTF-8 changed? It wasn't a standard? (I'm trying to understand this mess all now) – Tommy Aug 20 '15 at 19:54
  • The problem is with this is that if we're stuck with mysql UTF8, are we not unable to validate strings in PHP for MySQL without using the database? Wont a stringthat reads as valid UTF8 in PHP potentially have unrecognised sequences nuked in mysql? – jgmjgm Oct 10 '17 at 09:40
  • @jgmjgm You should know whether your MySQL instance supports `utf8mb4` or not. That's an important dependency and needs to be nailed down for your project. If it does support `utf8mb4`, then there should be no issue. If you only have `utf8` to work with, your application only supports BMP characters and you need to enforce/ensure that somewhere in your application layer. Or you switch to BLOB storage in MySQL and deal with the fallout of *that* in your application layer. Either way you should never be in a situation to leave that up to chance. – deceze Oct 10 '17 at 10:03
  • deceze, I refer to the situation where you have utfmb3 in mysql and it's not anything you can change quickly. There's no utf8mb3 validation in PHP. 'you need to enforce/ensure that somewhere in your' that's the annoyance. BMP and simply contraining to 3 bytes achieve the same. It's a kudgle where you'll likely want to use your own validation rather than purely mb_check_encoding. It can be difficult to switch to mb3 in mysql not only due to it being a large code change but it can also cause fields to hit index length limit. – jgmjgm Oct 13 '17 at 14:38
12

This is what i used, and worked good for my problem using euro € sign and conversion for json_encode failure.

php configurations script( api etc..)

header('Content-Type: text/html; charset=utf-8');
ini_set("default_charset", "UTF-8");
mb_internal_encoding("UTF-8");
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-8");

mysql tables / or specific columns

utf8mb4

mysql PDO connection

$dsn = 'mysql:host=yourip;dbname=XYZ;charset=utf8mb4';

(...your connection ...)

before execute query (might not be required):

$dbh->exec("set names utf8mb4");
Miguel
  • 3,349
  • 2
  • 32
  • 28
  • 3
    This isnt accurate though. If you are using 'set names utf8' then your connection is speaking to mysql in just a subset of *real* utf8. You need to 'set names utf8mb4' to speak in the full utf8 character set (including passing emojis). Otherwise mysql will often truncate a string at the point it encounters a utf character which is > 3 bytes – carpii Jan 17 '16 at 22:30
  • Super helpful. All I had to do was to set the `charset` to `utf8mb4` instead of `utf8`. Now my mysql database reads and writes emojis with no issues – Vahid Amiri Jul 01 '17 at 19:29
  • setting iconv encoding seems to be deprecated now. – Bimal Poudel Feb 09 '18 at 16:03
4
  • utf-32: This is a character encoding using a fixed 4-bytes per characters
  • utf-8: This is a character encoding using up to 4 bytes per characters, but the most frequent characters are coded on only 1, 2 or 3 characters.

MySQL's utf-8 doesn't support characters coded on more than 3 characters, so they added utf-8mb4, which is really utf-8.

Arnaud Le Blanc
  • 98,321
  • 23
  • 206
  • 194
3

Before running your actual query, do a mysql_query ('SET NAMES utf8mb4')

Also make sure your mysql server is configured to use utf8mb4 too. For more information on how, refer to article: https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

James Gwee
  • 331
  • 2
  • 4