20

Which extensions would you recommend, and how should PHP best be configured, to create a website that uses UTF-8 encoding for everything? For example:

  • Page output is UTF-8
  • Forms submit data encoded in UTF-8
  • Internal processing of string data (e.g. when talking to a database) is all in UTF-8 as well.

It seems that PHP does not really cope well with multibyte character sets at the moment. So far I have worked out that mbstring looks like an important extension.

Is it worth the hassle?

Rik Heywood
  • I've successfully been using standard PHP installations with UTF-8 source files generating UTF-8 output including special UTF-8 chars like ♕ ⚐ and ✔ since 4.1.x. :) – Pascal Oct 22 '09 at 08:44
  • Getting correct UTF-8 output doesn't prove that your code is parsing **input** correctly and secured against malicious sequences. – Pacerier Oct 27 '14 at 08:03
  • **Update** Throughout this Q&A, consider using `utf8mb4` in MySQL instead of `utf8`. (Contrast, the non-MySQL term `UTF-8`.) – Rick James Jan 15 '18 at 18:57

6 Answers

58

The supposed issues PHP has with Unicode content have been somewhat overstated. I've been building multilingual websites since 1998 and never knew there might be an issue until I read about it somewhere, many years and websites later.

This works just fine for me:

Apache configuration (in httpd.conf or .htaccess)

AddDefaultCharset utf-8

PHP (in php.ini)

default_charset = "utf-8"
mbstring.internal_encoding=utf-8
mbstring.http_output=UTF-8
mbstring.encoding_translation=On
mbstring.func_overload=6 

MySQL

Create your database with a utf8_* collation, let the tables inherit the database collation, and start every connection with "SET NAMES utf8".

HTML (in HEAD element)

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
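
Since `mbstring.func_overload` is deprecated as of PHP 7.2 and removed in PHP 8.0, one option is to call the `mb_*` functions explicitly wherever character (rather than byte) semantics matter. A minimal sketch, assuming the mbstring extension is loaded; the sample string is only an illustration:

```php
<?php
// Sample UTF-8 string: three characters, three bytes each.
$text = "日本語";

// Byte-oriented functions count and cut bytes, not characters.
$byteLength = strlen($text);          // number of bytes
$brokenCut  = substr($text, 0, 4);    // ends in the middle of a character

// The mb_* equivalents work on characters in the given encoding.
$charLength = mb_strlen($text, 'UTF-8');
$safeCut    = mb_substr($text, 0, 2, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // is the byte cut still valid UTF-8?
var_dump($safeCut);
```

The byte-based cut can leave a truncated sequence behind, which is one way the "black diamond" characters end up in output.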
djn
  • What does the "SET NAMES utf8" SQL statement actually do? – Rik Heywood Oct 23 '09 at 09:22
  • 1
    Straight from the MySQL docs: " A SET NAMES 'x' statement is equivalent to these three statements: SET character_set_client = x; SET character_set_results = x; SET character_set_connection = x;" This is handy because no matter which charset you use to store the data, the data still has to travel to and from PHP. One might never notice a problem while using a single computer (as in HTML FORM -> MySQL -> page), but using a devel machine to populate a db and moving it to the prod server to output it is risky, as the two may well have different client charsets. SET NAMES means portability. – djn Oct 23 '09 at 22:04
  • Can you still use PHP's string functions or you have to use the `mb_` ones ? – marco-fiset Nov 30 '12 at 15:37
  • Here's how I created my database: `CREATE DATABASE CHARACTER SET utf8 COLLATE utf8_general_ci;` – nc. Jul 14 '14 at 07:42
  • Do not use `set names` because it doesn't update the charset used for real_escape_string. See http://stackoverflow.com/questions/1317152/am-i-correctly-supporting-utf-8-in-my-php-apps#comment-41782034 – Pacerier Oct 27 '14 at 08:01
  • @djn If I could, I'd give you multiple +1s! Thanks! – EdwinW Feb 21 '15 at 14:54
  • @djn Can you please explain what does `mbstring.func_overload=6` do? I couldn't find the value 6 in here: http://php.net/manual/en/mbstring.overload.php – maxxon15 Mar 30 '15 at 16:08
  • What is [`mbstring.func_overload=6`](http://php.net/manual/en/mbstring.overload.php)? `6` isn't even listed as an option. – Geoffrey Hale Nov 18 '15 at 21:33
  • `mbstring.func_overload = 6` is `mbstring.func_overload = 4` and `mbstring.func_overload = 2` combined, because the 1, 2, 4 options are __bitmasks__. Quoting from the [PHP Docs](http://php.net/manual/en/mbstring.overload.php) that you linked: `To use function overloading, set mbstring.func_overload in php.ini to a positive value that represents a combination of bitmasks specifying the categories of functions to be overloaded.` The page then proceeds to give several examples of combinations. – Mark Baker Sep 14 '16 at 19:25
  • Yes! it works :) Make sure you set UTF-8 everywhere. HTML, PHP, MYSQL..etc. Thanks for answer.. I am going to add my answer for Codeigniter.. – Nono May 26 '17 at 10:47
  • utf8mb4 for MySQL, please – Dmitry Jan 19 '21 at 15:52
  • `mbstring.func_overload=6` has been deprecated. – Lime Oct 20 '21 at 02:30
4

I was facing the same issue with UTF-8 characters. Everything was working on the live and staging servers, but sometimes it broke on my dev machine. The behavior was strange: sometimes characters were encoded properly, but on a random page reload they started breaking, showing diamond characters ('���เห็นอเวิลด์!���') or question marks ('??�เห็นอเวิลด์!???'), or about 85% of the data rendered properly ('เห็นอเวิลด์!?��') while the remaining 15% showed mismatched characters. To fix the issue, I started with this checklist:

1 - Check that the charset header is set in the HTML

2 - Check that the data is saved properly in the MySQL table

3 - Check that MySQL has the proper encoding settings for UTF-8

4 - Check that Apache is set up to deal with the UTF-8 character set

5 - Check that a simple PHP script can echo "เห็นอเวิลด์" so that the output matches the input "เห็นอเวิลด์"

6 - Check that PHP is sending the proper headers

7 - Check that the MySQL query returns the same data, "เห็นอเวิลด์"

8 - Check whether "เห็นอเวิลด์" contains HTML characters, and deal with them properly

9 - Check whether "เห็นอเวิลด์" passes through any HTML encode/decode function

10 - Check that .htaccess is set up to deal with the UTF-8 character set

Go through the whole list above to figure out where things are breaking.

Give this a try (I am using CodeIgniter):

=================================
:: PHP ini Settings::
=================================

default_charset = "utf-8"
mbstring.internal_encoding=utf-8
mbstring.http_output=UTF-8
mbstring.encoding_translation=On
mbstring.func_overload=6 

=================================
:: .htaccess Settings::
=================================

DefaultLanguage en-US
AddDefaultCharset UTF-8

=================================
:: HTML Header Page::
=================================

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

=================================
:: PHP Codeigniter index.php ::
=================================

header('Content-Type: text/html; charset=UTF-8');

=================================
:: Codeigniter config.php ::
=================================

$config['charset'] = 'UTF-8';

=================================
:: Codeigniter database.php ::
=================================

$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';

=================================
:: Codeigniter helper function (optional)
=================================

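if (!function_exists('safe_utf_string')) {
    // Escapes HTML special characters and normalizes the string to valid UTF-8.
    function safe_utf_string($utf8string = '') {
        $utf8string = htmlspecialchars($utf8string, ENT_QUOTES, 'UTF-8');
        // A UTF-8 to UTF-8 conversion replaces any invalid byte sequences.
        return mb_convert_encoding($utf8string, 'UTF-8', 'UTF-8');
    }
}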

And finally, don't forget to say thanks to @djn's answer! :)
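
One thing the checklist above does not cover is whether incoming data is valid UTF-8 in the first place. A hedged sketch of such a check, assuming mbstring is available; `clean_utf8_input` is a hypothetical helper name, not part of CodeIgniter:

```php
<?php
/**
 * Hypothetical helper: returns the string escaped for HTML output,
 * or null when the input is not valid UTF-8.
 */
function clean_utf8_input(string $raw): ?string
{
    // Reject byte sequences that are not well-formed UTF-8.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        return null;
    }
    // Escape HTML special characters, telling PHP the string is UTF-8.
    return htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
}

var_dump(clean_utf8_input("เห็น <b>"));   // valid UTF-8, markup gets escaped
var_dump(clean_utf8_input("\xC3\x28"));   // invalid two-byte sequence
```

Rejecting bad input up front, rather than silently repairing it, is a design choice; an application could equally decide to substitute a replacement character.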

Nono
  • You may need `utf8mb4` instead of `utf8` in MySQL. Can you provide the hex for the characters that became black diamonds? Or the characters that they should have been there? When the hex is 4 bytes: `F0xxyyzz`, utf8 will not suffice; utf8mb4 is required. – Rick James Jan 15 '18 at 18:56
2

PHP copes just fine!

You should set the php.ini "default_charset" parameter to 'utf-8'.

Then make sure that:

<head>
  <meta http-equiv="Content-Type"
    content="text/html; charset=utf-8"
    />

is at the top of every page you serve.

There are a few problem areas:

Databases -- make sure they are configured to use UTF-8 by default, or enter a world of pain.

IDEs/Editors -- a lot of editors don't support UTF-8 well. I normally use vim, which doesn't, but it's never been a big problem.

Documents -- I just spent a whole afternoon getting PHP to read Thai characters out of a spreadsheet. I was eventually successful, but I am still not sure what I did right.
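
For the documents case, one possible cause is that the file uses a legacy single-byte encoding rather than UTF-8. A minimal sketch of converting such data, assuming mbstring is loaded and the source encoding is known; ISO-8859-1 is used here purely as an illustration, and the bytes are a made-up sample rather than real spreadsheet content:

```php
<?php
// Made-up sample: "café" as ISO-8859-1 bytes (0xE9 is "é" in that encoding).
$legacyBytes = "caf\xE9";

// As-is, these bytes are not valid UTF-8.
var_dump(mb_check_encoding($legacyBytes, 'UTF-8'));

// Convert from the known source encoding to UTF-8.
$utf8 = mb_convert_encoding($legacyBytes, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($utf8, 'UTF-8'));
var_dump($utf8);
```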

Jason Aller
James Anderson
2

2018 Update:

Kindly note that these php.ini entries are deprecated (since PHP 5.6, in favor of default_charset):

;mbstring.internal_encoding = utf-8
;mbstring.http_input =
;mbstring.http_output = utf-8

Next ...

PHP - Set utf8 for the following - via a config.php file for your web app

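 ini_set('default_charset', 'UTF-8');
 mb_internal_encoding('UTF-8');
 // Note: since PHP 5.6, iconv_set_encoding() is deprecated for these settings;
 // iconv falls back to default_charset, so the two calls below can be dropped there.
 iconv_set_encoding('internal_encoding', 'UTF-8');
 iconv_set_encoding('output_encoding', 'UTF-8');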

MariaDB / MySQL - Set utf8mb4 on the connection via:

 $mysqli->set_charset('utf8mb4');  // $mysqli is an open mysqli connection

HTML Pages - Set via:

 <meta charset="utf-8">
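
To put the database step into context, a connection might be opened along these lines. This is a sketch only: it assumes the mysqli extension is installed, and the function name, host, credentials and database name are all placeholders.

```php
<?php
/**
 * Hypothetical helper: opens a mysqli connection that uses utf8mb4.
 * All connection parameters are placeholders supplied by the caller.
 */
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    // Make mysqli throw exceptions instead of failing silently.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $mysqli = new mysqli($host, $user, $pass, $db);

    // Sets the connection charset and keeps real_escape_string() in sync,
    // which a plain "SET NAMES" query does not.
    $mysqli->set_charset('utf8mb4');

    return $mysqli;
}

// Example call (placeholder credentials):
// $mysqli = open_utf8_connection('localhost', 'app_user', 'secret', 'app_db');
```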
MarcoZen
1

If mbstring isn't already part of your PHP package, then I would definitely recommend it to you. You'll even want to use it for calculating string lengths ( mb_strlen($string_var, 'UTF-8') ) for form input. Otherwise you won't need anything except valid and proper HTML, a correct HTTP server config (so the server will deliver pages using UTF-8) and a text editor with UTF-8 support (e.g. Notepad++).
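
As a sketch of why this matters for form input, compare the byte count and the character count of a submitted value. The field value and the 10-character limit are made up for illustration, and the mbstring extension is assumed to be loaded:

```php
<?php
// Made-up form value: 8 characters, three of them two bytes long in UTF-8.
$name = "Jörg Müß";

$maxChars = 10; // hypothetical limit for this field

$bytes = strlen($name);              // counts bytes
$chars = mb_strlen($name, 'UTF-8');  // counts characters

var_dump($bytes, $chars);
var_dump($chars <= $maxChars);       // the check a character limit should use
```

A limit enforced with `strlen` would wrongly reject this value, because the byte count exceeds the limit even though the character count does not.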

RSeidelsohn
1

In your php.ini, set

mbstring.internal_encoding = UTF-8
mbstring.encoding_translation = On

so that you don't need to pass an encoding parameter to the mb_ functions every time.
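
Because the `mbstring.internal_encoding` ini entry is deprecated as of PHP 5.6, the same effect can be had at runtime instead. A minimal sketch, assuming mbstring is loaded; the sample strings are only illustrations:

```php
<?php
// Set the default encoding for mb_* functions once, e.g. in a bootstrap file.
mb_internal_encoding('UTF-8');

// The encoding argument can now be omitted.
$length = mb_strlen("naïve");
$upper  = mb_strtoupper("élan");

var_dump(mb_internal_encoding()); // reports the current internal encoding
var_dump($length, $upper);
```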

Tapper
Ben James