5

I have the following test script on my server:

<?php
echo "Test is: " . $_GET['test'];
?>

If I call it with a url like example.com/script.php?test=ɿ (ɿ being a multibyte character), the resulting page looks like this:

Test is: É¿

If I try to do anything with the value in $_GET['test'], such as saving it to a MySQL database, I have the same problem. What do I need to do to make PHP handle this value correctly?

takteek

3 Answers

4

Have you told the user agent your HTTP response is UTF-8?

header ('Content-type: text/html; charset=utf-8');

You might also want to ensure your HTML markup declares the encoding, e.g.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
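Putting the two declarations together, a minimal sketch of the test script might look like the following. The `?? ''` default and the `htmlspecialchars()` call are additions for illustration and are not part of the original script; the variable names are assumptions.

```php
<?php
// Declare the encoding in the HTTP header. This must run before any output.
header('Content-Type: text/html; charset=utf-8');

// Assumption for illustration: fall back to an empty string when the
// parameter is missing, and escape the value before echoing it into HTML.
$test = $_GET['test'] ?? '';
$safe = htmlspecialchars($test, ENT_QUOTES, 'UTF-8');

// Repeat the declaration in the markup so it survives when the headers are lost.
echo '<!DOCTYPE html><html><head>',
     '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
     '</head><body>Test is: ', $safe, '</body></html>';
```

Sending both declarations is redundant on purpose: the header covers the live response, and the meta element covers a copy of the page saved without its headers.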

For your database, are your tables and mysql client settings set up for UTF-8? If you check your database using a mysql command line client, is your terminal environment set up to expect UTF-8?

In a nutshell, you must check every step: from the raw source data, the code which touches it, the storage systems which retain it, and the tools you use to display and debug it.
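One way to check a single step is to look at the raw bytes rather than at how some tool renders them. The sketch below assumes the mbstring extension is available; the variable names are made up for illustration. It also reproduces the É¿ from the question by deliberately treating the UTF-8 bytes as ISO-8859-1.

```php
<?php
// "ɿ" (U+027F) as raw UTF-8 bytes, written as escapes so the example
// does not depend on the encoding of this source file.
$value = "\xC9\xBF";

// Look at the bytes themselves: two bytes, 0xC9 0xBF.
echo bin2hex($value), "\n"; // prints c9bf

// Check that the byte sequence is well-formed UTF-8.
var_dump(mb_check_encoding($value, 'UTF-8')); // bool(true)

// Treat each byte as one ISO-8859-1 character and re-encode to UTF-8.
// 0xC9 is "É" and 0xBF is "¿", which is the garbled output in the question.
$misread = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
echo $misread, "\n";
```

If `bin2hex()` shows `c9bf` at a given step, the data is still intact there and the problem lies in how a later step interprets or displays it.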

Paul Dixon
  • If the document is stored in another retrieval system, the original HTTP headers are lost - for example, if you save the HTML to a local hard disc. – Paul Dixon Jan 30 '10 at 13:24
  • Yeah, I mean what's the point in using the first `header()` call? The meta tag does the same. – Alix Axel Jan 30 '10 at 13:26
  • If the default_charset ini parameter is set, PHP sends a Content-Type header including the charset. HTTP clients (usually) prefer the HTTP header over the http-equiv setting. So you might want to avoid ambiguities/errors caused by different ini settings and make the charset explicit in both the HTTP header and the meta/http-equiv element. – VolkerK Jan 30 '10 at 13:30
  • I would say simply "because you can", but there may be more justification beyond that :) One thing it does allow you to do is probe the content type with a HEAD request. – Paul Dixon Jan 30 '10 at 13:31
  • VolkerK put it better than I, +1 to that! – Paul Dixon Jan 30 '10 at 13:32
  • Adding that header causes it to display correctly on the resulting page, but doesn't help my database problem. What do I need to do other than set the collation to utf8_unicode_ci on the table and database? (and column) – takteek Jan 30 '10 at 13:32
  • @takteek: that depends a bit on the API you're using to connect to the MySQL server. If you're using mysql_connect() (i.e. the php-mysql extension), search Stack Overflow for mysql_set_charset() – VolkerK Jan 30 '10 at 13:38
  • Ah, mysql_set_charset('utf8') fixed everything. Thanks. I think this was probably a case where I would have found that if I just looked for 5 more minutes. I got impatient since it's 5:30 AM. :) – takteek Jan 30 '10 at 13:42
  • After a restful ...nap you might be interested in http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html to learn more about what mysql_set_charset() does and why `SET NAMES 'utf8'` is not the whole story when using mysql_query() (it doesn't notify the client lib about the change of character encoding, which may -in rare cases- lead to wrong results of mysql_real_escape_string()). I _guess_ `SET NAMES` is safe when using prepared statements (mysqli, pdo). – VolkerK Jan 30 '10 at 13:46
  • @VolkerK: +1 Wow, my foundations just got shattered... Mind commenting on http://stackoverflow.com/questions/1933411/mysql-and-utf-8 please? – Alix Axel Jan 30 '10 at 14:06
  • @Alix: To glue the shards of your foundation back together again, I'm not even sure if this can be exploited when switching the conn.charset from latin1 to _utf8_. It may be, but I guess it's not time for _panic(!)_, yet. ;-) Chris Shiflett used the GBK (simplified Chinese) charset for his demo. Take a look at http://ilia.ws/archives/103-mysql_real_escape_string-versus-Prepared-Statements.html which is a "reply" to the addslashes vs. real_escape_string() article. mysql_set_charset() was introduced with PHP 5.2.3 (31-May-2007), after both articles (the latter was published January 22, 2006). – VolkerK Jan 30 '10 at 15:10
  • @VolkerK: Thanks! From what I've understood it seems that it's safe to use UTF-8, they also only seem to mention the danger of the `SET CHARACTER SET` query, they don't go into much detail about `SET NAMES`. – Alix Axel Jan 30 '10 at 15:27
  • Neither SQL statement informs the client lib. I only have a basic understanding of UTF-8 and don't _know for sure_ whether this can be exploited when switching from latin1 to utf8 or not. But since calling set_charset() instead of mysql_query('SET...) doesn't introduce more complexity and closes a potential hole I'd definitely prefer the safe route here. I prefer prepared statements anyway ;-) – VolkerK Jan 30 '10 at 15:41
1

UTF-8 all the way through…


Follow the steps, specifically:

  • SET NAMES 'utf8' upon connection to the MySQL DB
  • <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in your HTML
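As the comments under the first answer point out, setting the charset through the client API is preferable to sending `SET NAMES` as a plain query, because the API call also informs the client library. A minimal sketch with mysqli follows; the function name `open_utf8_connection` and its parameters are hypothetical, and the credentials you pass in are your own.

```php
<?php
// Hypothetical helper: opens a mysqli connection and switches it to UTF-8.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $conn = new mysqli($host, $user, $pass, $db);
    if ($conn->connect_error) {
        throw new RuntimeException('Connect failed: ' . $conn->connect_error);
    }

    // set_charset() changes the connection encoding and tells the client
    // library about it, which a "SET NAMES" query alone does not do.
    // 'utf8' matches the thread; newer MySQL servers also offer 'utf8mb4'
    // for characters that need four bytes.
    if (!$conn->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }

    return $conn;
}
```

The function is only defined here, not called, so it needs a reachable MySQL server and the mysqli extension once you actually use it.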
Alix Axel
0

If you paste a URL that contains high UTF-8 characters into the browser, the browser will recode those characters into a multibyte sequence compliant with RFC 3986, and you won't get UTF-8 characters in PHP.

However, PHP will receive and display UTF-8 characters from the URL correctly if the page that calls your URL is UTF-8 encoded.

To test this, try calling your PHP script like this:

<iframe src="http://example.com/script.php?test=ɿ" height="100" width="100" border="1"></iframe>
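To see what such a sequence looks like on the wire, here is a small sketch. It assumes the character is encoded as UTF-8 before percent-encoding; the variable names are illustrative only.

```php
<?php
// "ɿ" (U+027F) as raw UTF-8 bytes.
$value = "\xC9\xBF";

// Percent-encode it per RFC 3986: one %XX triplet per byte.
$encoded = rawurlencode($value);
echo $encoded, "\n"; // prints %C9%BF

// Decoding the triplets gives back the same two bytes.
$decoded = urldecode($encoded);
var_dump($decoded === $value); // bool(true)

// A link built this way does not depend on the encoding of the calling page.
echo 'script.php?test=' . $encoded, "\n";
```

PHP applies this decoding to `$_GET` automatically, so under the stated assumption the script receives the original bytes and only has to declare the right charset when printing them.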
seven