1

I was storing data in my MySQL database tables using htmlentities() but I see that this is inadvisable, so I'm changing it.

I'm now storing data (using PDO prepared statements) in raw format, and converting it as necessary on output. In some instances I want the data to be output as HTML.

The charset of all my web pages which take HTML output is set as utf-8. However I notice on this post that that poster Mythli remarked that, "There is no need to [input to the database using] htmlspecialchars() as you want to display html and no need to [input to the database using] htmlentities() if the character set of your site is equal to the one you use in your database."

I notice that my database tables seem to have the "default character set" set as latin1.

Should I change it to "utf8" ?

Community
  • 1
  • 1
John Doe
  • 97
  • 2
  • 10

2 Answers2

1

Yes, if you are storing UTF-8 data in your database, you should store them in a UTF-8 table/column. Otherwise, there are two scenarios:

  1. You inform the database that you are going to insert data encoded in UTF-8 (see SET NAMES utf8). If the column the data is supposed to be stored in is set to store latin1, MySQL will treat the encodings correctly and convert your input from UTF-8 to latin1. This results in the loss of all characters that cannot be represented by latin1.
  2. You are not informing the database that you are going to insert UTF-8 data, the database expects latin1 encoded data. In this case the database will misinterpret the incoming data as individual bytes, which happens to work though since MySQL will then simply store individual bytes. They will look like garbage in the database, but the roundtrip to PHP will work transparently.

So, better tell your database that you're going to store UTF-8 in a UTF-8 column and you'll have a clean roundtrip of your data. Also see Handling Unicode Front To Back In A Web App.

deceze
  • 510,633
  • 85
  • 743
  • 889
  • Thank you for that info. I guess it's OK to set the database's charset value on a per-column basis? – John Doe Jun 13 '12 at 13:24
  • I'd set a default for each table, if possible even for the whole database. But yes, in the end the column setting is the deciding factor. – deceze Jun 13 '12 at 13:25
0

Yes, but MySql has several levels at which charset may be different, and collation isn't actually the most important one

Checklist to ensure things work nicely:

1) table column collation (field where you're storing your html) should use utf-8

2) your connection to the database. The following takes care of more than one gotcha:

mysqli_query("SET NAMES 'utf8'");

3) as you say, altering the default charset of the database and table is also a good idea. You will probably desire the collation to be case-insensitive (utf8_unicode_ci)

4) None of this will matter until the day non-latin characters are inserted so there's little point in doing this without testing.

Here's a nice breakdown of which values to be consistent, and a string to test with: http://adviesenzo.nl/examples/php_mysql_charset_fix/

When Iñtërnâtiônàlizætiøn comes out the way it went in, you got it.

Felix Weelix
  • 352
  • 3
  • 7
  • Thanks for the great info. The trouble is with `SET NAMES`is that I'm using PDO statements, and, if I have it right, the correct way for PDO statements is to use `charset=utf8` in the connection string. However, as I understand it, this only works for PHP versions greater than version 5.3.6, but I'm using version 5.2.17, and I can't use an `if` construct to connect based on the MySQL version since I can't know the MySQL version without first connecting to it. Not sure what the outcome would be if I used `$dbh -> exec("set names utf8");` as well as `charset=utf8` in the connection string. – John Doe Jun 13 '12 at 20:25