0

I have a small JSON object that I'd like to send to php to put in a mySQL database. Part of the information in the string is html entities. &emdash is giving me problems. It is showing up as â€. There are some other problems with é displaying as é.

I seem to be having some encoding problems. Any idea what could be wrong? Thanks

1252748
  • 14,597
  • 32
  • 109
  • 229
  • 2
    Probably the character set of your database. At a guess, it is set to iso-8859-* and because the data is coming from JSON it will be encoded in unicode. Change your database character set to UTF-8. – DaveRandom Aug 29 '12 at 15:42
  • @DaveRandom Thanks! sorry to be daft, but is that something i set table by table, or database by database, or just for my entire mySQL? – 1252748 Aug 29 '12 at 16:12
  • It can be set on a per-database, per-table or even per-column basis. For this specific task you only need to do it for the column(s) where you are storing this data, but I'd recommend you convert the whole DB if that doesn't clash with anything else. – DaveRandom Aug 29 '12 at 16:17
  • @DaveRandom Under 'operations' in PHPMyAdmin, I see that I have 'collation' set to 'latin1_swedish_ci'. There is an option for UTF-8 in a massive dropdown. Just a bit confused because it's called 'collation' not 'encoding'; is this what I want? – 1252748 Aug 29 '12 at 16:21
  • Yes, but that won't affect the existing columns in the table, it will only affect new ones you create and don't specify an encoding for. Change it there, and also go into the table and edit all your VARCHAR/CHAR/TEXT columns as well. – DaveRandom Aug 29 '12 at 16:34
  • @DaveRandom Cool. Do you wanna submit that as an answer? I think it'll solve my problem totally. – 1252748 Aug 29 '12 at 16:48

1 Answers1

2

Because the data is coming from JSON, it should be encoded in a Unicode character set, the default being UTF-8 [Sources: Douglas Crockford, RFC4627].

This means that in order to store a non-ASCII character in your database, you will either need to convert the encoding of the incoming data to the character set of you database, or (preferably) use a Unicode character set for your database. The most common Unicode character set - and the one I'd recommend you use for this purpose - is UTF-8.

It is likely that your database is set up with one of the latin character sets (ISO-8859-*), in which case you will most likely simply need to change the character set used for your table and it won't break any of your existing data - assuming that you currently have no records that use any characters outside the lower 128. Based on you comments above, you should be able to make this change using phpMyAdmin - you will need to ensure that you change each existing column you wish to alter explicitly, changing the character set of a table/database will only affect new columns/tables that are created without specifying a character set.

When you are outputting data to the client, you will also need to tell it that you are outputting UTF-8 so it knows how to display the characters correctly. You do this by ensuring you append ; charset=utf-8 to the Content-Type: header you send along with text-based content.

For example, at the top of a PHP script that produces HTML that is encoded with UTF-8, you would add this line:

header('Content-Type: text/html; charset=utf-8');

It is also recommended that you declare the character set of the document within the document itself. This declaration must appear before any non-ascii characters that exist within the document - as a result, it is recommended that you place the following <meta> tag as the first child of the <head>:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If you are producing XHTML with an XML declaration at the top, the character set may be declared there, instead of using a <meta> tag:

<?xml version="1.0" encoding="UTF-8" ?>

Remember, the use of a character set definition in the Content-Type: header is not limited to text/html - it makes sense in the context of any text/* family MIME type.

Further reading: What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

Also, make sure you validate your markup.

DaveRandom
  • 87,921
  • 11
  • 154
  • 174
  • Why would the default for my mySQL be set to `latin1_swedish_ci ` anyway? ^^ – 1252748 Sep 04 '12 at 20:18
  • Also, `utf8_unicode_ci` was the closest item in my list to the UTF-8 you suggested. There were actually about as many UTS-8's as there are languages. Though not English strangely. Do you think this is selection is satisfactory. There was also UTF8-bin. Thanks again for your help and great answer! – 1252748 Sep 04 '12 at 20:20
  • @thomas The issue of the default being Swedish is addressed [here](http://stackoverflow.com/questions/3936059/why-does-mysql-use-latin1-swedish-ci-as-the-default) - to sum up, "nobody really knows". If you really want to know what the difference between the various UTF-8 collations is, all the information you need can be found [here](http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html). I personally have better things to do with my life... ;-). The long of the short of it is that if all you are bothered about supporting is western text and "advanced" punctuation, any will do. – DaveRandom Sep 04 '12 at 20:29
  • If you're not worried, I'm not worried. :) Thanks for all the help. This fixed everything. – 1252748 Sep 04 '12 at 20:38