0

When I save certain characters to my MariaDB database, they are being stored as a string of 2-4 characters.

For example, is saved in the database as â„¢.

From the head of the html page that contains the form I have

<meta charset="utf-8">

When I check the headers sent to the server, it shows

Content-Type:text/html; charset=UTF-8

The PHP file handling the form data has

$mysqli->set_charset('utf8');

I've converted the database and table to UTF8, by doing:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Checking the database and table details in information_schema table I can see that the database and table character set is UTF8 and default collation name is utf8_general_ci.

But the wrongly encoded characters are still being saved to the db.

I'm viewing the table data with terminal, so there should be no issue with the encoding on the viewing side.

What else am I missing?

== EDIT ==

In response to @rick's comments - SHOW CREATE TABLE search_templates;:

search_templates | CREATE TABLE `search_templates` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `modified` timestamp NULL DEFAULT NULL,
  `template` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `space` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `url` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `secure` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=29 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

and SELECT title, HEX(title) FROM search_templates WHERE id = 25;:

+-----------------------+--------------------------------------------+
| title                 | HEX(title)                                 |
+-----------------------+--------------------------------------------+
| Kiddicare abc â„¢     | 4B69646469636172652061626320C3A2E2809EC2A2 |
+-----------------------+--------------------------------------------+

When passed through hex-to-text converter it shows: Kiddicare abc â„¢.

So these characters are being saved to the DB like this, does this make the issue any clearer?

== EDIT ==

On advice from @rick (with freshly added data, as previously title column got wiped):

SELECT title, CONVERT(BINARY(CONVERT( CONVERT(UNHEX(title) USING utf8) USING latin1)) USING utf8) FROM search_templates LIMIT 3;

->

+--------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
| title                                                                          | CONVERT(BINARY(CONVERT( CONVERT(UNHEX(title) USING utf8) USING latin1)) USING utf8) |
+--------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
| Kids' Games & Toys » Toy Shop » Toys R Us                                    | NULL                                                                                |
| Kiddicare - Pushchairs | Prams | Car Seats | Baby Equipment | Online Baby Shop | NULL                                                                                |
| Kiddicare â„¢                                                                  | NULL                                                                                |
+--------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+
Fisu
  • 3,294
  • 9
  • 39
  • 61
  • To help get to the bottom of it: `SHOW CREATE TABLE` and `SELECT col, HEX(col) FROM table WHERE ...` – Rick James Jun 29 '15 at 20:52
  • Updated my question - at least I now know that it's being converted before being saved to the DB, but am I not already ensuring that everything is set to UTF8 before going to the DB? – Fisu Jul 01 '15 at 05:57
  • Oh, if you can re-add the data, then let's look at how to be "utf8" at all steps. Start by studying http://stackoverflow.com/questions/279170/utf-8-all-the-way-through/279279#279279 – Rick James Jul 03 '15 at 15:06
  • Is the data encoded utf8? Is the table (or at least `title`) `CHARACTER SET utf8`? Is the connection utf8 (`SET NAMES utf8` or equivalent)? – Rick James Jul 03 '15 at 15:08
  • If the file you are loading from has `C3A2E2809EC2A2` for TradeMark, then we need to fix it. And that might come full circle to the convoluted `CONVERT` I gave you. -- But that might need to be fixed _before_ inserting into `title`. – Rick James Jul 03 '15 at 15:10
  • I've been through the linked question, and no closer to a solution. The `™` symbol along with many others that are getting scrambled: `¡”¥¢‰¶`, are coming from my keyboard when I type them into the form that's on a HTML page with ``, being sent to the server with UTF8 headers and the handling PHP script sets the correct charset. So I'm not sure what else to try. – Fisu Jul 05 '15 at 07:27

1 Answers1

0
CONVERT(BINARY(CONVERT(
    CONVERT(UNHEX('C3A2E2809EC2A2') USING utf8)
    USING latin1)) USING utf8) ==> '™'

You have a "double-encoded" TradeMark at the end of that string.

This is a duplicate of

Updating data in MySQL database after inserting in the wrong encoding

Edit

In context, avoid the innermost UNHEX; do

CONVERT(BINARY(CONVERT(
    CONVERT(title USING utf8)
    USING latin1)) USING utf8)

Example

CONVERT(BINARY(CONVERT(
    CONVERT('Kiddicare â„¢' USING utf8)
    USING latin1)) USING utf8)
-->  Kiddicare ™

However, if you can reload the data and if the data is utf8-encoded (which it seems not to be), then having utf8 all the way through is the 'right solution'.

Community
  • 1
  • 1
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • So would the full code to convert the column be `UPDATE search_templates SET title = CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C3A2E2809EC2A2') USING utf8)USING latin1)) USING utf8) ==> '™';` as that's throwing an error. I assume this will convert any future `™` symbols, but what about other symbols that get put into the database. How do I ensure that they won't get scrambled? – Fisu Jul 02 '15 at 05:12
  • No. `UPDATE search_templates SET title = CONVERT(BINARY(CONVERT( CONVERT(UNHEX(title) USING utf8) USING latin1)) USING utf8);` But _please_ test it before applying it to the entire table. – Rick James Jul 02 '15 at 05:22
  • That empties the entire `title` column. – Fisu Jul 02 '15 at 06:53
  • What happens with `SELECT title, CONVERT(BINARY(CONVERT( CONVERT(UNHEX(title) USING utf8) USING latin1)) USING utf8) FROM search_templates LIMIT 3;`? – Rick James Jul 02 '15 at 19:40
  • question updated with results from this query. I truncated the table and added data again as title column was empty. – Fisu Jul 03 '15 at 05:52
  • My apologies; there was an unnecessary `UNHEX()` in the expression left over. See edit. – Rick James Jul 04 '15 at 14:37
  • Yes, thanks, that works to convert the column to the correct characters. However, you're right that figuring out where in the chain something other than UTF8 is being used is the correct way to go. – Fisu Jul 05 '15 at 07:08