0

I have a database which has the latin1 default characterset - info obtained by running the following statement:

  SELECT default_character_set_name FROM information_schema.SCHEMATA 

WHERE schema_name = "schemaname";

The default character set for each table and column in this database is set to utf8.

When I look at the data in the tables I can see data is stored as utf8 e.g the currency symbol is stored in the table as €. Similarly apostraphes are stored as ’ etc.

On the web frontend I have the following meta tag and so the characters render correctly.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 

However I'm also seeing a lot of � symbols on the webpage which I don't see inside the database?

When I change the database connection to include the charset utf8 as follows: mysql:host=myhost;dbname=mydatabase;charset=utf8, the diamond symbols disappear but then all the other utf8 characters revert to exactly how they are saved in the database e.g. the symbol renders as € on the webpage?

Why is this happening?

  1. How do I fix this and also change character set to utf8mb4?

Any help appreciated.

* UPDATE *

Tried the following steps:

  1. for the database:

    ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

  2. For each table:

    ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

  3. For each column:

    ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Not sure if Step 3 is necessary since when I do SHOW CREATE TABLE after step 2, whilst the definition doesn't display the column charset it does display the default charset for the table as utf8mb4. As a sanity check I did run step 3 on one of the tables columns but it makes no difference - € is being rendered on the page as € with db connection as follows:

`mysql:host=myhost;dbname=mydatabase;charset=utf8mb4`

I had to run the following on each column I wanted converting which seems to fix some issues

 UPDATE tbl_profiles SET profile =
 convert(cast(convert(profile using latin1) as binary) using UTF8MB4);

but still seeing characters such as Iâm and «Âand ⢠rendered on the webpage

Any ideas?

* UPDATE 2 *

After running steps 1 and 2 above I have a table column as follows:

`job_salary` VARCHAR(150) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',

The following query on this column returns the following result:

SELECT job_salary FROM tbl_jobs WHERE job_id = 2235;

€30,000 plus excellent benefits 

I execute the following statement on this column:

UPDATE tbl_jobs SET job_salary = CONVERT(BINARY(CONVERT(job_salary  USING latin1)) USING utf8mb4);

But I get the following error which means some other record has a invalid utf8mb4

Invalid utf8mb4 character string: '\x8010000 to \x8020000 Per: annum'
adam78
  • 9,668
  • 24
  • 96
  • 207
  • `’` is "Mojibake"; `�` is a "black diamond" -- cf the dup Q&A. – Rick James Jan 12 '20 at 07:11
  • @RickJames thanks for the link to the detailed post, very useful. With regards to the existing data, simply changing the database and table collation doesn't convert the data. So is my updated post the correct way forward to rectify old data? – adam78 Jan 12 '20 at 09:59
  • True -- those are _defaults_ for new tables and columns in the database and table. As for fixing the data, there used to be a link on that page. Here it is: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases. 5 cases are fixable in 5 different ways; you have not provided enough info for me to say which one is correct. – Rick James Jan 14 '20 at 03:54
  • @RickJames please see my updated post. I ran the above steps prior to you providing the link. Will I have double encoded it now? – adam78 Jan 14 '20 at 10:50
  • @RickJames original character set and collation was utf8_general_ci and I need to convert to utf8mb4 – adam78 Jan 14 '20 at 14:22
  • If you are seeing ’ or � with utf8, that needs fixing first. If you are seeing that only after conversion to utf8mb4, then the conversion was incorrectly performed. – Rick James Jan 14 '20 at 19:07
  • Please provide `SELECT col, HEX(col) ...` for some bad text. That will help me diagnose whether the data was stored incorrectly (versus being interpreted incorrectly). And it will probably point me at the correct "fix". – Rick James Jan 14 '20 at 19:12
  • And another thing... Was the table originally `latin1`, then updated to `utf8`? That is usually when the problem starts. – Rick James Jan 14 '20 at 19:13
  • The "RIGHT SINGLE QUOTATION MARK" `’` is these 3 hex bytes: E28099. Interpreting those bytes as latin1, yields `’`. If you repeat this mis-interpretation, you get `’`. Similarly: `€` -->`€` --> `€`. The single-quote may have come from a word processing app. The other is the Euro sign. – Rick James Jan 14 '20 at 19:22
  • @RickJames the table was originally utf8, so why would a `€` sign be saved as `€` which is latin1? – adam78 Jan 14 '20 at 20:56
  • @RickJames Also I've attempted your fix using the mysqldump method. But the `€` sing still ends up being saved as `€` why is that? Is it because the column contains both latin1 and utf8 characters that it's unable to convert? – adam78 Jan 14 '20 at 21:12
  • @RickJames please see my update 2 in original post - why im i getting this error? – adam78 Jan 14 '20 at 21:52
  • I really need to see the hex that is in the table. – Rick James Jan 14 '20 at 23:10

1 Answers1

0

First, let's discuss the Mojibake of the Euro sign. All of this applies to both utf8 and utf8mb4, since the Euro is encoded the same way and there is.

It is very likely that the data was initially stored incorrectly. If you can get back to the INSERT program, let's check for:

  • The bytes to be stored need to be UTF-8-encoded. What was the client programming language? Where did the data come from?
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Do you have the connection parameters?
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). It sounds like this was always correct.
  • HTML should start with .

What is currently in the table?

SELECT col, HEX(col) FROM ... WHERE ...

A correctly stored Euro sign () should have hex E282AC. (Interpreting that as latin1 yields €.

If instead, you see hex C3A2E2809AC2AC, you have "double encoding", and the display is probably €.

I have identified several possible fixes, but have not yet determined which applies in your case. The likely candidate is

  • CHARACTER SET utf8mb4 with double-encoding:

To verify it (before fixing it), please do something like:

SELECT col,
       CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4),
       HEX(    
          CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4)
          )
    FROM ...
    WHERE ...

Do not apply a fix on top of another fix. I have struggled for a long time to decipher how character set problems occur and what to do to 'fix' a single problem. But when the wrong fix is applied, I am at a loss to unravel the mess.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I'll have to do some more digging and get back to you. In the mean time how do you prevent user submitting latin1 characters in from input. Even if I have the connection set to utf8mb4 and database/table/column set to utf8mb4 a user submiting `€` will still get saved as `€` in the table and so we are back to square one. This will happen even when you add `accept-charset="utf-8"` to the form. – adam78 Jan 15 '20 at 11:31
  • @adam78 - How do users submit things? Via HTML form? Directly connect to mysql? Some other API? – Rick James Jan 15 '20 at 16:42