0

I have database table as follows:

CREATE TABLE `tbl_jobs` (
   `job_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
   `job_title` VARCHAR(100) NOT NULL,
   `job_salary` VARCHAR(150) NULL DEFAULT NULL,
   `job_desc` TEXT NOT NULL,
  PRIMARY KEY (`job_id`),
COLLATE='utf8_general_ci'
ENGINE=InnoDB

In one of the records I have the salary stored as €30,000 plus excellent benefits so in the webpage it should render as €30,000 plus excellent benefits. As you can see € is a valid utf8 character.

My database connection is as follows:

`mysql:host=myhost;dbname=mydatabase;charset=utf8`

I have the following meta tag in the head section of my webpage:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

However on the webpage it is still rendering as €30,000 plus excellent benefits. If I change the connection charset to latin1 as follows it renders correctly.

`mysql:host=myhost;dbname=mydatabase;charset=latin1`

Why is it behaving like this considering it is utf8 character?

adam78
  • 9,668
  • 24
  • 96
  • 207

3 Answers3

0

I am guessing that information was saved previously using latin1 charset, so when printing it rendered as utf8, you get the incorrect data.

If the data was saved using utf8, then you would see the correct data on the table as well.

JoeGalind
  • 3,545
  • 2
  • 29
  • 33
  • and not an expert in encoding, is `€` not utf8? If it was utf8 what would it be saved as in the database. The database collation is `utf8_general_ci` so why would it end up as latin1? – adam78 Jan 14 '20 at 18:50
  • what I mean is that € stored in UTF-8 is exactly that. No translation is done. When you try to read it as latin1, that is translated into the € sign. If you save the same data using UTF-8 and you enter the € sign, then when reading it, no translation is made and you get the € sign printed out. – JoeGalind Jan 15 '20 at 14:14
0

In latin1, â is represented by the 0xE2 byte, comma (,) as 0x2C, ¬ as 0xAC.

In Unicode € is U+20AC which is encoded into UTF-8 as 0xE2 0x82 0xAC.

I think what is happening is that when your connection is latin1, the database access is sending 0xE2 0x82 0xAC to the browser which is expecting UTF-8 and rendering it as the Euro symbol, €. However, when you tell the database to use UTF-8 for the connection, it is representing this as the unicode characters U+00E2 (â) U+0082 (BREAK PERMITTED HERE) U+00AC (¬), and sending that to the browser encoded in UTF-8 which it displays as "â,¬".

The solution is to store the Euro symbol (€) in the database, set the connection to UTF-8, and it will display as the Euro symbol on the web page.

Marlin Pierce
  • 9,931
  • 4
  • 30
  • 52
  • in the html form if a user enters `â,¬` and with the database connection set to utf8, the table and column charset set to utf8 why does it still get saved as `â,¬` ? – adam78 Jan 15 '20 at 10:29
  • Reread what you wrote and see if it makes sense to you. You entered `â,¬` and it stored `â,¬`. It did what you told it to. It interpreted `â,¬` as UTF-8 the way you configured the connection, and it stored what you sent it. – Marlin Pierce Jan 15 '20 at 11:08
  • sorry I'm not an expert on encoding. If user entered the Euro symbol in latin1 format, why won't the database convert it into utf-8 if both the connection and table/column charset is set to utf8? – adam78 Jan 15 '20 at 11:19
  • I suppose a follow up question is how do you prevent user submitting latin1 characters `â,¬` when submitting form? – adam78 Jan 15 '20 at 11:23
  • There is no way to enter a string in a form "as latin1". The form will interpret the input in whatever encoding the form uses, which in modern times will be UTF-8. – Marlin Pierce Jan 15 '20 at 11:31
0

You have "Mojibake" or "double encoding". See Trouble with UTF-8 characters; what I see is not what I stored

Do SELECT col, HEX(col) FROM ... to see what got stored. It sounds like the client had encoded in UTF-8. In particular, the 3 bytes whose hex is E282AC. But the connection parameters said that the client was using latin1. So as those 2 bytes were being sent to the server they were translated from the E2 82 AC treated as latin1 (€) into the utf8 equivalent, still €, but now encoded as C3A2 E2809A C2AC.

Rick James
  • 135,179
  • 13
  • 127
  • 222