My page often shows things like ë, Ã, ì, ù, à in place of normal characters.

I use UTF-8 for the page header and for the MySQL encoding. How does this happen?

Leonardo
  • You need to add more context: where do these characters show up, what encoding are your tables in, and what does the code that retrieves the data look like? – Pekka Feb 26 '11 at 15:28
  • These are UTF-8 sequences displayed on a website that uses a Latin-1 charset. The best option is to declare the UTF-8 charset in a `<meta>` tag on your pages, or use `header("Content-Type: text/html; charset=utf-8");` at the top of your PHP scripts. I assume this isn't actually the case yet. – mario Feb 26 '11 at 15:37

4 Answers

These are utf-8 encoded characters. Use utf8_decode() to convert them to normal ISO-8859-1 characters.

Ray
  • This may happen to fix the problem at hand, but it is much, much better to get all encodings in the process right in the first place. – Pekka Feb 26 '11 at 15:30
  • I always use utf8_encode() (and mysql_real_escape_string, of course) when sending a string to the database. On the output page I use utf8_decode(). But you say that's wrong. I didn't know that; how would you deal with this? – Ray Feb 26 '11 at 15:33
  • utf8_encode() and utf8_decode() convert data from and to ISO-8859-1. In a modern website setup where the database, the database connection, and the output page encoding are all UTF-8, those conversions are no longer necessary. That is the recommended approach when building PHP projects from scratch. While utf8_decode() would probably fix the problem the OP shows, fixing the problem at its root (if possible) is much preferable. – Pekka Feb 26 '11 at 15:44
  • And you may even need to use it twice – javier_domenech Feb 24 '15 at 10:58

If you see those characters, you probably just didn't specify the character encoding properly. They are what results when a UTF-8 multi-byte string is interpreted with a single-byte encoding like ISO 8859-1 or Windows-1252.

In this case, Ã« would be the byte sequence 0xC3 0xAB, which represents the Unicode character ë (U+00EB) in UTF-8.
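
To make the byte-level mechanics concrete, here is a minimal Python sketch (Python is used only for illustration, and the helper name `misread_utf8` is hypothetical, not something from the original answer). It encodes a character as UTF-8 and then decodes the same bytes with a single-byte encoding:

```python
def misread_utf8(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode those bytes with a single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


# The UTF-8 encoding of U+00EB is the two bytes 0xC3 0xAB.
print("ë".encode("utf-8").hex(" "))  # c3 ab

# Read as ISO 8859-1, each of those bytes becomes its own character.
print(misread_utf8("ë"))  # Ã«
```

The stored bytes are the same in both cases; only the encoding used to read them differs.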

Gumbo
  • How is it encoded with 0xC3 0xAB, which represents the Unicode character ë (U+00EB) in UTF-8? – Leonardo Apr 05 '11 at 21:24
  • The character `ë` has the code point U+00EB in the Unicode character set and is encoded as 0xC3 0xAB in UTF-8. But this byte sequence represents something different when interpreted with a different character encoding. For example, in ISO 8859-1 and Windows-1252 it represents the two characters `Ã` (0xC3) and `«` (0xAB). – Gumbo Apr 06 '11 at 08:09

Even though utf8_decode is a useful solution, I prefer to correct the encoding errors in the table itself. In my opinion, it is better to correct the bad characters themselves than to add "hacks" to the code. Simply run a replace on the affected field in the table. To correct the badly encoded characters from the OP:

update <table> set <field> = replace(<field>, "Ã«", "ë")
update <table> set <field> = replace(<field>, "Ã ", "à")
update <table> set <field> = replace(<field>, "Ã¬", "ì")
update <table> set <field> = replace(<field>, "Ã¹", "ù")

Here, <table> is the name of the MySQL table and <field> is the name of the column in that table. A very good checklist for these commonly mis-encoded Windows-1252-to-UTF-8 characters is the Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters.

Remember to back up your table before trying to replace any characters with SQL!
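
As a rough sketch of where such replacement pairs come from, the following Python snippet derives each garbled sequence from its correct character and applies the same kind of search-and-replace to a string. It is illustrative only: `build_replacements` and `apply_replacements` are hypothetical helper names, and Windows-1252 is assumed to be the encoding that misread the UTF-8 bytes.

```python
def build_replacements(chars: str, wrong_encoding: str = "cp1252") -> dict:
    """Map each garbled sequence (UTF-8 bytes misread as wrong_encoding) to its character."""
    return {c.encode("utf-8").decode(wrong_encoding): c for c in chars}


def apply_replacements(text: str, table: dict) -> str:
    """Replace every garbled sequence in text with the character it stands for."""
    for bad, good in table.items():
        text = text.replace(bad, good)
    return text


table = build_replacements("ëàìù")
for bad, good in table.items():
    # repr() makes the non-breaking space in the garbled form of 'à' visible.
    print(repr(bad), "->", repr(good))

print(apply_replacements("NoÃ«l", table))  # Noël
```

One limitation of this sketch: Python's `cp1252` codec rejects the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a character whose UTF-8 encoding contains one of them raises `UnicodeDecodeError` here.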

[I know this is an answer to a very old question, but I was facing the issue once again. Some old Windows machine didn't encode the text correctly before inserting it into the utf8_general_ci collated table.]

davidkonrad

I actually found something that worked for me. It converts the text to binary and then to UTF8.

Source Text that has encoding issues: If ‘Yes’, what was your last

SELECT CONVERT(CAST(CONVERT(
    (SELECT CONVERT(CAST(CONVERT(english_text USING LATIN1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865) 
USING LATIN1) AS BINARY) USING UTF8) AS 'result';

Corrected Result text: If ‘Yes’, what was your last

My source was wrongly encoded twice, so I had to do it twice. For a single round of mis-encoding, you can use:

SELECT CONVERT(CAST(CONVERT(column_name USING latin1) AS BINARY) USING UTF8) AS res FROM m_translation WHERE id = 865;
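
The same round trip can be sketched outside the database. The Python below is an illustration rather than the method used in the queries above: `undo_mojibake` is a hypothetical helper, and Python's `cp1252` codec is assumed here as a stand-in for MySQL's `latin1`. It re-encodes the garbled text to recover the raw bytes, then decodes those bytes as UTF-8, once per round of mis-encoding:

```python
def undo_mojibake(text: str, rounds: int = 1, wrong_encoding: str = "cp1252") -> str:
    """Reverse `rounds` passes of 'UTF-8 bytes decoded with wrong_encoding'."""
    for _ in range(rounds):
        # Recover the raw bytes, then read them as the UTF-8 they really are.
        text = text.encode(wrong_encoding).decode("utf-8")
    return text


original = "If ‘Yes’, what was your last"

# Simulate the damage: UTF-8 bytes misread as Windows-1252, two times over.
garbled = original
for _ in range(2):
    garbled = garbled.encode("utf-8").decode("cp1252")

# Two repair rounds bring back the original text.
assert undo_mojibake(garbled, rounds=2) == original
print(undo_mojibake(garbled, rounds=2))
```

Running one round too many is not harmless: once the text is correct again, a further pass will usually fail with a decoding error or damage the text, so it is worth checking the result after each round.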

Please excuse me for any formatting mistakes

  • Life saver! Thanks. – Patrick Savalle Oct 20 '21 at 14:59
  • Thanks for this - I just had a case of a garbled text file of translations exported from an old system, that stumped me for a while because it wasn't the usual utf-8 <-> windows-1252/iso-8859 mixup. Your idea helped me discover that the problem was that the source which was originally utf-8 had been mistakenly double 'converted' to utf-8. Opening it in Notepad++ and reading it as utf-8 encoding and converting to ANSI, then reading it as utf-8 encoding again and converting it to ANSI again, and finally reading it as utf-8 encoding, solved it. – Sev Roberts Nov 11 '22 at 16:06