
I have a db in UTF-8 encoding with some Latin-1 data mixed in. (I think that is the problem.)

This is how the characters look in the database.

Ä° (should be İ)
è

When I set the header to

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

Then the characters come out as:

 İ
 �

When I remove the header, they come out as they are in the database. I want them to come out like this:

 İ
 è

I'm looking for a way to remedy this in PHP after the fact, if it is possible. I am unable to correct the data itself at this time, which would be the correct thing to do.

Marcel Korpel
Paul Stanley
  • How can you have two different encodings in one db table? – Voitcus Apr 23 '13 at 09:24
  • You need to pick an encoding and stick to it. You can't output a mixture of character sets. Personally I would say the right answer here is to convert your entire database to Unicode and be done with it. If you can't do that for whatever reason, you will need to convert the strings to a single encoding before you output it on the page, and declare that encoding on the page. Again, I recommend you choose Unicode for your output character set. – DaveRandom Apr 23 '13 at 09:25
  • I agree with @DaveRandom. You can add a new column which tells what encoding to use. However, to fill this column, you need to do this manually (or at least manually verify). – Voitcus Apr 23 '13 at 09:26
  • Did you set the UTF-8 encoding when you added that data from the form to the db? – Jenson M John Apr 23 '13 at 09:27
  • Recommended reading: http://kunststube.net/encoding/ http://kunststube.net/frontback/ – DaveRandom Apr 23 '13 at 09:27
  • There are many things to configure in order to use UTF-8. The `<meta>` tag is possibly the most irrelevant one. How exactly do you verify what the actual contents of the DB are? Are you using a MySQL client such as Workbench or HeidiSQL? – Álvaro González Apr 23 '13 at 09:30
  • @ÁlvaroG.Vicario Workbench – Paul Stanley Apr 23 '13 at 09:31
  • Then, if `İ` actually gets stored as `Ä°`, you've most likely forgotten to set the connection encoding in whatever your DB class is. You first need to ensure you **store** data properly. Displaying it comes afterwards. Please read the "front to back" link by DaveRandom. – Álvaro González Apr 23 '13 at 09:40
  • `Ä°` is the ISO-8859-1 representation of `0xC4 0xB0`, which is `İ` if interpreted as UTF-8. In short, the bytes are right, but the interpretation is wrong. – cmbuckley Apr 23 '13 at 10:01
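
The byte-level point in the last comment (the bytes 0xC4 0xB0 read as ISO-8859-1 versus UTF-8) can be checked with a short PHP sketch. It assumes the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// U+0130 (İ) encoded as UTF-8 is the two bytes 0xC4 0xB0.
$utf8 = "\xC4\xB0";

// Reading those same two bytes as ISO-8859-1 gives two characters, "Ä" and "°".
// mb_convert_encoding() re-encodes that misreading as UTF-8 so it can be printed.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";     // c4b0
echo $misread, "\n";           // Ä°
echo bin2hex($misread), "\n";  // c384c2b0
```

The stored bytes never change here; only the encoding they are assumed to be in does.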

4 Answers


Your HTML output needs to be in a single encoding; there is no way around that. This means that content in different encodings needs to be converted to your HTML encoding first. While that is possible to do with iconv or mb_convert_encoding, there are two problems you have to solve:

  1. You need to know (or guess) the current encoding of the content
  2. You need to do this manually, everywhere

For example, a theoretical solution would be to pick UTF-8 as your HTML encoding and then do this for all strings you are going to output:

$string = '...'; // from the database

// If it's not already UTF-8, convert to it
if (mb_detect_encoding($string, 'utf-8', true) === false) {
    $string = mb_convert_encoding($string, 'utf-8', 'iso-8859-1');
}

echo $string;

The code above assumes that non-UTF-8 content is encoded in latin-1, which is reasonable according to your question.
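
One way to avoid repeating that check everywhere is to wrap it in a function. The following is a minimal sketch, not part of the original answer: `to_utf8()` and `e()` are hypothetical helper names, and it keeps the same assumption that anything which is not valid UTF-8 is Latin-1. It uses `mb_check_encoding()` for the validity test.

```php
<?php
/**
 * Hypothetical helper: return $value as UTF-8, treating anything that is
 * not valid UTF-8 as ISO-8859-1 (the same assumption as in the answer).
 */
function to_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

/** Hypothetical helper: convert, then escape for an HTML page declared as UTF-8. */
function e(string $value): string
{
    return htmlspecialchars(to_utf8($value), ENT_QUOTES, 'UTF-8');
}

// Sample values standing in for database rows:
// UTF-8 "İ", Latin-1 "è" (single byte 0xE8), and plain ASCII.
$rows = ["\xC4\xB0", "\xE8", "plain ASCII"];
foreach ($rows as $row) {
    echo e($row), "\n";
}
```

Note the limitation of this kind of guess: a Latin-1 string whose bytes happen to form valid UTF-8 would be passed through unconverted.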

Jon
  • Winner. Does the job just right. Many thanks. Extra credit: is there some kind of check I can do before this? Maybe check for non-alphanumeric characters or something like that. – Paul Stanley Apr 23 '13 at 09:53
  • @Octopi: Check in order to detect what? – Jon Apr 23 '13 at 10:03
  • You can compare the original string with the converted one; if they are different, then there was a special character in the string. – fellowworldcitizen Sep 23 '20 at 21:32

Maybe you should set utf8 as the connection character set, so that the characters are retrieved correctly. The default may not be right for the characters you need.

More details here: mysql_set_charset
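
The `mysql_*` extension that `mysql_set_charset` belongs to was removed in PHP 7.0, so on current PHP the same idea would be expressed through mysqli or PDO. Below is a rough sketch using a PDO DSN; the function name `mysql_dsn()`, the host, the database name and the `utf8mb4` default are all placeholders, not values from the question.

```php
<?php
/**
 * Hypothetical helper: build a PDO MySQL DSN that declares the connection
 * character set. All argument values used below are placeholders.
 */
function mysql_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = mysql_dsn('localhost', 'example_db');
echo $dsn, "\n";

// With real credentials the connection would then be opened like this:
// $pdo = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// The mysqli counterpart of mysql_set_charset() is $mysqli->set_charset('utf8mb4').
```

Whether this alone fixes the output depends on how the rows were written in the first place; it only controls how the server and the script talk to each other.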

Miro Markaravanes

You have to make three things agree in this case. The character encoding of the DB table's content matters less than you might expect, because in MySQL you can set the character encoding used for communication between the DB server and your PHP script. See http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html. If you use SET NAMES / SET CHARACTER SET correctly, you can have the server send you UTF-8 characters regardless.

You need to check the "physical" (byte-level) character encoding of your PHP script file. Set it to UTF-8 in whichever text editor or IDE you use.

You need to use the appropriate HTML header; the one you wrote above is correct.

If all things match properly, the result should be alright.

The only remaining source of trouble is when the text in the DB table was stored with an incorrect character encoding.
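
The byte-level check of the script file can also be done from PHP itself. The sketch below is only an illustration: `describe_bytes()` is a hypothetical helper, and the sample strings stand in for file contents that would normally come from `file_get_contents()`.

```php
<?php
/**
 * Hypothetical helper: describe the byte-level encoding of some file contents.
 */
function describe_bytes(string $bytes): string
{
    // A UTF-8 byte order mark is the three bytes EF BB BF at the very start.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8 with BOM';
    }
    // No byte above 0x7F means the content is plain ASCII.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 0) {
        return 'ASCII only';
    }
    return mb_check_encoding($bytes, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
}

echo describe_bytes("\xEF\xBB\xBFecho 1;"), "\n"; // UTF-8 with BOM
echo describe_bytes("echo 'abc';"), "\n";         // ASCII only
echo describe_bytes("// \xC4\xB0"), "\n";         // valid UTF-8
echo describe_bytes("// \xE8"), "\n";             // not valid UTF-8

// For a real script: describe_bytes(file_get_contents('/path/to/script.php'))
```

A BOM in front of the opening PHP tag is sent to the browser as output, so a file saved as UTF-8 without a BOM is usually the safer choice for scripts.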

Adam Solymos

I know this is an old post, but in case someone comes across this issue, here is what I did to solve the problem.

1) export table(s) to sql

2) open sql with notepad++ or other editor

3) copy all then paste it to a new file with BOM (or notepad and save as unicode)

4) I have this on my exported file:

   /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
   /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
   /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
   /*!40101 SET NAMES latin1 */;

I changed SET NAMES from latin1 to utf8:

   /*!40101 SET NAMES utf8 */;

If you don't have this line, simply add it. Then, in

CREATE TABLE IF NOT EXISTS `table_name` (
  // column names....
) ENGINE=MyISAM AUTO_INCREMENT=301 DEFAULT CHARSET=latin1;

change

DEFAULT CHARSET=latin1;

to

DEFAULT CHARSET=utf8;

Delete the old tables (after backing them up, of course) and import the new file.

It worked for me. Hope that helps.
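
The two textual edits to the dump could also be scripted rather than done by hand. This is a sketch under assumptions: `retarget_dump_to_utf8()` is a hypothetical helper, the dump text is assumed to have already been re-saved as UTF-8 (step 3), and a `COLLATE=latin1_...` clause, if the dump has one, would need the same treatment.

```php
<?php
/**
 * Hypothetical helper: apply the SET NAMES and DEFAULT CHARSET edits
 * to the text of an SQL dump that is already saved as UTF-8.
 */
function retarget_dump_to_utf8(string $sql): string
{
    return str_replace(
        ['SET NAMES latin1', 'DEFAULT CHARSET=latin1'],
        ['SET NAMES utf8', 'DEFAULT CHARSET=utf8'],
        $sql
    );
}

// Small sample modelled on the lines quoted in the answer.
$dump = <<<'SQL'
/*!40101 SET NAMES latin1 */;
CREATE TABLE IF NOT EXISTS `table_name` (
  `id` INT NOT NULL
) ENGINE=MyISAM AUTO_INCREMENT=301 DEFAULT CHARSET=latin1;
SQL;

echo retarget_dump_to_utf8($dump), "\n";
```

For a real dump, the input would come from `file_get_contents()` and the result would be written to a new file, leaving the original export untouched as a backup.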

Michael Eugene Yuen