My MySQL database connection returns UTF-8 encoded text. I set the PDO attribute MYSQL_ATTR_INIT_COMMAND to:

SET CHARACTER SET utf8

It returns UTF-8 encoded text. But some text in the database is not stored as literal UTF-8 characters; HTML entities such as &auml; are returned as is.

So I need to call utf8_encode again in PHP to get the actual UTF-8 character. It's working fine.

I would like to know whether encoding the text twice has any negative effect, or whether it only affects the non-encoded text like the example above.

Thanks!

Edit:

I am using the following code to get the right characters:

 $val = utf8_encode(addslashes(html_entity_decode(strip_tags($val))));

So what it does is convert the following text from:

<font color=\"#222222\" face=\"arial, sans-serif\" size=\"2\"> Test Event  &nbsp; &nbsp;</font><span style=\"color: rgb(34, 34, 34); font-family: arial, sans-serif; font-size: 13px;\">Pers&ouml;nlichkeit Universit&auml;t&quot;</span>

(This text is coming from the database, after calling the SET CHARACTER SET utf8)

to:

Test Event Persönlichkeit Universität\"
Kevin Rave
  • I can't understand a word of your question. For some reason, everybody seems to think that `utf8_encode()` is a magic function that automatically fixes any encoding issue ever. It isn't; it just converts from ISO-8859-1 to UTF-8. `&auml;` is an HTML entity. All those chars (&, a, u, m, l, ;) are the same in ISO-8859-1 and UTF-8, so `utf8_encode()` does absolutely nothing. Which is not that bad; in other cases it'll just corrupt your data. – Álvaro González Apr 18 '13 at 16:30
  • That's right. I think I need to be more detailed there. I am going to edit the question. – Kevin Rave Apr 18 '13 at 16:34
  • No. I just wanted to know if there will be any negative effect on utf8 encoding the text twice. Plain and simple. – Kevin Rave Apr 18 '13 at 16:42
  • Running it once will already corrupt your data, so running it twice will corrupt it even more. Test with e.g. an € symbol. – Álvaro González Apr 18 '13 at 16:44

1 Answer

&auml; is an HTML entity that probably shouldn't have made it into your database in the first place. It has nothing to do with UTF-8.

If you call utf8_encode on "&auml;", nothing will happen, because those ASCII characters are encoded identically in ISO-8859-1 and UTF-8. You see the character it represents in the browser only because the browser interprets the entity as HTML.
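As a minimal sketch of the point above (and of the € test suggested in the comments), the following compares bytes before and after `utf8_encode()`. The sample strings and variable names are illustrative, not taken from the question's data, and it assumes a PHP version that still ships `utf8_encode()` (it is deprecated as of PHP 8.2).

```php
<?php
// Demonstration only: utf8_encode() is deprecated as of PHP 8.2,
// so deprecation notices are silenced for this sketch.
error_reporting(E_ALL & ~E_DEPRECATED);

$entity  = '&auml;';    // HTML entity: six ASCII bytes
$literal = "\xC3\xA4";  // "ä" already encoded as UTF-8
$euro    = "\x80";      // "€" as a single Windows-1252 byte

// ASCII bytes are identical in ISO-8859-1 and UTF-8, so nothing changes.
$entityOnce = utf8_encode($entity);

// Each byte of the UTF-8 input is treated as an ISO-8859-1 character
// and re-encoded, which is the classic double-encoding ("Ã¤") result.
$literalOnce = utf8_encode($literal);

// 0x80 is not "€" in ISO-8859-1; it maps to the control character U+0080.
$euroOnce = utf8_encode($euro);

var_dump($entityOnce === $entity);   // bool(true)
echo bin2hex($literalOnce), "\n";    // c383c2a4
echo bin2hex($euroOnce), "\n";       // c280, whereas UTF-8 "€" is e282ac
```

So the entity passes through untouched, while text that is already UTF-8 gets corrupted on the first call and corrupted further on every additional call.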

As a normal web app developer, you should never call utf8_encode. You don't actually need ISO-8859-1 to UTF-8 conversion. Firstly, browsers and MySQL do not really support ISO-8859-1: they alias Latin1 and ISO-8859-1 to Windows-1252 for compatibility. Secondly, you can make both the browser and the database send their data in UTF-8, so it is already UTF-8 and no conversion is necessary.
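One way to get UTF-8 from the database without any conversion step is to request it in the PDO DSN, which pdo_mysql honours on PHP 5.3.6 and later. This is only a sketch: the `utf8_mysql_dsn` helper, the host, the database name and the credentials are all hypothetical, and `utf8mb4` is my substitution for the `utf8` used in the question (it needs a MySQL server that supports it).

```php
<?php
// Hypothetical helper: builds a MySQL DSN that asks for a UTF-8 connection.
function utf8_mysql_dsn(string $host, string $dbname): string
{
    // The "charset" DSN option sets the connection character set,
    // so no separate SET CHARACTER SET init command is required.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

$dsn = utf8_mysql_dsn('localhost', 'example_db');
echo $dsn, "\n";

// Usage sketch (needs a real server and credentials, so it is commented out):
// $pdo = new PDO($dsn, 'user', 'password', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```

On the browser side, the matching step is to serve pages with a `Content-Type` header that declares `charset=utf-8`, so that form submissions also arrive as UTF-8.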

You shouldn't convert to HTML entities either. It is unnecessary because UTF-8 can represent all characters.

The data in the database should not be concerned with HTML at all: it should be the canonical, authoritative, as-is representation of the data. Right now it is unclear whether the stored data literally means &auml; or ä, which causes problems like this:

[Image from TheDailyWTF]
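If rows that already contain entities need a one-time cleanup, a sketch like the following decodes them straight to UTF-8 with no `utf8_encode()` step. The sample string is modelled on the text in the question, and the variable names are illustrative. Before PHP 5.4 the default charset of `html_entity_decode()` was ISO-8859-1, which is likely why the extra `utf8_encode()` call appeared to be needed.

```php
<?php
// Sample value modelled on the question's text; not real data.
$stored = 'Pers&ouml;nlichkeit Universit&auml;t&quot;';

// Passing the charset explicitly makes the output UTF-8
// regardless of the PHP version's default.
$clean = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');

echo $clean, "\n";            // Persönlichkeit Universität"
echo bin2hex('ö'), "\n";      // c3b6 when this file is saved as UTF-8
```

Once the stored values are literal UTF-8, escaping for HTML belongs at output time (for example with `htmlspecialchars()`), not in the database.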

Esailija