PHP string encoding issue - Extra characters appearing

Question

Possible Duplicate:
How to fix double-encoded UTF8 characters (in an utf-8 table)

I see extra characters like â showing up because of encoding issues, as I found out here: HTML encoding issues - "Â" character showing up instead of " "

I understand that if I set the browser meta encoding to UTF-8, these will not affect anything, but I need to strip these extra characters from the database for other purposes.

For example:

Text: â†‘ should become Text: ↑

But if I run it through utf8_decode it gives me Text: �??

For every other occurrence of the â character, it converts properly to a blank space. Any help will be appreciated.

You don't need to "strip" the characters, you simply need to handle your strings in the right encoding. Read [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/). — deceze, Dec 14 '12 at 17:03
@deceze The problem is that we moved from a system which did not have UTF-8 encoding in the database to one which does. As a result, old values have an extra character attached to them which newer values no longer do. So if I happen to compare two such records, a mismatch occurs. — cowboybebop, Dec 14 '12 at 17:07
Are your values in the database actually broken and they have that extra character hardcoded now, or are you simply not treating the values in their correct encoding? — deceze, Dec 14 '12 at 17:09
It looks like it was originally UTF-8 but you then imported it as Windows-1252 into UTF-8. — hakre, Dec 14 '12 at 20:10

score 1 · Accepted Answer · edited May 23 '17 at 12:21

You have not shared much information in your question, but according to the sample you gave:

↑ (has been imported as) â†‘

This looks like the data had already been stored as UTF-8 in the export file, but during the import you declared the file to be Windows-1252 encoded. It was then re-encoded into UTF-8 a second time.

↑                                 UTF8: \xE2\x86\x91    UPWARDS ARROW (U+2191)

â  - Windows 1252     \xE2 226    UTF8: \xC3\xA2        LATIN SMALL LETTER A WITH CIRCUMFLEX (U+00E2)
†  - Windows 1252     \x86 134    UTF8: \xE2\x80\xA0    DAGGER (U+2020)
‘  - Windows 1252     \x91 145    UTF8: \xE2\x80\x98    LEFT SINGLE QUOTATION MARK (U+2018)
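The byte table above can be reproduced in PHP. This is only a sketch of the suspected faulty import, and it assumes the mbstring extension is available: the three UTF-8 bytes of ↑ are treated as three Windows-1252 characters and re-encoded to UTF-8.

```php
<?php
// Correct UTF-8 bytes for ↑ (U+2191), written as escapes so the
// encoding of this source file does not matter.
$arrow = "\xE2\x86\x91";

// Simulate the faulty import: read the bytes as Windows-1252
// (â, †, ‘) and encode each of those characters as UTF-8.
$mangled = mb_convert_encoding($arrow, 'UTF-8', 'Windows-1252');

echo bin2hex($mangled), "\n"; // c3a2e280a0e28098
```

The eight output bytes are exactly the three UTF-8 sequences listed in the table: \xC3\xA2, \xE2\x80\xA0 and \xE2\x80\x98.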

In MySQL, the Windows-1252 character set is named latin1 (cp1252 West European; the specific differences are documented). For a full list, please see Character Sets and Collations That MySQL Supports.

That is why the solution in the related Q&A works.
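The same reversal can be sketched on the PHP side. utf8_decode fails here because it targets ISO-8859-1, which has no † or ‘, so both become question marks; converting to Windows-1252 instead recovers the original bytes. The helper name repairDoubleEncoded is hypothetical (it is not from the related Q&A), the mbstring extension is assumed, and the validity check is a heuristic meant to leave already-correct values untouched, so test it on a copy of the data first.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252".
function repairDoubleEncoded(string $value): string
{
    // Map every character back to the single Windows-1252 byte it came from.
    $bytes = mb_convert_encoding($value, 'Windows-1252', 'UTF-8');

    // Round-trip check: if a character had no Windows-1252 equivalent,
    // it was substituted and the value will not survive the trip back.
    $roundTrip = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

    // Accept the result only if nothing was lost and the recovered
    // bytes are themselves valid UTF-8; otherwise keep the value as-is.
    if ($roundTrip === $value && mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return $value;
}

// "Text: â†‘" as stored in the database, written as byte escapes.
$stored = "Text: \xC3\xA2\xE2\x80\xA0\xE2\x80\x98";
echo repairDoubleEncoded($stored), "\n"; // Text: ↑
```

Values that were never double-encoded, such as a correctly stored "café" or a plain ASCII string, fail one of the two checks or come back unchanged, so they are returned as they were.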
