
Possible Duplicate:
How to fix double-encoded UTF8 characters (in an utf-8 table)

I see extra characters like â showing because of encoding issues as I found out here - HTML encoding issues - "Â" character showing up instead of " "

I understand that if I declare the page encoding as UTF-8 in a meta tag, these characters will no longer show up in the browser, but I need to strip them from the database for other purposes.

For example:

Text: â†‘ should become Text: ↑

But if I run it through utf8_decode it gives me Text: �??

For every other occurrence of the â character, it converts properly to a blank space. Any help will be appreciated.

cowboybebop
  • You don't need to "strip" the characters, you simply need to handle your strings in the right encoding. Read [Handling Unicode Front To Back In A Web App](http://kunststube.net/frontback/). – deceze Dec 14 '12 at 17:03
  • @deceze The problem is that we moved from a system which did not use UTF-8 encoding in the database to one which does. As a result, old values have an extra character attached to them which newer values do not. So if I happen to compare two such records, a mismatch occurs. – cowboybebop Dec 14 '12 at 17:07
  • Are your values in the database actually broken and they have that extra character hardcoded now, or are you simply not treating the values in their correct encoding? – deceze Dec 14 '12 at 17:09
  • @deceze, values in the database are broken. – cowboybebop Dec 14 '12 at 20:02
  • +1 to @hakre for pointing to the solution – cowboybebop Dec 14 '12 at 20:02
  • It looks like it was originally UTF-8, but you then imported it as Windows-1252 into UTF-8. – hakre Dec 14 '12 at 20:10

1 Answer

You have not shared much information in your question, but according to the sample you gave:

↑ (has been imported as) â†‘

It looks like the data was already stored as UTF-8 in the export file, but on import you declared the file to be Windows-1252 encoded. Each byte of the UTF-8 sequence was therefore read as a separate Windows-1252 character and then encoded into UTF-8 a second time:

↑                                 UTF8: \xE2\x86\x91    UPWARDS ARROW (U+2191)

â  - Windows 1252     \xE2 226    UTF8: \xC3\xA2        LATIN SMALL LETTER A WITH CIRCUMFLEX (U+00E2)
†  - Windows 1252     \x86 134    UTF8: \xE2\x80\xA0    DAGGER (U+2020)
‘  - Windows 1252     \x91 145    UTF8: \xE2\x80\x98    LEFT SINGLE QUOTATION MARK (U+2018)
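
The byte values in the table can be reproduced with a few lines of Python. This is only a sketch of the presumed import mistake, not the actual import path; the helper name `double_encode` is made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Simulate the faulty import: UTF-8 bytes misread as Windows-1252, then stored as UTF-8."""
    utf8_bytes = text.encode("utf-8")      # correct UTF-8, e.g. E2 86 91 for U+2191
    misread = utf8_bytes.decode("cp1252")  # each byte is taken as its own character
    return misread.encode("utf-8")         # those characters are encoded a second time


arrow = "\u2191"               # UPWARDS ARROW
stored = double_encode(arrow)  # the bytes that end up in the table
stored_hex = stored.hex(" ")   # "c3 a2 e2 80 a0 e2 80 98"
```

The three original bytes turn into three characters (U+00E2, U+2020, U+2018), which together take eight bytes once they are encoded as UTF-8 again.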

In MySQL, the Windows 1252 character set is named latin1 (cp1252 West European; the specific differences are documented). For a full list, please see Character Sets and Collations That MySQL Supports.

That is why the solution in the related Q&A works.
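
The same idea can be sketched outside the database, assuming the fix amounts to mapping every character back to its Windows-1252 byte and reading the result as UTF-8. Both function names below are hypothetical. The second function approximates what happens when the text is converted to ISO-8859-1 instead (which is what PHP's `utf8_decode` does) and the result is displayed as UTF-8: characters such as † and ‘ do not exist in ISO-8859-1, so they are replaced by `?`.

```python
def repair_double_encoded(text: str) -> str:
    """Undo one round of UTF-8 -> Windows-1252 -> UTF-8 double encoding."""
    # Map each character back to the single Windows-1252 byte it came from,
    # then read those bytes as the UTF-8 they originally were.
    return text.encode("cp1252").decode("utf-8")


def iso_8859_1_round_trip(text: str) -> str:
    """Convert to ISO-8859-1, then view the bytes as UTF-8 (approximation)."""
    # Characters outside ISO-8859-1 (for example U+2020 and U+2018) become "?".
    raw = text.encode("latin-1", errors="replace")
    # A lone E2 byte is not valid UTF-8, so it shows up as U+FFFD.
    return raw.decode("utf-8", errors="replace")


broken = "Text: \u00e2\u2020\u2018"     # the double-encoded value from the question
fixed = repair_double_encoded(broken)   # "Text: \u2191", i.e. the upwards arrow
failed = iso_8859_1_round_trip(broken)  # "Text: \ufffd??"
```

Note that `repair_double_encoded` raises `UnicodeEncodeError` or `UnicodeDecodeError` for strings that were never double-encoded, so newer, correct rows should be skipped or the exception caught. Python's `cp1252` codec also leaves a handful of byte values undefined, so it may not match MySQL's latin1 in every case.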

hakre