I'm subscribing to an XML feed to populate some data in a webpage. The charset is UTF-8 in the database and is also set to UTF-8 in the meta tags of the actual pages.

However, when I publish the feed data, it comes out with odd characters like:

â€™ instead of '.

I realize that the feed is probably using a non-UTF-8 encoding for the text. However, I don't know how to determine that, and the next feed I look at may use yet another encoding.

How do I ensure that the data coming from the feed is correctly encoded as UTF8 before getting stored in the DB?

Thanks

user101289
  • you could try replacing apostrophes with `'` that worked out for me – iam-decoder Jul 14 '15 at 17:41
  • The following Q&A might be interesting for you: [How to detect malformed utf-8 string in PHP?](http://stackoverflow.com/q/6723562/367456). – hakre Jul 15 '15 at 18:12

1 Answer

How do I ensure that the data coming from the feed is correctly encoded as UTF8 before getting stored in the DB?

Write it to a file and view it in a web browser, or just view the feed address directly in a web browser. If you see â€™ in the web browser then the feed itself is mis-encoded.

The character ’ (U+2019 Right Single Quotation Mark) in UTF-8 is the byte sequence 0xE2, 0x80, 0x99, which if misinterpreted as Windows code page 1252 comes out as â€™. In principle, to reverse the damage you could try encoding your extracted text as cp1252 and re-interpreting the resulting bytes as UTF-8:

iconv('utf-8', 'windows-1252', $dodgy_str)

This works for ’, but if the original UTF-8 byte sequence contains bytes that have no mapping in cp1252 (0x81, 0x8D, 0x8F, 0x90 and 0x9D), then the original content for those sequences is unrecoverable. A much better approach would be to contact whoever is providing the faulty feed and get them to fix it.
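As a rough sketch of that round trip, the snippet below first reproduces the damage on U+2019 and then tries to undo it. The helper name `repair_mojibake` is made up for illustration, and the validity check via `preg_match('//u', ...)` is my addition rather than part of the answer above. It assumes the `iconv` extension is available and that the string really was mis-decoded as cp1252 exactly once.

```php
<?php
// U+2019 as UTF-8 bytes, written as escapes so this file's own encoding does not matter.
$original = "\xE2\x80\x99";

// Simulate the damage: read each byte as a cp1252 character and re-encode as UTF-8.
$dodgy_str = iconv('windows-1252', 'utf-8', $original);

// Hypothetical helper: returns the repaired string, or null when the repair looks unsafe.
function repair_mojibake(string $s): ?string
{
    // iconv returns false (and raises a notice, suppressed here) if a character
    // in $s has no cp1252 equivalent.
    $bytes = @iconv('utf-8', 'windows-1252', $s);

    // Only accept the result if the recovered bytes form valid UTF-8.
    if ($bytes === false || preg_match('//u', $bytes) !== 1) {
        return null;
    }
    return $bytes;
}

$fixed = repair_mojibake($dodgy_str);
echo $fixed === $original ? "repaired\n" : "not repairable\n";
```

Returning null instead of a best-effort string keeps text that was already correct from being altered: a properly encoded é, for example, becomes the single byte 0xE9 in cp1252, which is not valid UTF-8, so the helper rejects it. Plain ASCII passes through unchanged.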

If, on the other hand, the browser renders it OK, the problem lies somewhere in your parsing of the XML or connection to the database.
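To narrow down where the corruption is introduced in that case, one option is to dump the raw bytes at each stage (straight from the feed, after XML parsing, after reading back from the database) and compare them. The helper below is only an illustrative sketch and `describe_encoding` is a made-up name. It relies on `preg_match` with the `u` modifier failing on input that is not valid UTF-8.

```php
<?php
// Hypothetical debugging helper: say whether a string is valid UTF-8 and show its bytes in hex.
function describe_encoding(string $s): string
{
    $valid = preg_match('//u', $s) === 1 ? 'valid UTF-8' : 'NOT valid UTF-8';
    return $valid . ': ' . bin2hex($s);
}

// Correctly encoded U+2019 is the three bytes e2 80 99.
echo describe_encoding("\xE2\x80\x99"), "\n";

// The same character as a lone cp1252 byte (0x92) is not valid UTF-8.
echo describe_encoding("\x92"), "\n";
```

If the bytes are e2 80 99 right after parsing but differ once they come back from the database, the connection or column charset is the more likely culprit than the feed.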

bobince