0

I have a database table called "tweets". I have downloaded tweets using the Twitter Search API. The database table collation is set to latin1_swedish_ci, as advised by MySQL, since it holds up for at least the Enlish language (I read this on the MySQL support pages somewhere).

Anyway, I now see a lot of tweets looking like this:

$S&P news: Bank of America’s Mortgage-Bond Accord http://bit.ly/oTXC5a
@LucciAlerts >> $BAC from a pincher play setup
ë¯¸êµ­ì‹ ìš©ë“±ê¸‰ì´ë–¨ì–´ì¡Œë„¤ RT @CNBC  RT @alexcrippen: S&P affirms AA+
  • I believe & and the likes can be fixed by using (PHP) htmlspecialchars_decode() to translate them to original characters;
  • But I don't know how to fix "America’s", for instance. Obviously, ’ should be an apostophe ('), but how do I get it back?
  • Finally, there are some people who like to put all sorts of ASCII characters in their tweets (stars, "real" smileys instead of emoticons). Those have been stored as "ë¯¸êµ­ì‹ ìš©ë“±ê¸‰ì´ë–¨ì–´ì¡Œë„¤", like above. Is there a way to fix this and if so, how?

Any help is greatly appreciated!

Pr0no
  • 3,910
  • 21
  • 74
  • 121
  • 1
    recommend a quick switch from latin1_swedish_ci to utf8 – Mark Baker Feb 08 '12 at 14:14
  • 2
    Set your charset to `utf8` and try again. – Leigh Feb 08 '12 at 14:14
  • See, for instance, https://api.twitter.com/1/statuses/show.json?id=38041113268854784. The Tweet text content ("text") given by the API is: "$\u00aa\u00aa \u264f\u00aa\u00aa ^ $\u00aa\u00aa \u264f\u00aa\u00aa RT nadiaberod: makasih (\u007B\u007D) RT Lsa_fabulous: ThRT nadiaberod: anind_aurelly Lsa_fabulous denyciouss @nuruuuuuuuul ??". How then, do you get the original characters? Passing it through utf8_decode() doesn;t help... – Pr0no Feb 08 '12 at 14:53
  • 1
    @Reveller: [Check out this question](http://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-cha) – Leigh Feb 08 '12 at 14:59

1 Answers1

3

This looks like utf-8 double encoding in iso-8859. I've seen this many times, but never so f**ed up like this.

Fix your application

To hold all those special characters received from twitter API, your database should be in utf8 collation, not latin1. using latin1_swedish_ci was a very bad advice.

Also, the database connection of your PHP application may not be correctly set up to use utf8.

read on here: http://www.php.net/manual/en/function.mysql-set-charset.php the comments in the PHP manual are always worth reading.

if you use zend or some else library, usually there is a supported way to tell PDO or whatever database layer to use utf8 in the db connection.

If you can, drop all content and start populating the database again after converting the collation and your client character set tu utf8.

Fix the data

If you can not drop all the content, you have to fix the ascii representation of your double or triple encoded utf8 strings directly in the database.

"fixing" this data up is a complicated task .. it's easiest to retrieve a dump of it using a latin1 collation database connection and writing it back using a utf-8 database connection. but in your case, the data seems to be triple- or quadruple-encoded in UTF-8.

read on here: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

protip: before you start playing with your data store two dumps of it aside. one with iso8859 client charset and another with utf8 client charset.

Now you can try to fix it up. Also try to convert the database table to UTF8 using mysql CONVERT before you do the procedure described in the linked blog post.

Kaii
  • 20,122
  • 3
  • 38
  • 60