0

In my database we have fields where the data is not readable. I now know why it happened but I don't know how to fix it.

I found a way to get the info back from the database:

SELECT id,
       name 
  FROM projects 
 WHERE LENGTH(name) != CHAR_LENGTH(name);

One of the rows returned shows:

id   | name
-------------------------
1008 | Cajón el Diablo

This should be:

id   | name
-------------------------
1008 | Cajón el Diablo

Can somebody help me figure out how to fix this problem? How can I convert this using SQL? Is SQL not good? If not, how about Python?

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
  • What is the collation (character set) of the `PROJECTS.name` column? Easy way to check is to run: `SHOW FULL COLUMNS FROM PROJECTS LIKE 'name'` - "collation" should be the 3rd column. http://dev.mysql.com/doc/refman/5.1/en/show-columns.html – OMG Ponies Mar 14 '10 at 18:00
  • @OMG Ponies: you mean "character set", not "collation". Character set defines how characters are represented with bytes, whereas collation defines character sorting order. – Piskvor left the building Mar 14 '10 at 18:40
  • In the collation column, it shows 'utf8_bin.' –  Mar 16 '10 at 02:03

1 Answers1

1

Your mySQL data is most likely UTF-8 encoded.

The tool or client you are viewing the data with is either

  • Not talking to the mySQL server in UTF-8 (SET NAMES utf8)

  • Outputting UTF-8 characters in an environment that has an encoding different from UTF-8 (e.g. a web page encoded in ISO-8859-1).

You need to either specify the correct character set when connecting to the mySQL database, or convert the incoming characters so they can be output correctly.

For more information, you would have to tell us what collation your database and tables is in, and what you are using to look at the data.

If you want to get into the basics of this, this is very good reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Pekka
  • 442,112
  • 142
  • 972
  • 1,088
  • Thank you Pekka. The issue I think is this: the application to the database didn't handle the data correctly and so, the data in the table is actually latin1 but the database thinks it is utf8. –  Mar 16 '10 at 19:14
  • @Eric aww, that sucks. I don't think that can be solved within the database (except of course by running a series of `set field = REPLACE(field, 'ó', 'ó')` operations.) Do you have access to a scripting language? In PHP, a `utf8_decode()` would do the trick I think. – Pekka Mar 16 '10 at 19:22
  • I think you are right. So, I guess I will try to find some creative way of getting this one. I can use Python. Thanks Pekka. –  Mar 17 '10 at 21:52
  • @Eric all right. Not directly related but maybe an inspiration: http://stackoverflow.com/questions/1177316/decoding-double-encoded-utf8-in-python – Pekka Mar 17 '10 at 22:37