2

I am trying to select all rows that have been mangled in our database and contain non-UTF8 characters ... Is this best by a regex?

Currently, I have tried "like '%Ã%'", which works fairly well, but not 100% by a long way. This regex isn't great as it pulls back all of our successfully 'translated back into utf8' characters as well as spaces etc (REGEXP '(\S+[^A-Za-z0-9]+)'"). Although the latter are easy enough to get out, am not sure if regex the best route.

Example rows not being selected included characters such as "dié", "yücel" and "Gråberg".

Thanks

elyob
  • 43
  • 5

1 Answers1

0

I'm not sure this can (or perhaps should) be phrased in a SQL query. You can merely iterate through the entire database, and see if each row is valid UTF-8 data, outputting the keys to the rows that are not. I'm not sure if PHP has a ready-made function for seeing if "these bytes are valid UTF-8", but Python does - if you know that, you might consider it. I at one point wrote a program to do just this, as I was having similar problems. (I do not have the source code, sorry.)

One thing to be aware of: it is possible, but unlikely, that a string, accidentally stored and encoded as ISO-8859-1, is also a valid UTF-8 string.

I'm not aware of how MySQL handles strings, but does MySQL allow you to insert a non-UTF-8 string into a UTF-8 database? (PostgreSQL, for example, will not allow you to do this on UTF-8 databases.)

Thanatos
  • 42,585
  • 14
  • 91
  • 146
  • Thanks, yes. I am now just selecting out the entire database and using PHP to show all UTF8 rows. This does mean a lot more data than I want, but does make sure I see every single row that could be in error. It was the FixUTF8 script from here that I used .. http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string – elyob Feb 09 '11 at 10:41