Selecting all rows with non-UTF8 characters

Question

I am trying to select all rows that have been mangled in our database and contain non-UTF8 characters ... Is this best by a regex?

Currently, I have tried "like '%Ã%'", which works fairly well, but not 100% by a long way. This regex isn't great as it pulls back all of our successfully 'translated back into utf8' characters as well as spaces etc (REGEXP '(\S+[^A-Za-z0-9]+)'"). Although the latter are easy enough to get out, am not sure if regex the best route.

Example rows not being selected included characters such as "diÃ©", "yÃ¼cel" and "GrÃ¥berg".

Thanks

score 0 · Accepted Answer · answered Feb 08 '11 at 19:33

I'm not sure this can (or perhaps should) be phrased in a SQL query. You can merely iterate through the entire database, and see if each row is valid UTF-8 data, outputting the keys to the rows that are not. I'm not sure if PHP has a ready-made function for seeing if "these bytes are valid UTF-8", but Python does - if you know that, you might consider it. I at one point wrote a program to do just this, as I was having similar problems. (I do not have the source code, sorry.)

One thing to be aware of: it is possible, but unlikely, that a string, accidentally stored and encoded as ISO-8859-1, is also a valid UTF-8 string.

I'm not aware of how MySQL handles strings, but does MySQL allow you to insert a non-UTF-8 string into a UTF-8 database? (PostgreSQL, for example, will not allow you to do this on UTF-8 databases.)

Thanks, yes. I am now just selecting out the entire database and using PHP to show all UTF8 rows. This does mean a lot more data than I want, but does make sure I see every single row that could be in error. It was the FixUTF8 script from here that I used .. http://stackoverflow.com/questions/1401317/remove-non-utf8-characters-from-string — elyob, Feb 09 '11 at 10:41

Selecting all rows with non-UTF8 characters

1 Answers1