3

Related question to How can I find non-ASCII characters in MySQL?.

I want to check for col1 and col2 in my table given below for the cases where non keyboard characters are present.

+------------+----------+
|    col1    |    col2  |
+------------+----------+
| rewweew\s  |  4rtrt   |
| é          |  é       |
| 123/       |  h|h     |
| ëû         |  û       |
| ¼          |  ¼       |
| *&^        |  *%$     |
| #$         |  ~!`     |
+------------+----------+

My desired result will look like

 +--------+-------+
 |   é    |   é   |
 |   ëû   |   û   |
 |   ¼    |   ¼   |
 +--------+-------+

In my case all characters present in English keyboard are allowed, i have to only find out row which have character not present in English keyboard like Chinese character etc.

I got the below mentioned query from the link How can I find non-ASCII characters in MySQL?

SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';

But its not working because character ~`@!#$%^&*()_-+=|}]{[':;?/>.<, are also allowed but its neglecting them.

Community
  • 1
  • 1
Ankit Sharma
  • 3,923
  • 2
  • 29
  • 49
  • Which keyboard are you talking about? Chinese keyboards will have chinese characters; mine has weird stuff like `ÄÖÜß` on it. – Pekka Jul 31 '12 at 12:03
  • Except for a few exceptions, the English keyboard is largely identical with the ASCII character set, so the question you link to should work? – Pekka Jul 31 '12 at 12:04
  • @Pekka the solution present to that question is not working – Ankit Sharma Jul 31 '12 at 12:05
  • Do include that information in your question! Otherwise, people will close as duplicate. – Pekka Jul 31 '12 at 12:06
  • You have to add the characters you want to allow to the regex. – JJJ Jul 31 '12 at 12:22
  • the character list i am allowing outside alphanumeric characters are ~`@!#$%^&*()_-+=|}]{[':;?/>.<, which are creating problem when adding to reexp – Ankit Sharma Jul 31 '12 at 12:26
  • I'm thinking a clever way to do this might be to force cast the column into ASCII, and to compare it with the original string. Not sure whether that is possible in mySQL though, and can't find out right now – Pekka Jul 31 '12 at 12:27
  • @Pekka `Ollie Jones` have done what you are saying – Ankit Sharma Jul 31 '12 at 13:25

2 Answers2

11

This may be worth a try.

SELECT whatever
  FROM tableName 
 WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)

The CONVERT(col USING charset) function will turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.

Of course it's based on what is and isn't in the ASCII character repertoire, not what's on a particular keyboard. But it should probably do the trick for you. See this for more discussion. http://dev.mysql.com/doc/refman/5.0/en/charset-repertoire.html

You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)

Edit Your comment mentioned that you need also to detect some characters that are in the ASCII character set. I think you're asking about the so-called control characters, which have values from 0x00 to 0x1f, and then 0x7f. @Joni Salonen's approach helps get us there, but we need to do it in a way that's multibyte-character safe.

 SELECT whatever
   FROM tableName
  WHERE CONVERT(columnToCheck USING ASCII) <> columnToCheck
     OR CONVERT(columnToCheck USING ASCII) RLIKE '[[.NUL.]-[.US.][.DEL.]]'

If you look at http://www.asciitable.com/, you'll see that the OR clause here detects characters in the first column of the ASCII table, and the last character in the fourth column.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • There is one problem with your query, i want to filter out records which have backslash,new tab etc but those cannot be find out by your query. Specially the character '' is also not detectable with your query – Ankit Sharma Aug 01 '12 at 07:36
  • Thanx again for fixing the querry...now i can apply it on live data – Ankit Sharma Aug 02 '12 at 12:28
  • Could you explain what you say about "multibyte-character safe"?Also, the query in your edit lets through any row where the column contains a control character even if it also contains non-ASCII characters. – Joni Mar 22 '13 at 11:12
3

This query will return the rows that have characters outside of the ASCII range 0 - 127:

SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '^[[.NUL.]-[.DEL.]]*$'

By English keyboard do you mean American or UK keyboard? The UK keyboard includes some non-ASCII characters, like the sterling pound symbol. If you want to accept those too you have to add them to the regular expression.

Joni
  • 108,737
  • 14
  • 143
  • 193