147

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Ed Mays
  • 1,730
  • 4
  • 13
  • 12

10 Answers10

298

MySQL provides comprehensive character set management that can help with this kind of problem.

SELECT whatever
  FROM tableName 
 WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)

The CONVERT(col USING charset) function turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.

See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html

You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • 24
    This is an excellent solution to this problem and much more robust. – CraigDouglas Oct 23 '12 at 14:33
  • 6
    this is also useful to find characters with accents (á ä etc) or character not belonging to encoding – Glasnhost Oct 22 '14 at 12:34
  • 3
    much better than using REGEXP (which doesn't seem to work for me for finding accents) and also provides a simple mechanism for making everything ascii again... – D2TheC Nov 14 '17 at 08:47
  • 1
    This answer works wonderfully and will bring up strings that contain any non-ASCII characters rather than just strings that contain *only* non-ASCII characters. Thank you! – techjp Sep 11 '18 at 07:26
  • 3
    Outstanding solution! – Mad Dog Tannen Nov 07 '18 at 10:06
94

You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query

SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';

This was the most comprehensive query I could come up with.

Nefrubyr
  • 6,936
  • 2
  • 29
  • 20
zende
  • 1,037
  • 1
  • 8
  • 5
  • 4
    Best answer so far, but it's even easier like this : `SELECT * FROM table WHERE LENGTH( column ) != CHAR_LENGTH( column )` – SuN Nov 28 '12 at 18:53
  • 16
    -1 **This can yield erroneous results.** Suppose, for example, that one has a UTF-16 column containing `'ā'` (encoded by the byte sequence `0x0101`) - it would be deemed "ASCII" using this test: *a false negative*; indeed, some character sets do not encode ASCII characters within `0x00` to `0x7f` whereupon this solution would yield a false positive. **DO NOT RELY UPON THIS ANSWER!** – eggyal Feb 11 '14 at 09:13
  • 2
    @sun: That doesn't help at all - many character sets are fixed-length and so `LENGTH(column)` will be a constant multiple of `CHAR_LENGTH(column)` irrespective of the value. – eggyal Feb 11 '14 at 09:39
74

It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:

SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9]';

That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:

SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9.,-]';

The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.

Braiam
  • 1
  • 11
  • 47
  • 78
Chad Birch
  • 73,098
  • 23
  • 151
  • 149
  • Thanks - I will take a look at that. I don't have much experience with Regular Expressions in SQL, so this will be a good opportunity to learn. – Ed Mays Dec 31 '08 at 03:47
  • 3
    Shouldn't you escape the hyphen and period? (Since they do have special meanings in a regular expression.) SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9\.,\-]'; – Tooony Mar 12 '09 at 16:03
  • 3
    @Tooony No, inside of a set, a period just means itself and the dash only has special meaning between other characters. At the end of the set, it means only itself. – Michael Speer Jan 25 '12 at 21:55
  • 12
    This query only finds all lines in tableName that do not contain an alphanumeric character. This does not answer the question. – Rob Bailey Aug 26 '13 at 17:26
  • Note the **warning** in the [documentation](https://dev.mysql.com/doc/en/regexp.html): "*The `REGEXP` and `RLIKE` operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.*" – eggyal Feb 11 '14 at 09:35
  • 12
    That is for columns that don't have any ascii characters at all, so it will miss those with a mix of ascii and non-ascii characters. The answer below from zende checks for one or more non-ascii characters. This helped me for the most part `SELECT * FROM tbl WHERE colname NOT REGEXP '^[A-Za-z0-9\.,@&\(\) \-]*$';` – Frank Forte Oct 16 '15 at 21:00
  • Frank above is right. You need the "^" at the beginning of the REGEXP to find any chars NOT in the set. Then eliminate the "NOT" in the query. So (in addition to the better answers below) something like this works: SELECT ... WHERE colname REGEXP '[^A-Za-z0-9.,-]'; – IcarusNM Mar 17 '16 at 16:46
  • Icarus's solution worked for me. In my opinion Chad's answer is not a correct answer, even though it was selected as correct. – Slam Aug 09 '17 at 16:53
  • 2
    This only works (for me anyway) to find strings that contain NONE of those characters. It does not find strings that contain a mix of ASCII and non-ASCII characters. – techjp Sep 11 '18 at 07:24
54

This is probably what you're looking for:

select * from TABLE where COLUMN regexp '[^ -~]';

It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
  • 7
    Works great for me. "regexp '[^ -~]'" means has a character that is before space " " or after "~" or ASCII 32 - 126. All letters, numbers, and symbols, but no unprintable things. – Josh Dec 11 '12 at 02:18
  • You can even get it as a tee-shirt ;) http://www.catonmat.net/blog/my-favorite-regex/ – SamGoody Feb 07 '13 at 20:52
  • 2
    Note the **warning** in the [documentation](https://dev.mysql.com/doc/en/regexp.html): "*The `REGEXP` and `RLIKE` operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.*" – eggyal Feb 11 '14 at 09:36
  • 1
    thanks for this. what i'm wondering is how to replace a replacement character - e.g. â�� – mars-o Apr 15 '14 at 14:58
  • space is first printable, ~ last printable, ^ means negate. This regexp looks a bit odd because a space is the beginning of the range – zzapper Jan 27 '15 at 11:52
  • 1
    @mars-o - the black diamond indicates an invalid utf8 character. More discussion [_here_](http://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored) – Rick James Apr 08 '17 at 19:57
  • @eggyal This works perfectly fine in utf8 at least, because utf8 preserves ASCII. I wouldn't trust it in utf16. – Brilliand Feb 19 '20 at 22:51
  • Works great, but includes line breaks – Chris May 22 '20 at 13:40
15

One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:

select * from TABLE where COLUMN like '%\0%';
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Rob Bailey
  • 1,747
  • 15
  • 18
4

Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:

SELECT * FROM `table` WHERE NOT `field` REGEXP  "[\\x00-\\xFF]|^$";

It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike @Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)

It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:

SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP  "[\\x00-\\xFF]";

It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.

Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!

Mahmoud Al-Qudsi
  • 28,357
  • 12
  • 85
  • 125
  • 1
    00-FF includes all possible 8-bit values, which is what `REGEXP` is checking. Hence it is guaranteed to always match. Also `^$` is probably not what you wanted. – Rick James Apr 08 '17 at 19:55
  • Definitely the best REGEXP solution for finding all 8 bit characters but not as good as the CONVERT(col USING charset) solution which also will allows control characters while limiting display characters to a specific charset. – techjp Dec 25 '18 at 20:07
2

Try Using this query for searching special character records

SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9@:. \'\-`,\&]'
Sachin
  • 36
  • 4
1

@zende's answer was the only one that covered columns with a mix of ascii and non ascii characters, but it also had that problematic hex thing. I used this:

SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
chiliNUT
  • 18,989
  • 14
  • 66
  • 106
-1

In Oracle we can use below.

SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
-2

for this question we can also use this method :

Question from sql zoo:
Find all details of the prize won by PETER GRÜNBERG

Non-ASCII characters

ans: select*from nobel where winner like'P% GR%_%berg';

Hemen_boro
  • 5
  • 1
  • 2