42

I have a table with a couple thousand rows. The description and summary fields are NTEXT, and sometimes have non-ASCII chars in them. How can I locate all of the rows with non ASCII characters?

TheSoftwareJedi
  • 34,421
  • 21
  • 109
  • 151

9 Answers9

66

I have sometimes been using this "cast" statement to find "strange" chars

select 
    *
from 
    <Table>
where 
    <Field> != cast(<Field> as varchar(1000))
Kong
  • 8,792
  • 15
  • 68
  • 98
CC1960
  • 685
  • 5
  • 2
18

First build a string with all the characters you're not interested in (the example uses the 0x20 - 0x7F range, or 7 bits without the control characters.) Each character is prefixed with |, for use in the escape clause later.

-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)

-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
    begin
    -- Uses | to escape, could be any character
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
    end

The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.

select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'

The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.

Hope this is useful, and thanks to JohnFX's comment on the other answer.

Andomar
  • 232,371
  • 49
  • 380
  • 404
  • You may want to add a few(or all) of the characters below 32 as well, especially important would be Carriage Return(13), Line Feed (10), and Tab (9). – Chris Shaffer Mar 27 '09 at 16:42
  • 1
    I found your technique to be the most reliable way to do this in SQL-Server. – cusman Mar 11 '13 at 02:46
  • We combined this with https://stackoverflow.com/a/2084732/426894 and https://www.decisivedata.net/blog/how-to-find-a-hidden-unicode-character-using-sql-server to quickly identify data with invalid characters with the row id, position and unicode value. Thank you! – asawyer Jul 09 '20 at 14:28
11

Here ya go:

SELECT *
FROM Objects
WHERE 
    ObjectKey LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'
stuartd
  • 70,509
  • 14
  • 132
  • 163
petejamd
  • 523
  • 1
  • 10
  • 14
3

It's probably not the best solution, but maybe a query like:

SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'

Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).

Chris Shaffer
  • 32,199
  • 5
  • 49
  • 61
  • Wouldn't this just match rows that contain any ASCII character, as opposed only ASCII characters? – Andomar Mar 26 '09 at 18:38
  • The ^ marker at the front of the expression means NOT, so no. It would get any row that had at least one character that wasn't in the ranges specified. – JohnFx Mar 26 '09 at 19:08
  • How can I put the full ascii set in that expression? it's HTML data that I'm looking at so "/><' etc... is in there. – TheSoftwareJedi Mar 27 '09 at 14:22
  • The answer I placed checks against the full ascii set, it should work with >/< because those get escaped. – Andomar Mar 27 '09 at 14:58
  • @TheSoftwareJedi - you would need to add the characters to the list (eg, '%[^0-9a-zA-Z<>/"'']%'). Andomar's solution programmatically builds a complete list, you could use that. – Chris Shaffer Mar 27 '09 at 16:41
3

Technically, I believe that an NCHAR(1) is a valid ASCII character IF & Only IF UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar) though that may not be exactly what you intended. Therefore this would be a correct solution:

;With cteNumbers as
(
    Select ROW_NUMBER() Over(Order By c1.object_id) as N
    From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
    Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX)))
Where UNICODE(Substring(TXT, n.N, 1)) > 255
    OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))

This should also be very fast.

RBarryYoung
  • 55,398
  • 14
  • 96
  • 137
  • ASCII is only up to 127. Also your numbers cte is weird - the final solution should use a preexisting numbers table instead of it. Otherwise, this is how I would do it. – Adam A Jun 20 '09 at 12:41
  • 1
    FYI: "Not appearing weird" is not a criteria that I (or most of the uber-geeks on this site) really care about. And the advantage of not using a pre-existing numbers table, is that it works even if said table does not pre-exist (which it usually does not). Finally, if you test my solution, you will find that it performs comparable to and sometimes even better than a pre-existing numbers table. – RBarryYoung Jan 01 '12 at 22:51
  • I don't really want to argue - I mostly agree with your solution. But "weird" is usually hard to read and unmaintainable. That's how I meant it. It's also not DRY if you ever have two different queries that need numbers - you have to write the unmaintainable cte twice. Reading row numbers from a join of system_columns on itself is not very straightforward, but reading numbers from a numbers table would be very easy to debug. – Adam A Feb 27 '12 at 21:55
  • I tried the above but found it to be **much** slower then CC1960's answer (5:34 compared to CC1960's 0:09). I did rewrite the query to move the "with" table inline: `SELECT DISTINCT FROM t JOIN (SELECT ROW_NUMBER() OVER(ORDER BY c1.object_id) as rNum From sys.system_columns c1, sys.system_columns c2) AS n ON n.rNum <= Len() Where UNICODE(Substring(, n.rNum, 1)) > 255 OR UNICODE(Substring(, n.rNum, 1)) <> ASCII(Substring(, n.rNum, 1))`
    – Trisped Mar 21 '12 at 19:15
  • I tried using the original query and found the same results (4:42). It should be noted that the original query does not work in SQL 2008 R2 since object_id is not resolved. It should also be noted that removing `UNICODE(Substring(, n.rNum, 1)) > 255 OR ` from the where statement of my previous query resulted in a very small time reduction (5:12 vs 5:34). – Trisped Mar 21 '12 at 19:28
  • My answer is slower because it is ***correct***. That's usually a pre-requisite for performance testing. So if you could, please correct or withdraw your false claims about my solution in this thread. – RBarryYoung Mar 21 '12 at 21:48
  • What's interesting though is that you went out or your way to point out a minor typographical error in my query (which I usually appreciate), but did not bother to point out that the solution that you are praising will not compile or execute in *any* version of SQL Server. At least not for the Question stated in this thread. Makes me think that you are not exactly either qualified or objective in this testing. – RBarryYoung Mar 21 '12 at 22:03
  • Perfect, just what I needed. Thank you very much!! A small typo: "Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX))", missing a closing bracket. Needs to be "Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX)))" – Alan Ball Sep 14 '17 at 13:24
2

I started with @CC1960's solution but found an interesting use case that caused it to fail. It seems that SQL Server will equate certain Unicode characters to their non-Unicode approximations. For example, SQL Server considers the Unicode character "fullwidth comma" (http://www.fileformat.info/info/unicode/char/ff0c/index.htm) the same as a standard ASCII comma when compared in a WHERE clause.

To get around this, have SQL Server compare the strings as binary. But remember, nvarchar and varchar binaries don't match up (16-bit vs 8-bit), so you need to convert your varchar back up to nvarchar again before doing the binary comparison:

select *
from my_table
where CONVERT(binary(5000),my_table.my_column) != CONVERT(binary(5000),CONVERT(nvarchar(1000),CONVERT(varchar(1000),my_table.my_column)))
neuracnu
  • 119
  • 3
  • 7
1

If you are looking for a specific unicode character, you might use something like below.

   select  Fieldname from 
     (
      select Fieldname,
             REPLACE(Fieldname COLLATE Latin1_General_BIN,
             NCHAR(65533) COLLATE Latin1_General_BIN,
             'CustomText123') replacedcol
      from table
     ) results where results.replacedcol like '%CustomText123%'
johnnyRose
  • 7,310
  • 17
  • 40
  • 61
0

My previous answer was confusing UNICODE/non-UNICODE data. Here is a solution that should work for all situations, although I'm still running into some anomalies. It seems like certain non-ASCII unicode characters for superscript characters are being confused with the actual number character. You might be able to play around with collations to get around that.

Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill that as well.

You also might need to play around with the numeric range, since unicode characters can go beyond 255.

CREATE TABLE dbo.Numbers
(
    number  INT NOT NULL,
    CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT

SET @i = 0

WHILE @i < 1000
BEGIN
    INSERT INTO dbo.Numbers (number) VALUES (@i)

    SET @i = @i + 1
END
GO

SELECT *,
    T.ID, N.number, N'%' + NCHAR(N.number) + N'%'
FROM
    dbo.Numbers N
INNER JOIN dbo.My_Table T ON
    T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
    T.summary LIKE N'%' + NCHAR(N.number) + N'%'
and t.id = 1
WHERE
    N.number BETWEEN 127 AND 255
ORDER BY
    T.id, N.number
GO
Tom H
  • 46,766
  • 14
  • 87
  • 128
  • The way I understand it, ASCII is 7 bit and varchar is 8 bit. So varchar can still store a lot of characters that aren't ascii, like ä or é. – Andomar Mar 26 '09 at 19:32
  • 1
    Extended ASCII is 8 bit, which is what some people are referring to when they say "ASCII". I'll edit the post to limit to normal ASCII as well. – Tom H Mar 27 '09 at 14:43
  • THis won't work for the % or _ characters? And isn't an inner join slower than a LIKE statement (like in my answer)? – Andomar Mar 27 '09 at 16:09
-1

-- This is a very, very inefficient way of doing it but should be OK for -- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply -- looks for characters with bit 7 set.

SELECT  *
FROM    yourTable as t
WHERE   EXISTS ( SELECT *
                 FROM   msdb..Nums as NaturalNumbers
                 WHERE  NaturalNumbers.n < LEN(t.string_column)
                        AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)  
TheSoftwareJedi
  • 34,421
  • 21
  • 109
  • 151
Paul Harrington
  • 772
  • 4
  • 9