How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table?

Question

I have a table with a couple thousand rows. The description and summary fields are NTEXT, and sometimes have non-ASCII chars in them. How can I locate all of the rows with non ASCII characters?

score 66 · Answer 1 · edited Nov 18 '16 at 02:05

66

I have sometimes been using this "cast" statement to find "strange" chars

select 
    *
from 
    <Table>
where 
    <Field> != cast(<Field> as varchar(1000))

edited Nov 18 '16 at 02:05

Kong

8,792
15
68
98

answered Apr 21 '10 at 13:54

CC1960

685
5
2

1

This worked for me and was much faster then RBarryYoung's solution. – Trisped Mar 21 '12 at 19:29
2

It doesn't work if the collation is set to an "non-ASCII" collation. – deerchao May 06 '13 at 10:24
Excellent & simple solution :-) – Venkatesh Muniyandi Jun 30 '16 at 19:05
2

@CC1960 I'm getting the error `The data types ntext and nvarchar(max) are incompatible in the not equal to operator.` Ideas? - Using SQL Server 2005 – Felipe Correa Apr 20 '17 at 21:31
1

Very helpful if you need Unicode but non-ASCII characters; I've also been dealing with some Latvian which isn't contained in ISO 8859-1 – Nate Anderson Sep 29 '17 at 01:07

Andomar · Answer 2 · 2009-03-27T17:43:44.340

First build a string with all the characters you're not interested in (the example uses the 0x20 - 0x7F range, or 7 bits without the control characters.) Each character is prefixed with |, for use in the escape clause later.

-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)

-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
    begin
    -- Uses | to escape, could be any character
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
    end

The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.

select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'

The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.

Hope this is useful, and thanks to JohnFX's comment on the other answer.

You may want to add a few(or all) of the characters below 32 as well, especially important would be Carriage Return(13), Line Feed (10), and Tab (9). — Chris Shaffer, Mar 27 '09 at 16:42
I found your technique to be the most reliable way to do this in SQL-Server. — cusman, Mar 11 '13 at 02:46
We combined this with https://stackoverflow.com/a/2084732/426894 and https://www.decisivedata.net/blog/how-to-find-a-hidden-unicode-character-using-sql-server to quickly identify data with invalid characters with the row id, position and unicode value. Thank you! — asawyer, Jul 09 '20 at 14:28

score 11 · Answer 3 · edited Oct 05 '17 at 21:50

11

Here ya go:

SELECT *
FROM Objects
WHERE 
    ObjectKey LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'

edited Oct 05 '17 at 21:50

stuartd

70,509
14
132
163

answered Nov 05 '15 at 23:33

petejamd

523
1
10
14

I think this is the best solution. Elegant and performant! Thank you! – SQL_Guy Jul 13 '19 at 16:27

score 3 · Answer 4 · answered Mar 26 '09 at 18:36

3

It's probably not the best solution, but maybe a query like:

SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'

Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).

answered Mar 26 '09 at 18:36

Chris Shaffer

32,199
5
49
61

Wouldn't this just match rows that contain any ASCII character, as opposed only ASCII characters? – Andomar Mar 26 '09 at 18:38
The ^ marker at the front of the expression means NOT, so no. It would get any row that had at least one character that wasn't in the ranges specified. – JohnFx Mar 26 '09 at 19:08
How can I put the full ascii set in that expression? it's HTML data that I'm looking at so "/><' etc... is in there. – TheSoftwareJedi Mar 27 '09 at 14:22
The answer I placed checks against the full ascii set, it should work with >/< because those get escaped. – Andomar Mar 27 '09 at 14:58
@TheSoftwareJedi - you would need to add the characters to the list (eg, '%[^0-9a-zA-Z<>/"'']%'). Andomar's solution programmatically builds a complete list, you could use that. – Chris Shaffer Mar 27 '09 at 16:41

RBarryYoung · Answer 5 · 2017-09-14T13:33:34.450

3

Technically, I believe that an NCHAR(1) is a valid ASCII character IF & Only IF UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar) though that may not be exactly what you intended. Therefore this would be a correct solution:

;With cteNumbers as
(
    Select ROW_NUMBER() Over(Order By c1.object_id) as N
    From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
    Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX)))
Where UNICODE(Substring(TXT, n.N, 1)) > 255
    OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))

This should also be very fast.

edited Sep 14 '17 at 13:33

answered Jun 02 '09 at 20:55

RBarryYoung

55,398
14
96
137

ASCII is only up to 127. Also your numbers cte is weird - the final solution should use a preexisting numbers table instead of it. Otherwise, this is how I would do it. – Adam A Jun 20 '09 at 12:41
1

FYI: "Not appearing weird" is not a criteria that I (or most of the uber-geeks on this site) really care about. And the advantage of not using a pre-existing numbers table, is that it works even if said table does not pre-exist (which it usually does not). Finally, if you test my solution, you will find that it performs comparable to and sometimes even better than a pre-existing numbers table. – RBarryYoung Jan 01 '12 at 22:51
I don't really want to argue - I mostly agree with your solution. But "weird" is usually hard to read and unmaintainable. That's how I meant it. It's also not DRY if you ever have two different queries that need numbers - you have to write the unmaintainable cte twice. Reading row numbers from a join of system_columns on itself is not very straightforward, but reading numbers from a numbers table would be very easy to debug. – Adam A Feb 27 '12 at 21:55
I tried the above but found it to be **much** slower then CC1960's answer (5:34 compared to CC1960's 0:09). I did rewrite the query to move the "with" table inline: `SELECT DISTINCT FROM t JOIN (SELECT ROW_NUMBER() OVER(ORDER BY c1.object_id) as rNum From sys.system_columns c1, sys.system_columns c2) AS n ON n.rNum <= Len() Where UNICODE(Substring(, n.rNum, 1)) > 255 OR UNICODE(Substring(, n.rNum, 1)) <> ASCII(Substring(, n.rNum, 1))`
– Trisped Mar 21 '12 at 19:15
I tried using the original query and found the same results (4:42). It should be noted that the original query does not work in SQL 2008 R2 since object_id is not resolved. It should also be noted that removing `UNICODE(Substring(, n.rNum, 1)) > 255 OR ` from the where statement of my previous query resulted in a very small time reduction (5:12 vs 5:34). – Trisped Mar 21 '12 at 19:28
My answer is slower because it is ***correct***. That's usually a pre-requisite for performance testing. So if you could, please correct or withdraw your false claims about my solution in this thread. – RBarryYoung Mar 21 '12 at 21:48
What's interesting though is that you went out or your way to point out a minor typographical error in my query (which I usually appreciate), but did not bother to point out that the solution that you are praising will not compile or execute in *any* version of SQL Server. At least not for the Question stated in this thread. Makes me think that you are not exactly either qualified or objective in this testing. – RBarryYoung Mar 21 '12 at 22:03
Perfect, just what I needed. Thank you very much!! A small typo: "Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX))", missing a closing bracket. Needs to be "Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX)))" – Alan Ball Sep 14 '17 at 13:24

score 2 · Answer 6 · answered Apr 17 '13 at 02:15

I started with @CC1960's solution but found an interesting use case that caused it to fail. It seems that SQL Server will equate certain Unicode characters to their non-Unicode approximations. For example, SQL Server considers the Unicode character "fullwidth comma" (http://www.fileformat.info/info/unicode/char/ff0c/index.htm) the same as a standard ASCII comma when compared in a WHERE clause.

To get around this, have SQL Server compare the strings as binary. But remember, nvarchar and varchar binaries don't match up (16-bit vs 8-bit), so you need to convert your varchar back up to nvarchar again before doing the binary comparison:

select *
from my_table
where CONVERT(binary(5000),my_table.my_column) != CONVERT(binary(5000),CONVERT(nvarchar(1000),CONVERT(varchar(1000),my_table.my_column)))

score 1 · Answer 7 · edited Nov 01 '16 at 17:43

1

If you are looking for a specific unicode character, you might use something like below.

   select  Fieldname from 
     (
      select Fieldname,
             REPLACE(Fieldname COLLATE Latin1_General_BIN,
             NCHAR(65533) COLLATE Latin1_General_BIN,
             'CustomText123') replacedcol
      from table
     ) results where results.replacedcol like '%CustomText123%'

edited Nov 01 '16 at 17:43

johnnyRose

7,310
17
40
61

answered Sep 22 '15 at 15:06

Mohan Billakanti

11
2

Tom H · Answer 8 · 2009-03-27T15:50:28.390

My previous answer was confusing UNICODE/non-UNICODE data. Here is a solution that should work for all situations, although I'm still running into some anomalies. It seems like certain non-ASCII unicode characters for superscript characters are being confused with the actual number character. You might be able to play around with collations to get around that.

Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill that as well.

You also might need to play around with the numeric range, since unicode characters can go beyond 255.

CREATE TABLE dbo.Numbers
(
    number  INT NOT NULL,
    CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT

SET @i = 0

WHILE @i < 1000
BEGIN
    INSERT INTO dbo.Numbers (number) VALUES (@i)

    SET @i = @i + 1
END
GO

SELECT *,
    T.ID, N.number, N'%' + NCHAR(N.number) + N'%'
FROM
    dbo.Numbers N
INNER JOIN dbo.My_Table T ON
    T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
    T.summary LIKE N'%' + NCHAR(N.number) + N'%'
and t.id = 1
WHERE
    N.number BETWEEN 127 AND 255
ORDER BY
    T.id, N.number
GO

The way I understand it, ASCII is 7 bit and varchar is 8 bit. So varchar can still store a lot of characters that aren't ascii, like ä or é. — Andomar, Mar 26 '09 at 19:32
Extended ASCII is 8 bit, which is what some people are referring to when they say "ASCII". I'll edit the post to limit to normal ASCII as well. — Tom H, Mar 27 '09 at 14:43
THis won't work for the % or _ characters? And isn't an inner join slower than a LIKE statement (like in my answer)? — Andomar, Mar 27 '09 at 16:09

score -1 · Answer 9 · edited May 29 '09 at 00:01

-- This is a very, very inefficient way of doing it but should be OK for -- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply -- looks for characters with bit 7 set.

SELECT  *
FROM    yourTable as t
WHERE   EXISTS ( SELECT *
                 FROM   msdb..Nums as NaturalNumbers
                 WHERE  NaturalNumbers.n < LEN(t.string_column)
                        AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)

How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table?

9 Answers9

Linked