Detect UNICODE characters that are not ASCII in table

Question

I have the following table:

Select
   name,
   address,
   description
from dbo.users

I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?

Are you looking to return the characters themselves or only identify the records with such characters? @johnnybell has answered already for the latter case. — shree.pat18, Apr 23 '15 at 09:55
@shree.pat18 just the records will do ...some sort of pointer so I can find the "guilty parties" — whytheq, Apr 23 '15 at 10:08

score 3 · Answer 1 · answered Apr 23 '15 at 09:43

3

You can find non-ASCII characters quite simply:

SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))

answered Apr 23 '15 at 09:43

John Bell

2,350
1
14
23

well it runs. Just need to create some text with pure unicode in it to put my mind at rest that it highlights problems – whytheq Apr 23 '15 at 10:04
1

Thanks Johnny: `SELECT ('vsdǣf' = CAST('vsdǣf' AS VARCHAR(4000)))` ...works – whytheq Apr 23 '15 at 10:07
This isn't going to check for ASCII; It's going to check for the database default encoding, which probably isn't ASCII. To see what it is, see this [answer](http://stackoverflow.com/a/7321208/2226988). – Tom Blodget Apr 23 '15 at 11:53

Solomon Rutzky · Accepted Answer · 2019-05-01T03:38:30.787

If you want to determine if there are any characters in an NVARCHAR / NCHAR / NTEXT column that cannot be converted to VARCHAR, you need to convert to VARCHAR using the _BIN2 variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS, then you would specify Albanian_100_BIN2 for the test. The reason for using a _BIN2 collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?. But, non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping. For example, the superscript 2 character, ², has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2. The problem with the non-binary collation is that 2 will equate to ² and so it won't register as a row that can't convert to VARCHAR. For example:

SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2

SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE  N'²' <> CONVERT(NVARCHAR(MAX),
                       CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)

SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE  N'²' <> CONVERT(NVARCHAR(MAX),
                       CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2

Ideally you would convert back to NVARCHAR explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR, so the behavior is the same either way.

Please note that only MAX types are used. Do not use NVARCHAR(4000) or VARCHAR(4000) else you might get false positives due to truncation of data in NVARCHAR(MAX) columns.

So, in terms of the example code in the question, the query would be (assuming that a Latin1_General collation is being used):

SELECT usr.*
FROM   dbo.[users] usr
WHERE  usr.[name] <> CONVERT(NVARCHAR(MAX),
               CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR     usr.[address] <> CONVERT(NVARCHAR(MAX),
               CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR     usr.[description] <> CONVERT(NVARCHAR(MAX),
               CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));

score 1 · Answer 3 · answered Apr 23 '15 at 09:43

There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii and then pass the result to char and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:

    declare @i int = 1
    declare @x nvarchar(10) = N'vsdǣf'
    declare @result nvarchar(100) = N''

    while (@i < len(@x))
    begin
     if char(ascii(substring(@x,@i,1))) = '?'
       begin
         set @result = @result + substring(@x,@i,1)       
       end
     set @i = @i+1
    end       

    select @result

Detect UNICODE characters that are not ASCII in table

3 Answers3

Linked