I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?
I have the following table:
Select
name,
address,
description
from dbo.users
I would like to search all this table for any characters that are UNICODE but not ASCII. Is this possible?
You can find non-ASCII characters quite simply:
SELECT NAME, ADDRESS, DESCRIPTION
FROM DBO.USERS
WHERE NAME != CAST(NAME AS VARCHAR(4000))
OR ADDRESS != CAST(ADDRESS AS VARCHAR(4000))
OR DESCRIPTION != CAST(DESCRIPTION AS VARCHAR(4000))
If you want to determine if there are any characters in an NVARCHAR
/ NCHAR
/ NTEXT
column that cannot be converted to VARCHAR
, you need to convert to VARCHAR
using the _BIN2
variation of the collation being used for that particular column. For example, if a particular column is using Albanian_100_CI_AS
, then you would specify Albanian_100_BIN2
for the test. The reason for using a _BIN2
collation is that non-binary collations will only find instances where there is at least one character that does not have any mapping at all in the code page and is thus converted into ?
. But, non-binary collations do not catch instances where there are characters that don't have a direct mapping into the code page, but instead have a "best fit" mapping. For example, the superscript 2 character, ²
, has a direct mapping in code page 1252, so definitely no problem there. On the other hand, it doesn't have a direct mapping in code page 1250 (used by the Albanian collations), but it does have a "best fit" mapping which converts it into a regular 2
. The problem with the non-binary collation is that 2
will equate to ²
and so it won't register as a row that can't convert to VARCHAR
. For example:
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE French_100_CI_AS); -- Code Page 1252
-- ²
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS); -- Code Page 1250
-- 2
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_CI_AS));
-- (no rows returned)
SELECT CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2)
WHERE N'²' <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), N'²' COLLATE Albanian_100_BIN2));
-- 2
Ideally you would convert back to NVARCHAR
explicitly for the code to be clear on what it's doing, though not doing this will still implicitly convert back to NVARCHAR
, so the behavior is the same either way.
Please note that only MAX
types are used. Do not use NVARCHAR(4000)
or VARCHAR(4000)
else you might get false positives due to truncation of data in NVARCHAR(MAX)
columns.
So, in terms of the example code in the question, the query would be (assuming that a Latin1_General
collation is being used):
SELECT usr.*
FROM dbo.[users] usr
WHERE usr.[name] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[name] COLLATE Latin1_General_100_BIN2))
OR usr.[address] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[address] COLLATE Latin1_General_100_BIN2))
OR usr.[description] <> CONVERT(NVARCHAR(MAX),
CONVERT(VARCHAR(MAX), usr.[description] COLLATE Latin1_General_100_BIN2));
There doesn't seem to be an inbuilt function for this as far as I can tell. A brute force approach is to pass each character to ascii
and then pass the result to char
and check if it returns '?', which would mean the character is out of range. You can write a UDF with the below code as reference, but I should think that it is a very inefficient solution:
declare @i int = 1
declare @x nvarchar(10) = N'vsdǣf'
declare @result nvarchar(100) = N''
while (@i < len(@x))
begin
if char(ascii(substring(@x,@i,1))) = '?'
begin
set @result = @result + substring(@x,@i,1)
end
set @i = @i+1
end
select @result