2

I want to store URLs in a column. According to RFC 3986, US-ASCII is the character set from which URLs are composed.

SQL Server has the VARCHAR type, which can encode all the characters from the US-ASCII character set, and 128 more that are dependent on the code page.

I want to use a CHECK constraint to ensure the values in the column contains only the printable characters from the US-ASCII character set; in other words, ASCII(@char) >= 32 AND ASCII(@char) < 127 for every character in the string.

I think I can use a LIKE expression to do this in a check constraint, but I can't find the right pattern. I'm trying to adapt Itzik Ben-Gan's trick of matching any character outside the allowed range, which he presents in his article Can I convert this string to an integer?.

In my test harness I create a table @TestData of candidates to insert into my column, a table @Patterns of patterns to be used with the LIKE operator, and then I select the result of matching each pattern against each candidate:

DECLARE @TestData TABLE (
  String VARCHAR(60) COLLATE Latin1_General_CI_AS NOT NULL
);

INSERT INTO @TestData(String)
VALUES
  ('€ÿ'),
  ('ab3'),
  ('http://www.google.com/'),
  ('http://www.example.com/düsseldorf?neighbourhood=Lörick'),
  ('1234');

DECLARE @Patterns TABLE (
  Pattern VARCHAR(12) COLLATE Latin1_General_CI_AS NOT NULL
);

INSERT INTO @Patterns (Pattern)
VALUES
  ('%[^0-9]%'),
  ('%[^' + CHAR(32) + '-' + CHAR(126) + ']%');

SELECT
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID,
  String,
  Pattern,
  CASE WHEN String NOT LIKE Pattern THEN 1 ELSE 0 END AS [Match]
FROM @TestData CROSS JOIN @Patterns;

The first row inserted into @Patterns is like the pattern Itzik uses to match non-digit characters. The second row is my attempt to adapt this for characters outside the range of printable US-ASCII characters.

When I execute the above batch, I receive the following result set:

ID   String                                                   Pattern      Match
---  -------------------------------------------------------- ------------ ------
 1    €ÿ                                                       %[^0-9]%     0
 2    ab3                                                      %[^0-9]%     0
 3    http://www.google.com/                                   %[^0-9]%     0
 4    http://www.example.com/düsseldorf?neighbourhood=Lörick   %[^0-9]%     0
 5    1234                                                     %[^0-9]%     1
 6    €ÿ                                                       %[^ -~]%     0
 7    ab3                                                      %[^ -~]%     0
 8    http://www.google.com/                                   %[^ -~]%     0
 9    http://www.example.com/düsseldorf?neighbourhood=Lörick   %[^ -~]%     0
 10   1234                                                     %[^ -~]%     0

As expected, row 5 is a match because the candidate contains only digits. The candidates in rows 1 thru 4 do not contain only digits, so do not match the pattern.

As expected, the candidate in row 6 does not match the pattern because it contains 'high ASCII' characters.

I expect the candidates in rows 7, 8, and 10 to match because they contain only printable US-ASCII characters. But these do not match.

What is wrong with the pattern in the LIKE expression?

Community
  • 1
  • 1
Iain Samuel McLean Elder
  • 19,791
  • 12
  • 64
  • 80
  • 2
    You need to use a binary collate clause. Ranges in the pattern syntax are based on collation sort order. – Martin Smith May 25 '12 at 11:01
  • possible duplicate of [Querying a SQL Server 2008 table to find values in a column containing Unicode characters](http://stackoverflow.com/questions/5184429/querying-a-sql-server-2008-table-to-find-values-in-a-column-containing-unicode-c) – Martin Smith May 25 '12 at 11:01
  • It's not a duplicate; the situation is different. But the solution is the same! – Iain Samuel McLean Elder May 25 '12 at 11:15

1 Answers1

4

As suggested in the question comments, and in the answer to a similar question, I need to use a binary collation clause.

If I change the select statement to this:

SELECT
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID,
  String,
  Pattern,
  CASE WHEN String NOT LIKE Pattern COLLATE Latin1_General_BIN THEN 1 ELSE 0 END AS [Match]
FROM @TestData CROSS JOIN @Patterns;

I get the following result set:

ID   String                                                   Pattern      Match
---  -------------------------------------------------------- ------------ ------
 1    €ÿ                                                       %[^0-9]%     0
 2    ab3                                                      %[^0-9]%     0
 3    http://www.google.com/                                   %[^0-9]%     0
 4    http://www.example.com/düsseldorf?neighbourhood=Lörick   %[^0-9]%     0
 5    1234                                                     %[^0-9]%     1
 6    €ÿ                                                       %[^ -~]%     0
 7    ab3                                                      %[^ -~]%     1
 8    http://www.google.com/                                   %[^ -~]%     1
 9    http://www.example.com/düsseldorf?neighbourhood=Lörick   %[^ -~]%     0
 10   1234                                                     %[^ -~]%     1

Now the column Match contains the expected values.

Community
  • 1
  • 1
Iain Samuel McLean Elder
  • 19,791
  • 12
  • 64
  • 80