0

I have a table with many columns with 2.1M rows. Here are the columns which are related with my problem :

Column_name Type    Computed    Length  Prec    Scale   Nullable    TrimTrailingBlanks  FixedLenNullInSource    Collation
id          int     no          4       10      0       no          (n/a)               (n/a)                   NULL
val         varchar no          15                      yes         no                  yes                     SQL_Latin1_General_CP1_CI_AS

I want to return rows which contain characters other than A-Z, a-z, 0-9, (space) and _ in column val. Sample Data :

INSERT INTO tabl
            (id, val)
VALUES      (1, 'Extemporè'),
            (2, 'Aâkash'),
            (3, 'Driver 12'),
            (4, 'asd'),
            (5, '10'),
            (6, 'My_Car'),
            (7, 'Johnson & Sons'),
            (8, 'Johan''s Service'), 
            (9, 'Indus Truck')

Expected output :

id  val
--  -----------
1   Extemporè
2   Aâkash
7   Johnson & Sons
8   Johan's Service

I found similar question here but it is also not giving expected results :

SELECT *
FROM   tabl
WHERE  val LIKE '%[^A-Z0-9 _]%' 

Gives result :

id  val
--  ----------
7   Johnson & Sons
8   Johan's Service
Community
  • 1
  • 1
Deep
  • 3,162
  • 1
  • 12
  • 21
  • SELECT * FROM tbl where PATINDEX('%[^a-zA-Z0-9 _]%',Col) > 0 – knkarthick24 Nov 26 '14 at 06:27
  • @knkarthick24 It is giving same result as my query, not returning row with id 1 and 2. – Deep Nov 26 '14 at 06:28
  • What results are you getting that you think you shouldn't? What's your collation? Try `WHERE val LIKE '%[^a-zA-Z0-9 _]%' COLLATE Latin1_General_CI_AI`. Actually, you can probably use `'%[^A-Z0-9 _]%'` – Bacon Bits Nov 26 '14 at 06:35
  • @BaconBits Collation information is given in Column information box. I tried your query, it is also giving same result. Edited question with your suggestion of using `'%[^A-Z0-9 _]%'`. – Deep Nov 26 '14 at 06:40
  • It's gotta be a collation problem. Try coercing both sides? `WHERE val COLLATE Latin1_General_CI_AI LIKE '%[^a-zA-Z0-9 _]%' COLLATE Latin1_General_CI_AI` – Bacon Bits Nov 26 '14 at 06:44
  • @BaconBits Thanks for attention. I don't know much about collation, I am reading the article on Microsoft site and above where clause is also giving same result. – Deep Nov 26 '14 at 06:48
  • 1
    [Here](http://stackoverflow.com/questions/2461522/how-do-i-perform-an-accent-insensitive-compare-e-with-%C3%A8-%C3%A9-%C3%AA-and-%C3%AB-in-sql-ser) is a similar question. – Bacon Bits Nov 26 '14 at 06:50

2 Answers2

3

I would do this with the help of a collation like Latin1_General_BIN like this:

SELECT *
FROM   tabl
WHERE  val COLLATE Latin1_General_BIN LIKE '%[^A-Za-z0-9 _]%'

It would seem easier this way because BIN collations are both case-sensitive and accent-sensitive and, moreover, accented characters are collated separately from non-accented ones. The latter means that it is easy to specify non-accented letters in the form of a range. (But case sensitivity means you also have to specify letters of both cases explicitly, as you can see above.)

Andriy M
  • 76,112
  • 17
  • 94
  • 154
1

updated answer: the use of temporary table is to exclude values such as "Driver" or "Indus Truck"; the temporary table also forces a collation change for values such as "Aâkash" - this is to make sure correct values are not qualified for exclusion in the join.
Note: special characters such as ' or & that are contained in correct values must be manually added to the list (where marked below).

create table #tabl(id int, val varchar(15))

insert #tabl(id, val)
select i.id, cast(i.val as varchar(200)) Collate SQL_Latin1_General_CP1253_CI_AI as val
from tabl i
where i.val <> upper(i.val) Collate SQL_Latin1_General_CP1_CS_AS
    and i.val <> lower(i.val) Collate SQL_Latin1_General_CP1_CS_AS
    and i.val not like '%[0-9]%'
    and i.val not like '%[_]%'
    and i.val not like '%[]%'
    and i.val not like '%[''&]%' -- add special characters (like ' or &) that are permitted in this list; 
                            -- this is the only "manual" requirement for this solution to work.

select t.id, t.val
from tabl t
left join #tabl tt on t.val = tt.val
where tt.val is null
    and t.val <> upper(t.val) Collate SQL_Latin1_General_CP1_CS_AS
    and t.val <> lower(t.val) Collate SQL_Latin1_General_CP1_CS_AS
    and t.val not like '%[0-9]%'
    and t.val not like '%[_]%'
    and t.val not like '%[]%'
Eduard Uta
  • 2,477
  • 5
  • 26
  • 36
  • Thanks for the answer. I checked your query with actual data and I found that it will return those records too which have combination of upper case and lower case alphabets. for example if the data is `Driver` instead of `Driver 12` in row 3 then it will return that which I don't want. Updated such record in sample data. – Deep Nov 26 '14 at 09:05
  • @DeepakPawar, updated answer to match the additional rule. Please retry and also check the note (the special characters that are allowed in values should be identified and noted there in the list). – Eduard Uta Nov 26 '14 at 10:32
  • This will work so +1 but I just got an answer from another source that define `a-z` characters explicitly in the search string like `WHERE val LIKE '%[^abcdefghijklmnopqrstuvwxyz0-9_ ]%'` and I suppose it is easier to add valid characters string than special characters. – Deep Nov 26 '14 at 10:54