Verify that a varchar follows a certain pattern

Question

CREATE TABLE CODES(analysis_date Date, code Varchar(50));

INSERT INTO CODES VALUES('2021-01-01','PROD_OP21_115');
INSERT INTO CODES VALUES('2021-02-06','PROD_TR21_009R');
INSERT INTO CODES VALUES('2021-03-14','PROD_TR21_002');
INSERT INTO CODES VALUES('2021-05-01','PROD_OP21_050R');
INSERT INTO CODES VALUES('2020-06-04','PROD_OP20_450');

I have a column that stores a string, for example PROD_OP21_182. I am developing a query that returns all codes that are correctly entered (meet all conditions).

It always starts with PROD_.
After PROD_, there should be one of these acronyms: OP, UG or TR.
21 means the last two digits of the year given by another column (analysis date).
A last _ and three numbers.
The code can end with an R (it is optional). Example: PROD_OP21_182R.
Uppercase always.

This is my attempt:

select CODE
from CODES
where SUBSTRING(CODE, 1, 4) = 'PROD'
and SUBSTRING(CODE, 5, 1) = '_'
and SUBSTRING(CODE, 6, 2) in ('OP','UG','TR')
and RIGHT(YEAR(analysis_date), 2) = SUBSTRING(CODE, 8, 2) 
and SUBSTRING(CODE, 10, 1) = '_'
and ( 
(len(CODE) = 13)
or 
(len(CODE) = 14) and CODE like '%R')

It works partially. I still need to check that the three characters in SUBSTRING(dispatch_number, 11, 3) are numbers, and that the rule of capital letters is fulfilled (simply, there cannot be lowercase in the code).

You're just writing a query? You aren't asking us how to constrain future input to the table? — Ann L., Sep 24 '21 at 20:12
@AnnL. The query is supposed to return all the codes that meet the structure. — TomuRain, Sep 24 '21 at 20:14
I'll edit the question clarifying this, because it was ambiguous. — TomuRain, Sep 24 '21 at 20:18
I'd probably just chain a bunch of `LIKE`s -- not elegant but it gets the job done, has predictable indexing behavior and is easy to understand. `code LIKE CONCAT('PROD_OP', YEAR(analysis_date) % 100, '_[0-9][0-9][0-9]') COLLATE Latin1_General_CS_AS OR code LIKE CONCAT('PROD_OP', YEAR(analysis_date) % 100, '_[0-9][0-9][0-9]R') COLLATE Latin1_General_CS_AS` and again for `UG` and `TR`. The repeated `COLLATE` can be left out if you make it part of your column definition instead. — Jeroen Mostert, Sep 24 '21 at 20:25
for numbers, did you try ISNUMERIC ? ISNUMERIC(substring(code,11,3))=1 — Hallah, Sep 24 '21 at 20:29
@Hallah: `ISNUMERIC` is a bad function that should almost never be used. For example, `'-'`, `'.'` and `'$'` are all strings `ISNUMERIC` considers "numeric" that most sane people would not. — Jeroen Mostert, Sep 24 '21 at 20:36
While checking caps of your string, you can use [COLLATE Latin1_General_CS_AS](https://stackoverflow.com/a/31942798/12834817) — Hallah, Sep 24 '21 at 20:39
@Jeroen [also `'2d4'`, `'5e2'`, and `char(9)`](https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=23d56113d4d28410c759fa9faa4eae65) — Aaron Bertrand, Sep 24 '21 at 20:46
If at all possible you should see about fixing this table. It violates 1NF by storing multiple pieces of information in a single tuple. Those should be columns in the table and then the code could be a computed column by munging all that information into the string you want. The big advantage to this would be you can then easily generate constraints on each element so they can be validated. — Sean Lange, Sep 24 '21 at 21:22
Unless I am missing something - the only values you need to store are the location and the last portion of the string. If you have those 2 columns then you can generate the rest of the string and put that in a computed column. Then - there is no chance the computed column is not 'entered' incorrectly. — Jeff, Sep 26 '21 at 15:34
The one comment I'll add to the excellent ones you already have, is that if you're looking for outliers, you're going to need to NEGATE all of this and look for ones that DON'T fit your pattern. — Ann L., Sep 27 '21 at 12:17

score 1 · Answer 1 · answered Sep 24 '21 at 21:47

First of all kudos on posting sample data and a table structure to make it easy for anybody to help. I wish everybody could do that. As I mentioned in my comment the table structure you have here is why this is so difficult. You are storing a bunch of information in a single tuple which violates 1NF. However we can't always fix that and have to query the mess we are given.

This is one way to solve this one. Instead of substring I am using parsename. It is a little more code but I find using substring really fiddly because when something changes in the logic you have to go fiddle with every instance of substring to get the new values right.

select *
from CODES
where parsename(replace(code, '_', '.'), 3) COLLATE Latin1_General_CS_AS = 'PROD'
    AND
    (
        left(parsename(replace(code, '_', '.'), 2), 2) in ('OP','UG','TR')
        OR
        try_parse(right(parsename(replace(code, '_', '.'), 2), 2) as int) = datepart(year, analysis_date)
    )
    AND
    (
        parsename(replace(code, '_', '.'), 1) like '[0-9][0-9][0-9]' 
        OR
        parsename(replace(code, '_', '.'), 1) COLLATE Latin1_General_CS_AS like '[0-9][0-9][0-9][R]'
    )

Everything must go in a tuple since each code represents a drill hole in a mining company. PROD stands for production drill; OP, UG, etc. they are vein locations (open pit, underground, etc). Anyway, the reason why everything should go in a single tuple (in addition to company policies), is that this is exported to a geographic modeling program, and geologists want each name to be as representative as possible. — TomuRain, Sep 24 '21 at 22:11

score 1 · Answer 2 · answered Sep 24 '21 at 23:55

This code seems to meet the requirements. All [A-Z] letter comparisons are done specifying the case sensitive Latin1_General_CS_AS collation.

select *
from #codes
where substring(code, 1, 5) COLLATE Latin1_General_CS_AS='PROD_'
      and SUBSTRING(code, 6, 2) COLLATE Latin1_General_CS_AS in ('OP','UG','TR')
      and RIGHT(YEAR(analysis_date), 2) = SUBSTRING(CODE, 8, 2)
      and SUBSTRING(code, 10, 4) like '_[0-9][0-9][0-9]'
      and (len(CODE) = 13
           or 
           (len(CODE) = 14 and right(CODE, 1) COLLATE Latin1_General_CS_AS='R'));

analysis_date   code
2021-01-01      PROD_OP21_115
2021-02-06      PROD_TR21_009R
2021-03-14      PROD_TR21_002
2021-05-01      PROD_OP21_050R
2020-06-04      PROD_OP20_450

Serg · Answer 3 · 2021-09-24T20:33:33.840

0

You can use a case sensitive collation, for example

..
and SUBSTRING(CODE, 6, 2) COLLATE Latin1_General_CS_AS in ('OP','UG','TR')
..

and LIKE with a simple regex supported by SQL Server

.. 
and SUBSTRING(CODE, 11, 3) like '[0-9][0-9][0-9]' 
..

edited Sep 24 '21 at 20:33

answered Sep 24 '21 at 20:25

Serg

22,285
5
21
48

since there are 3 numbers, shouldn't be like [0-9][0-9][0-9]? – Hallah Sep 24 '21 at 20:31
@Hallah, definitely. Thank you. – Serg Sep 24 '21 at 20:33

Verify that a varchar follows a certain pattern

3 Answers3