0

I've got ID# that are numeric strings with at least 4 digits. They are supposed to be in the ID column, but sometimes they are buried in different spots within free text in a separate Comments column, with 0s or blanks in the ID column. There are also 4-digit years in the Comments, which might trip things up:

Name ID Comments
Room A 48625
Room B 8056
Room C 887654
Room C 0 code is 887012 according to staff in 2021
Cellar ID#5467
Cellar 00000 IDcode#5467 - entered by RS 5/1/2021
Room D unknown updated 2022

Hoping to get something like this, where the non-IDs in the ID column are ignored, the possible IDs are extracted from the Comments, and a column added saying whether the ID found in the ID column or mentioned as "ID" or "Code" in the Comments column, to confirm that it's not a year:

Name ID MentionedAs
Room A 48625 ID column
Room B 8056 ID column
Room C 887654 ID column
Room C 887012 code
Cellar 5467 ID

The beginner basics I have so far is:

select distinct
    NAME, ID, coalesce(ID, COMMENTS) as ID, 
    (case when ID is not null then 'ID column' 
          when Comments like 'ID%' then 'ID' 
          when Comments like '%code%' then 'Code'
          else null
     end) as MentionedAs
from TABLE
WHERE
    (not ID='00000' and ID like '%[0-9][0-9][0-9][0-9]%') or
    (Comments like '%[^/^0-9][0-9][0-9][0-9]%' and Comments like 'ID%' or
     Comments like '%code%) 

but this is incomplete and I'm unsure how to pull just the possible IDs strings out of the Comments, or account for all the IDs that might be wrong in the future (ex., someone entering ten 0s), etc.

I found other solutions where functions are used to extract strings, but I'm not allowed to create functions.

Olivier Jacot-Descombes
  • 104,806
  • 13
  • 138
  • 188
Jerem
  • 31
  • 2
  • 1
    This is a case of dirty data (no data integrity). Why are people putting the ID's in a comments column instead of the column reserved for it? How many special cases of finding the ID in the comments column are there? There could be hundreds of variations. Seems it would work better if there was a policy in place to make sure that ID is entered in a record before insertion. – Ryan Wilson Oct 27 '22 at 16:37
  • You can use the [SQL Server PATINDEX() Function](https://www.w3schools.com/sql/func_sqlserver_patindex.asp) to retrieve the position of a pattern and then use `SUBSTRING` to extract it. But is would be easier to do this processing in the front end where you can use the full power of a modern programming language. – Olivier Jacot-Descombes Oct 27 '22 at 16:43
  • Completely agree @RyanWilson. Unfortunately policy changes are unlikely to happen and the data will be needing extensive manipulation, as is the way too often. – Jerem Oct 28 '22 at 15:08
  • Thank you @OlivierJacot-Descombes! I tried PATINDEX, but was not able to extract just the targeted strings, since there is additional unneeded text after the targeted strings. Think it could be done with PATINDEX and SUBSTRING through multiple temporary tables. Any suggestions? – Jerem Oct 28 '22 at 15:18
  • SQL is just not the right language for complex expression parsing. If you can use CLR functions on your SQL-Server, you could use [SQL Server Regular Expressions Library Sample](https://techcommunity.microsoft.com/t5/modernization-best-practices-and/sql-server-regular-expressions-library-sample/ba-p/3101875) – Olivier Jacot-Descombes Oct 28 '22 at 15:45

0 Answers0