0

Right, so I am trying to write a REGEX function. Trying to write it as sort of a stored procedure, I guess (assuming that if that is the right way to go about it)?

Problem : I have a table with millions of rows. I would like to create a REGEX function that scrapes out the 6 alpha numeric CML code (mentioned in the table). Sometimes, it does not even mention the word CML and the code is still there. See Row 5 for reference below

My approach

SELECT * 
FROM [Reporting].[dbo].[Master] 
WHERE [Notes] LIKE '[C][M][L]%'

I am not too sure if that is the right way to go about it or not.

Desired result : I want to be able to write a code using a REGEX function which can look for such patterns in the notes column and put it as a separate column against its respective Part ID

This is the table:

Table

Edit - A typo at my end occurred for L987234. The length of these characters is also 6, hence, it actually is L98723

  • If you want Regex functions in SQL Server, you need to use CLR objects. – Thom A Aug 01 '22 at 18:26
  • If you can't precisely describe what makes a valid CML code (other than being six characters) then how is a regex going to help? – shawnt00 Aug 01 '22 at 18:32
  • Those 6 characters is the only way to identify it. Sometimes it might mention "CML" sometimes, it may not but based on my tribal knowledge, I know those 6 characters will always be the code I want to extract. – Attitude Black Aug 01 '22 at 18:35
  • @Larnu, can you elaborate it briefly? Thanks – Attitude Black Aug 01 '22 at 18:35
  • There is no in built support for Regex on SQL Server, @AttitudeBlack . To be able to use Regex you have to create CLR functions so that you can use those instead. – Thom A Aug 01 '22 at 18:37
  • @Larnu, Alright. Thank-you. I will look into it. Appreciate it – Attitude Black Aug 01 '22 at 18:41
  • @AaronBertrand Alright let me take a look at it. Thanks – Attitude Black Aug 01 '22 at 19:11
  • @AaronBertrand Right, so the issue is that the notes section column is a manual entry and the people who write these notes sometimes just mentions these 6 characters which represents a vendor. The table that you see above is a mirror reflection of the actual data. And that is how it exists because it was manually entered. – Attitude Black Aug 01 '22 at 19:31
  • @AaronBertrand That is exactly what I am trying to figure out. I can't use a substring because they don't exist in a specific place within a string. Could be anywhere. From the tribal knowledge, all I know is if I see those 6 characters, that is a vendor and that is what I need to extract from a string of a certain length. – Attitude Black Aug 01 '22 at 19:57
  • @AaronBertrand That's a typo at my end. – Attitude Black Aug 01 '22 at 20:15
  • So how - in English, not in RegEx or anything else programmatic - are you going to differentiate between `B12345` and some kind of word like `BABBLE`? It's clear your codes can start with either a letter or a number, do they have to contain at least one of each to be valid? Are they always upper case? Can you share the magic that dictates how these codes are generated in the first place? – Aaron Bertrand Aug 01 '22 at 20:43
  • Also note that `WHERE [Notes] LIKE '[C][M][L]%'` is the same as `WHERE Notes LIKE 'CML%'`. Wrapping square brackets around single characters and column names that don't need it only helps in reducing readability. And that `LIKE` pattern obviously can't be "the right way to go about it" because in the previous sentence you said yourself that `CML` is not always there - so you should already be able to verify that that query won't return the row with `Wheels`. – Aaron Bertrand Aug 01 '22 at 20:44
  • @AaronBertrand Yes you are right in understanding that the 6 characters could start from either a number or an alphabet. All of the alphabet, whether it starts, end or is between numbers are in upper case These codes are taken directly from the Aircraft Manual (pre-determined) but entered in the notes sections manually. – Attitude Black Aug 01 '22 at 20:56
  • @AaronBertrand Yes you are right for the query part too that it will miss Wheels. That is what I am trying to figure out that how can I come up with an expression that helps filter it whether it mentions CML or not. In plain English, please filter 6 characters (given the criteria) in a "n" number of string from the notes column. – Attitude Black Aug 01 '22 at 20:57

1 Answers1

1

Assuming SQL Server 2016 or greater:

SELECT m.<columns>,
  Code = s.value
FROM Reporting.dbo.[Master] AS m
CROSS APPLY STRING_SPLIT(m.Notes, ' ') AS s
 WHERE LEN(s.value) = 6
 AND s.value LIKE '%' + REPLICATE('[0-9A-Z]', 6) + '%'
 AND s.value LIKE N'%[0-9]%'  -- make sure it has at least one digit
 AND s.value LIKE N'%[A-Z]%'; -- make sure it has at least one letter

If you're on a lower version, then create your own function (like this one), then replace STRING_SPLIT with dbo.SplitString.

Aaron Bertrand
  • 272,866
  • 37
  • 466
  • 490