1

I need to find a continuous date in a string from column name Filename. The string has other numbers in it with dashes(or another character, like an underscore), but I only need the continuous number

The Date needs to be extracted from the filename. (I know the data is just wow, multiple vendors, multiple file naming formats is the cause.)

This question is similar to this question, but it's looking for something different with a different requirement: TSQL: Find a continuous number in a string

Desired result:

enter image description here

Actual Result:

enter image description here

Test Code:

DROP TABLE #dob

CREATE TABLE #dob (
 FILENAME VARCHAR(MAX)
,StudentID INT
)

INSERT INTO #dob
( FILENAME  )
VALUES
 ('Smith John D, 11-23-1980, 1234567.pdf')
,('Doe Jane, _01_22_1980_123456.pdf')
,('John Doe, 567891.pdf' )

--This is what I tried.

SELECT FILENAME
, substring(FileName, patindex('%[0-9][%-%][%_%][0-9][0-9][0-9][0-9][0-9]%', FileName), 8) AS dob
FROM #dob
JM1
  • 1,595
  • 5
  • 19
  • 41

3 Answers3

4

Try it like this:

DROP TABLE #StuID
GO
CREATE TABLE #StuID (
 FILENAME VARCHAR(MAX)
,StudentID INT
)

INSERT INTO #StuID
( FILENAME  )
VALUES
 ('Smith John D, 11-23-1980, 1234567.pdf')
,('Doe Jane, _01_22_1980_123456.pdf')
,('John Doe, 567891.pdf' );

WITH Casted([FileName],ToXml) AS
(
    SELECT [FILENAME] 
          ,CAST('<x>' + REPLACE(REPLACE(REPLACE(REPLACE(REPLACE([FILENAME],'.',' '),'-',' '),'_',' '),',',' '),' ','</x><x>') + '</x>' AS XML)
    FROM #StuID
)
SELECT [FileName] 
      ,numberFragments.value('/x[.>=1 and .<=31][1]','int') AS MonthFragment --using <=12 might bring back the second fragment twice...
      ,numberFragments.value('/x[.>=1 and .<=31][2]','int') AS DayFragment
      ,numberFragments.value('/x[.>=1960 and .<=2050][1]','int') AS YearFragment
      ,numberFragments.value('/x[.>=100000 and .<=10000000][1]','int') AS StudId
FROM Casted
CROSS APPLY (SELECT ToXml.query('/x[not(empty(. cast as xs:int?))]')) A(numberFragments);

The idea in short:

As in the previous answer we will break the string to a XML and filter for fragments castable to int.
The magic ist the XQuery-filtering:

  • We pick the first fragment between 1 and 31, which is the month hopefully
  • We pick the second fragment between 1 and 31 which is the day hopefully
  • We pick the first fragment between 1960 and 2050 which is the year hopefully
  • And we pick the student's id, which is the first fragment between 100000 and 10000000.

Hint: It looks like a nice idea to use <=12 for the month fragment, but I'd use the same filter for day and month to make sure, that we pick the first and the second fragment of the same value region...

Shnugo
  • 66,100
  • 9
  • 53
  • 114
  • Thanks @Shnugo! I was unaware of XQuery. It seems powerful. Outside of breaking strings down, do you regularly use it? If so, what typical things do you use it for? (Not sure if this is a good place to ask this question, if it's not, please let me know where a good place would be is.) Thanks again. – JM1 Aug 06 '19 at 11:43
  • 1
    @JM1 Well, no, a comment is not the best place to go in details about XQuery :-). This is very often used when you have to deal with XMLs. Besides the obvious, it allows for some pretty usefull hacks, where you somehow abuse XML's abilities to deal with data generically. You might [read some of my XQuery related answers](https://stackoverflow.com/search?q=user%3A5089204+%5Bxquery%5D%5Btsql%5D) to get a feeling for it... – Shnugo Aug 06 '19 at 11:48
  • 1
    @JM1 Another usage is to create an unknown set, and pass it as XML parameter to a function or a stored procedure. In [this answer](https://stackoverflow.com/a/39487565/5089204) I provided an approach to transform any SELECT into a HTML-table. In general one might say: This is always a hack and it won't perform great, but it allows to do things you might say they are impossible :-) – Shnugo Aug 06 '19 at 11:52
1

I don't think you have the pattern quite right. Also, you can use a CASE expression to return NULL:

SELECT FILENAME,
       (CASE WHEN FileName LIKE '%[0-9][0-9][-_][0-9][0-9][-_][0-9][0-9][0-9][0-9]%'
             THEN substring(FileName, patindex('%[0-9][0-9][-_][0-9][0-9][-_][0-9][0-9][0-9][0-9]%', FileName), 10)
        END) AS dob
FROM #dob;

You can also dispense with the CASE and use NULLIF():

   substring(FileName, NULLIF(patindex('%[0-9][0-9][-_][0-9][0-9][-_][0-9][0-9][0-9][0-9]%', FileName), 0), 10) as dob
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

Another method would be (after using PATINDEX to find the date) is force to string's format to MM/dd/yyyy and then use an explicit style for the conversion:

SELECT *,
       TRY_CONVERT(date,STUFF(STUFF(SUBSTRING(d.FILENAME,V.I, 10),3,1,'/'),6,1,'/'),101)
FROM #dob d
     CROSS APPLY (VALUES(NULLIF(PATINDEX('%[0-9][0-9]_[0-9][0-9]_[0-9][0-9][0-9][0-9]%',d.[FILENAME]),0))) V(I);
Thom A
  • 88,727
  • 11
  • 45
  • 75