1

I need to find a continuous 6 or 7 digit number in a string from column name Filename. The string has other numbers in it with dashes(or another character, like an underscore), but I only need the continuous number

The StudentID needs to be extracted from the filename. (I know the data is just wow, multiple vendors, multiple file naming formats is the cause.) Another option would be to just list the starting position of the continuous number.

Desired outcome:

enter image description here

Actual outcome:

enter image description here

Test Code:

DROP TABLE #StuID

CREATE TABLE #StuID (
 FILENAME VARCHAR(MAX)
,StudentID INT
)

INSERT INTO #StuID
( FILENAME  )
VALUES
 ('Smith John D, 11-23-1980, 1234567.pdf')
,('Doe Jane, _01_22_1980_123456.pdf')
,('John Doe, 567891.pdf' )

--This is what I tried.

SELECT FILENAME
, substring(FileName, patindex('%[0-9][0-9][0-9][0-9][0-9][0-9]%', FileName), 8) AS StudentID
FROM #StuID
JM1
  • 1,595
  • 5
  • 19
  • 41
  • is it possible (I would assume yes) that one day you will get a file the dates in the filename not formatted with dashes or underscore character? If so, the solutions below will give you bad data. They will find a YYYYMMDD date or mmddyyyy date in your files before the student ID... if anyone were ever to send you a file named as such. It seems like the naming convention for the files is kinda loose. So, if that can happen the solutions so far will get tripped up. – jamie Aug 05 '19 at 23:43
  • I assume, that various students send in their homework and all of them land within a folder. Every student tries his best to name the file in the one or the other sensefull manner, but they do not stick to any naming rule provided by their professors ;-) – Shnugo Aug 06 '19 at 08:29
  • This is actually a one time, inherited issue. I the issue was the file names came from using different vendors without someone ensuring a standardized file name. Once all of the data needed from the file name is extracted, this hopefully won't be an issue anymore. Thanks to everyone for being kind and also helping! – JM1 Aug 06 '19 at 10:34

3 Answers3

3

Because you want 6 or 7 digits, case might be the simplest solution:

SELECT FILENAME,
       (CASE WHEN FileName LIKE '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
             THEN substring(FileName, patindex('%[0-9][0-9][0-9][0-9][0-9][0-9]%', FileName), 7)
             WHEN FileName LIKE '%[0-9][0-9][0-9][0-9][0-9][0-9]%'
             THEN substring(FileName, patindex('%[0-9][0-9][0-9][0-9][0-9]%', FileName), 6)
        END) AS StudentID                 
FROM #StuID
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thank you @GordonLinoff! I appreciate the original code that helped me start this, and also this answer that helps me finish this issue. I'm grateful for your help. – JM1 Aug 06 '19 at 10:55
2

Another approach I like a lot is a cast to XML and a XQuery filter:

WITH Casted([FileName],ToXml) AS
(
    SELECT [FILENAME] 
          ,CAST('<x>' + REPLACE(REPLACE(REPLACE([FILENAME],' ','</x><x>'),'.','</x><x>'),'_','</x><x>') + '</x>' AS XML)
    FROM #StuID
)
SELECT [FileName] 
      ,numbers.value('text()[1]','int')
FROM Casted
CROSS APPLY ToXml.nodes('/x[not(empty(. cast as xs:int?))]') A(numbers);

This will split the string in its fragments and return all fragments, which are numbers.

You can easily reduce the set to StudentIDs by using any convenient WHERE clause or you add to the XQuery filter the length of 6 or 7:

CROSS APPLY ToXml.nodes('/x[not(empty(. cast as xs:int?)) 
                            and (string-length(.)=6 or string-length(.)=7)]') A(numbers)

EDIT

This would be most on point:

CROSS APPLY ToXml.nodes('/x[. cast as xs:int? >= 100000 and . cast as xs:int? <10000000]') A(numbers)
Shnugo
  • 66,100
  • 9
  • 53
  • 114
  • Thank you @Shnugo! This does a great job breaking everything out. How would you modify it to pull out just the date as well? – JM1 Aug 06 '19 at 10:58
  • 1
    @JM1, As this is an entirely new question (and there is an accepted answer already) I'd ask you to start a new question and place a link here. – Shnugo Aug 06 '19 at 11:00
  • Thanks again for your help and suggestion, I put the new question here: https://stackoverflow.com/questions/57374764/tsql-find-a-date-with-varying-characters-in-a-string – JM1 Aug 06 '19 at 11:15
1

If you know that filetype is pdf then:

SELECT FILENAME
, substring(REPLACE(FileName, '.pdf',''), patindex('%[0-9][0-9][0-9][0-9][0-9][0-9]%', FileName), 8)
  AS StudentID
FROM #StuID;

db<>fiddle demo


More generic one (SQL Server 2017):

SELECT FILENAME
, substring(s.c, patindex('%[0-9][0-9][0-9][0-9][0-9][0-9]%', s.c), 8) AS StudentID
FROM #StuID
CROSS APPLY (SELECT trim(' !"#$%&\''()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~' FROM filename) AS c) s

db<>fiddle demo2

Lukasz Szozda
  • 162,964
  • 23
  • 234
  • 275
  • Thank you @LukaszSzozda! I appreciate your help. The .pdf answer you gave works well. The second answer gives me a multipart identifier error on the s.c part of the substring when hovering over it and when I run it, it says there's an incorrect syntax near the keyword 'FROM'. It works in SQL Fiddle though. It is on a 2016 server, not sure if that has anything to do with it.. Thank you for posting both solutions! – JM1 Aug 06 '19 at 11:04