2

The Problem

I have a number of filename strings that I want to parse into columns using a tilda as delimiter. The strings take on the static format:

  • Filepath example C:\My Documents\PDF
  • Surname example Walker
  • First Name example Thomas
  • Birth Date example 19991226
  • Document Created Datetime example 20180416150322
  • Document Extension example .pdf

So a full concatenated example would be something like:

C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf

I want to ignore the file path and extension given in the string and only parse the following values into columns:

  • Surname, First Name, Birth Date, Document Created Datetime

So something like:

SELECT Surname = --delimitedString[0]
       FirstName = --delimitedString[1]
       --etc.

What I have tried

I know that I have several tasks I would need to perform in order to split the string, first I would need to trim off the extension and file path so that I can return a string delimited by tildas (~).

This is problem one for me, however problem 2 is splitting the new delimted string itself i.e.

Walker~Thomas~19991226~20180416150322

Ive had a good read through this very comprehensive question and It seems (as im using SQL Server 2008R2) the only options are to use either a function with loops or recursive CTE's or attempt a very messy attempt using SUBSTRING() with charIndex().

Im aware that If I had access to SQL Server 2016 I could use string_split but unfortunately I cant upgrade.

I do have access to SSIS but im very new to it so decided to attempt the bulk of the work within a SQL statement

Master Yoda
  • 4,334
  • 10
  • 43
  • 77
  • Don't use that code. It's not good on performance. These two blogs will server you far better. [Jeff Moden Splitter](http://www.sqlservercentral.com/articles/Tally+Table/72993/) and [Aaron Bertrand spit strings the right way](https://sqlperformance.com/2012/07/t-sql-queries/split-strings) – S3S Apr 16 '18 at 20:00
  • Why do you want to do this in TSQL instead of C# or Java or something similar? It's more appropriate and easier to do there. – user2023861 Apr 16 '18 at 20:01
  • What is the issue with charIndex() outside of code length? It seems like the most reasonable option. – Fritz Apr 16 '18 at 20:02
  • @scsimon Thanks for the links, im going to have a read through them now. – Master Yoda Apr 16 '18 at 20:05
  • @user2023861I have access to SSIS but wont be able to make use of .NET unfortunately, I just need to manipulate the string so that I can parse out the variables I need to use elsewhere. I updated the question to include SSIS as tags – Master Yoda Apr 16 '18 at 20:06
  • 1
    @Fritz Im concerned that Substring() along with CharIndex() can quickly turn into very unreadable code. If its my only option then i guess ill have to explore that route but for now im open to suggestions – Master Yoda Apr 16 '18 at 20:07
  • @MasterYoda I agree it’ll be harder to read and I’m not sure which would be better performance in your environment if the charindex does work better I’d just comment the code throughly – S3S Apr 16 '18 at 20:10
  • What is the role of SSIS here? Are you using it to insert these strings in the first place? If so it should be trivial to do the splitting there. – Martin Smith Apr 16 '18 at 20:14
  • @MasterYoda, all the strings only have 4 parts? if that's true, you may use substring() + PARSENAME() , it would be easier – EricZ Apr 16 '18 at 20:16
  • @MartinSmith Currently im using SSIS to open a connection to the DB and retrieve a filename for each row. From there I want to manipulate the filename to parse the string into multiple columns in my select statement for each row – Master Yoda Apr 16 '18 at 20:16
  • @EricZ Technically yes, but I also have to trim off the filepath at the beginning and the extension at the end – Master Yoda Apr 16 '18 at 20:18

3 Answers3

3

Here is a way without a splitter that shouldn't be too complicated...

declare @var table (filepath varchar(256))
insert into @var values
('C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf')


;with string as(
select 
    x = right(filepath,charindex('\',reverse(filepath))-1) 
    from @var
)

select
    SurName= substring(x,1,charindex('~',x) - 1)
    ,FirstName = substring(x,charindex('~',x) + 1,charindex('~',x) - 1)
from string
S3S
  • 24,809
  • 5
  • 26
  • 45
  • reverse tends to be expensive, but the string isn't terribly long so that's good. – S3S Apr 16 '18 at 20:34
  • @scsimon thanks for the example, out of the three answers im just wondering which would be least expensive on a larger dataset – Master Yoda Apr 16 '18 at 20:42
  • @MasterYoda the only way to tell would be to run them each, and record the times, on the *same* data set in your environment. We could do the same, but we'd need a realistic data set. Additionally, splitting the string still may be the best route via the links i set. If you choose to bench mark this, be sure to [clear the query plan cache](https://www.sqlskills.com/blogs/glenn/eight-different-ways-to-clear-the-sql-server-plan-cache/) so that you are using a fresh plan each time, versus a cached version which would speed up successive runs. – S3S Apr 16 '18 at 20:49
  • Additionally, keep in mind that if it's an argument between milliseconds, or even a few seconds, the difference could be in the `CTE` most of us used, which could be sped up by using a TEMP TABLE perhaps, or a derived table. Just be sure to check the query plan and if you have questions on that, and *why* one is faster than another, I'd post that on dba.stackexchange.com and be sure to add your query plan via https://www.brentozar.com/pastetheplan/ – S3S Apr 16 '18 at 20:51
  • 1
    @scsimon Thats 100% true, I guess ill have to test each one individually on a bigger dataset tomorrow. Thank you for your time and advice its very much appreciated. – Master Yoda Apr 16 '18 at 20:52
2

I know you mentioned wanting to avoid the charindex() option if at all possible, but I worked it out in a hopefully semi-readable way. I find it somewhat easy to read complex functions like this when I space each parameter on a different line and use indent levels. It's not the most proper looking, but it helps with legibility:

with string as (select 'C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf' as filepath)

select 
    substring(
        filepath,
        len(filepath)-charindex('\',reverse(filepath))+2,           --start location, after last '\'
        len(filepath)-                                              --length of path
            (len(filepath)-charindex('\',reverse(filepath))+2)-         --less characters up to last '\'
            (len(filepath)-charindex('.',filepath))                     --less file extention
        )
from string
Fritz
  • 624
  • 1
  • 7
  • 14
  • Thanks Fritz, part way there. That gives me the delimited string without the filepath and extension. Now i just need to parse the delimited string – Master Yoda Apr 16 '18 at 20:21
  • Oh, whoops. I totally forgot the splitting by tilde part. Is there a consistent number of tilde characters in each possible piece of data or does it vary? – Fritz Apr 16 '18 at 20:23
  • There will always be four parts: surname, forename, birthdate, document creation date – Master Yoda Apr 16 '18 at 20:24
  • Thanks Fritz, Im going to go with your answer as it was a great start followed up by EricZ's continuation. Nice job. I appreciate all of the help given by all of you and will be taking Simons advice on the performance side of things – Master Yoda Apr 17 '18 at 11:15
2

Fritz already have a great start, my answer just add on top it

with string as (select 'C:\My Documents\PDF\Walker~Thomas~19991226~20180416150322.pdf' as filepath)
, newstr as (
select 
    REPLACE(substring(
        filepath,
        len(filepath)-charindex('\',reverse(filepath))+2,           --start location, after last '\'
        len(filepath)-                                              --length of path
            (len(filepath)-charindex('\',reverse(filepath))+2)-         --less characters up to last '\'
            (len(filepath)-charindex('.',filepath))                     --less file extention
        ) , '~', '.') as new_part
from string
)
SELECT
      PARSENAME(new_part,4) as Surname,
      PARSENAME(new_part,3) as [First Name],
      PARSENAME(new_part,2) as [Birth Date], 
      PARSENAME(new_part,1) as [Document Created Datetime]
FROM newstr
EricZ
  • 6,065
  • 1
  • 30
  • 30