Count total and unique filenames embedded in string

Question

I have a table with a column called ExcelLinks that contains records like this:

=INDEX('\\san1\engData[BT_500.0_Structural_Position.xls]Concrete'!$B$4:$IK$83,MATCH($K$9,'\\san1\engData[BT_500.0_Structural_Position.xls]Concrete'!$A$4:$A$83,0),MATCH(C212,'\\san1\engData[BT_500.0_Structural_Position.xls]Concrete'!$B$3:$IK$3,0))/1000000

=INDEX('\\san1\engData[GK_600.0_Pumps.xls]Pumps'!$B$4:$BD$39,MATCH($K$9,'\\san1\engData[TT_640.0_Generator.xls]Generator'!$A$4:$A$39,0),MATCH(C214,'\\san1\engData[GK_600.0_Pumps.xls]Pumps'!$B$3:$BD$3,0))/1000000

=INDEX('\\san1\engData[TT_640.0_Generator.xls]Generator'!$B$4:$HU$83,MATCH($K$9,'\\san1\engData[GK_600.0_Pumps.xls]Pumps'!$A$4:$A$83,0),MATCH(C218,'\\san1\engData[TT_640.0_Generator.xls]Generator'!$B$3:$HU$3,0))/1000000

The ideal output would be:

| Row | LinkCount | UniqueLinkCount |
| 1   | 3         | 1               |
| 2   | 3         | 2               |
| 3   | 3         | 2               |

I want to query this data and see the number of files and unique files used per record.

I did a search online and couldn't find anything that does this.

My current idea is to use a cursor and, for each record, find the substrings that start with \\ and end with '!$, then count them as files.

The hard part is the ExcelLinks values whose =INDEX and MATCH functions contain multiple links, which may point to different files.

There are over 12 million records in this table, so I am concerned about the performance of a cursor.

Oracle offers better ways to do this using regular expressions. I know that SQL Server doesn't have RegEx support, and I am willing to write or use a CLR stored procedure if that's the easiest option.

Is it essentially every time you see \\ that counts as a file? SQL Server doesn't support RegEx natively but you can implement it using CLR (and you certainly don't need a cursor for this in any event). — Aaron Bertrand, Jul 04 '12 at 01:27
Thanks @Aaron, I've tried to clarify exactly what I want, and I don't think I could pull it off by counting the number of `\\`'s. — Jeremy Thompson, Jul 04 '12 at 01:40

score 3 · Accepted Answer · edited Nov 13 '18 at 22:54

First, grab this [string splitting CLR function from Adam Machanic](http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx). Compile the code into a DLL (using csc if you don't have Visual Studio), copy the DLL to your server, and then register the DLL as follows (you'll have to replace some variable parts here, such as the file path and what you want to call the assembly):

CREATE ASSEMBLY CLRStuff 
  FROM 'C:\DLLs\CLRStuff.dll'  
  WITH PERMISSION_SET = SAFE;
GO

CREATE FUNCTION dbo.SplitStrings
(
   @List      NVARCHAR(MAX),
   @Delimiter NVARCHAR(255)
)
RETURNS TABLE ( Item NVARCHAR(4000) )
  EXTERNAL NAME CLRStuff.UserDefinedFunctions.SplitString_Multi;
GO
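One prerequisite the script above does not show: CLR integration is off by default on a SQL Server instance, so a CLR function can be created but will fail when executed until the setting is enabled. A minimal sketch, assuming you have permission to change server configuration and kept the names CLRStuff and dbo.SplitStrings from the script above (the sample string in the smoke test is made up):

```sql
-- Enable CLR integration (server-wide setting, off by default).
EXEC sys.sp_configure N'clr enabled', 1;
RECONFIGURE;
GO

-- Confirm the assembly was registered with the expected permission set.
SELECT name, permission_set_desc
  FROM sys.assemblies
 WHERE name = N'CLRStuff';

-- Smoke test: splitting on the two-character delimiter
-- should come back as one row per piece.
SELECT Item FROM dbo.SplitStrings(N'a!$b!$c', N'!$');
```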

With that in place, the query itself is quite easy. Let's create a simple table variable holding a few rows (I shortened the paths for brevity):

DECLARE @x TABLE(i INT, ExcelLink VARCHAR(MAX));

INSERT @x

    -- 3 files, 1 unique: 
    SELECT 1,'=INDEX(''\\san1\a.xls''!$B$4:$IK$83,MATCH($K$9,''\\san1\a.xls'
    + '''!$A$4:$A$83,0),MATCH(C212,''\\san1\a.xls''!$B$3:$IK$3,0))/1000000'

UNION ALL 

    -- 3 files, 3 unique:
    SELECT 2,'=INDEX(''\\san1\a.xls''!$B$4:$BD$39,MATCH($K$9,''\\san1\b.xls'
    + '''!$A$4:$A$39,0),MATCH(C214,''\\san1\c.xls''!$B$3:$BD$3,0))/1000000'

UNION ALL 

    -- 3 files, 2 unique:
    SELECT 3,'=INDEX(''\\san1\b.xls''!$B$4:$HU$83,MATCH($K$9,''\\san1\c.xls'
    + '''!$A$4:$A$83,0),MATCH(C218,''\\san1\c.xls''!$B$3:$HU$3,0))/1000000'

UNION ALL 

    -- 1 file, 1 unique:
    SELECT 4,'=INDEX(''\\san1\foo.xls''!$B$4:$HU$83,0)';

-- the above was just inserts; the remainder is all of the query:

;WITH x(i,part) AS 
(
  SELECT x.i, SUBSTRING(t.Item, CHARINDEX('''\\', t.Item), 2048) 
    FROM @x AS x CROSS APPLY dbo.SplitStrings(x.ExcelLink, '!$') AS t
)
SELECT i, [file_count] = COUNT(part), [unique_files] = COUNT(DISTINCT part)
  FROM x WHERE part LIKE '''\\%'
  GROUP BY i ORDER BY i;

Results:

i   file_count  unique_files
--  ----------  ------------
1   3           1
2   3           3
3   3           2
4   1           1

This relies on \\ not appearing naturally in the data other than as the beginning of a file path, on every file reference being followed by !$ (the delimiter used for the split), and on all file paths residing on a network share.
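Under that same assumption, if only the total count is needed (not the unique count), it can be had without any splitting by measuring how much shorter the string gets when the three-character marker '\\ is removed. This is a sketch of a swapped-in technique, not part of the query above; @s is a made-up sample value:

```sql
DECLARE @s VARCHAR(MAX) =
    '=INDEX(''\\san1\a.xls''!$B$4:$IK$83,MATCH($K$9,''\\san1\b.xls''!$A$4:$A$83,0))';

-- Each occurrence of the marker (quote + two backslashes) is 3 characters,
-- so the length difference divided by 3 is the number of occurrences.
-- DATALENGTH is used instead of LEN because LEN ignores trailing spaces.
SELECT file_count =
       (DATALENGTH(@s) - DATALENGTH(REPLACE(@s, '''\\', ''))) / 3;
-- returns 2 for this sample
```

With an NVARCHAR column the divisor would be 6, since DATALENGTH counts bytes rather than characters.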

This is probably not the most efficient approach you can get; I'm sure some RegEx wizard could improve on it by using regular expressions instead of splitting (here is a good article to get you started), but that's not my forte. A large portion of the cost is going to be the I/O required to scan the entire table, rather than the counting or the replacing.
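Since the table has to be scanned once anyway, one option is to run the query a single time against the real table and store the counts. A rough sketch, assuming a hypothetical source table dbo.Links with a key column LinkID, and a hypothetical target table dbo.LinkFileCounts (none of these names come from the question, only the ExcelLinks column does):

```sql
-- Same logic as the query above, set-based and without a cursor.
-- dbo.Links, LinkID and dbo.LinkFileCounts are placeholder names.
SELECT t.LinkID,
       file_count   = COUNT(p.part),
       unique_files = COUNT(DISTINCT p.part)
  INTO dbo.LinkFileCounts
  FROM dbo.Links AS t
  CROSS APPLY dbo.SplitStrings(t.ExcelLinks, '!$') AS s
  CROSS APPLY
  (
    -- keep only the text from the opening '\\ onward
    SELECT part = SUBSTRING(s.Item, CHARINDEX('''\\', s.Item), 2048)
  ) AS p
 WHERE p.part LIKE '''\\%'
 GROUP BY t.LinkID;
```

Rows that contain no file references at all will not appear in the target table, because the WHERE clause filters out every piece they produce.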

If you can't use CLR, you can replace that function with any of a number of non-CLR versions (here is an example that would be a functionally suitable replacement), but keep in mind that other approaches will likely perform worse.
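As an illustration of what a non-CLR route might look like, the sketch below skips the split function entirely and uses an ad hoc numbers list to locate each '\\ marker, then takes everything up to the next closing quote. It is not the replacement function referred to above, and it assumes every path is wrapped in single quotes and that no string is longer than 4000 characters:

```sql
DECLARE @x TABLE (i INT, ExcelLink VARCHAR(MAX));

INSERT @x (i, ExcelLink)
SELECT 1, '=INDEX(''\\san1\a.xls''!$B$4:$IK$83,MATCH($K$9,''\\san1\a.xls''!$A$4:$A$83,0))'
UNION ALL
SELECT 2, '=INDEX(''\\san1\a.xls''!$B$4:$BD$39,MATCH($K$9,''\\san1\b.xls''!$A$4:$A$39,0))';

;WITH n(n) AS
(
  -- ad hoc numbers 1..4000; a permanent numbers table is the usual choice
  SELECT TOP (4000) ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
),
parts(i, part) AS
(
  SELECT x.i,
         -- from the opening quote up to and including the next quote
         SUBSTRING(x.ExcelLink, n.n,
                   CHARINDEX('''', x.ExcelLink + '''', n.n + 1) - n.n + 1)
    FROM @x AS x
    INNER JOIN n ON n.n <= DATALENGTH(x.ExcelLink)
   WHERE SUBSTRING(x.ExcelLink, n.n, 3) = '''\\'  -- position starts a path
)
SELECT i, file_count = COUNT(part), unique_files = COUNT(DISTINCT part)
  FROM parts
 GROUP BY i ORDER BY i;
-- row 1: 2 files, 1 unique; row 2: 2 files, 2 unique
```

The join against the numbers list inspects every character position of every row, so on 12 million records this is likely to be considerably slower than the CLR splitter.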

That is so awesome, I copied the [string splitting CLR function from Adam Machanic](http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx), put it in a Class Library project > Compiled > put the DLL on the server and got it working. The only trick was targeting pre .Net 4.0 and then it registered a treat. Thanks very much. — Jeremy Thompson, Jul 04 '12 at 02:43
