Convert C# code to TSQL Stored Procedure

Question

I will be receiving a bunch of full paths to directories and files and need to build up a directory structure. This structure will be stored in SQL Server using an adjacency list in a table defined as:

CREATE TABLE [dbo].[DirTreeEntry] 
(
    [id] [int] IDENTITY(1,1) NOT NULL,
    [full_path] [nvarchar](2048) NOT NULL,
    [name] [nvarchar](255) NOT NULL,
    [is_file] [bit] NOT NULL,
    [is_root] [bit] NOT NULL,
    [parent_id] [int] NULL,
    [source_id] [int] NULL
)

With id being the primary key, name in this case being the immediate name, not the full path, and source_id referencing the source table entry if it explicitly exists there.

The source data will be in this table:

CREATE TABLE [dbo].[dir_path] 
(
     [dir_path_id] [int] IDENTITY(1,1) NOT NULL,
     [directory_path] [nvarchar](2048) NOT NULL,
     [name] [nvarchar](255) NOT NULL,
     [file_flag] [bit] NOT NULL,
     [filesize] [bigint] NULL,
     [create_date] [datetime] NOT NULL
)

Where the full path is actually the combination of directory_path and name.

Given the following entries in [dbo].[dir_path]:

1, '/root/subdir1/subdir2', 'subdir3', 0, NULL, '9/9/2014'
2, '/root/subdir1/subdir2/subdir3', 'somefile.txt', 1, 25, '9/9/2014'
3, '/etc/rc.d', 'rc.local', 1, 10, '9/9/2014'

I need to end up with this in [dbo].[DirTreeEntry]:

1,'/','/',0,1, NULL,NULL
2,'/root','root',0,0,1,NULL
3,'/root/subdir1','subdir1',0,0,2,NULL
4,'/root/subdir1/subdir2','subdir2',0,0,3,NULL
5,'/root/subdir1/subdir2/subdir3','subdir3',0,0,4,1
6,'/root/subdir1/subdir2/subdir3/somefile.txt','somefile.txt',1,0,5,2
7,'/etc','etc',0,0,1,NULL
8,'/etc/rc.d','rc.d',0,0,7,NULL
9,'/etc/rc.d/rc.local','rc.local',1,0,8,3

The following code from How to create hierarchical structure with list of path? does exactly what I'm looking for as far as building the hierarchy in C#:

public class Node
{
   private readonly IDictionary<string, Node> _nodes = 
      new Dictionary<string, Node>();

   public string Path { get; set; }
}

public void AddPath(string path)
{
   char[] charSeparators = new char[] {'\\'};

   // Parse into a sequence of parts.
   string[] parts = path.Split(charSeparators, 
       StringSplitOptions.RemoveEmptyEntries);

   // The current node.  Start with this.
   Node current = this;

   // Iterate through the parts.
   foreach (string part in parts)
   {
       // The child node.
       Node child;

       // Does the part exist in the current node?  If
       // not, then add.
       if (!current._nodes.TryGetValue(part, out child))
       {
           // Add the child.
           child = new Node {
               Path = part
           };

           // Add to the dictionary.
           current._nodes[part] = child;
       }

       // Set the current to the child.
       current = child;
   }
}

However, I could be getting source data with 100,000+ entries and I don't want to have to build up that structure in memory on the C# side and then have to send all of that data to SQL. I already have a quick way to get the source data into the database and I'm now in need of a stored procedure to build up the [dbo].[DirTreeEntry] table based on the source data.

Any guidance would be appreciated!

`current._nodes.TryGetValue(part, out child)` is a `SELECT`, `current._nodes[part] = child` is an `INSERT`, and `current = child` is an UPDATE. So, you can leave the logic, including the recursion, in c#, and work with an SQL table of nodes instead of a dictionary of nodes. Then they wouldn't all be loaded in memory and you wouldn't have to send it *en-masse*. — Keith Payne, Sep 09 '14 at 12:24
By the way, where is the recursion in the c# code? I don't see it. — Keith Payne, Sep 09 '14 at 12:26
You are right, it is not recursive. I confused what that method did with something else I was working on. I will update the question. — Tom, Sep 09 '14 at 12:29

score 1 · Accepted Answer · edited May 23 '17 at 12:21

This should do the trick. It can be run more then once, because it doesn't add anything which already exists, but it doesn't deal with concurrent updates.

DECLARE @ParentID as int
DECLARE @Path as nvarchar(2048)
DECLARE @Name as nvarchar(255)
DECLARE @LastPart as nvarchar(255)
DECLARE @PartialPath as nvarchar(2048)

-- Loop through the input table
DECLARE paths CURSOR FOR SELECT p.directory_path, p.name FROM dir_path p;
OPEN paths;

FETCH NEXT FROM paths INTO @Path, @LastPart;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Reset loop variables
    SET @ParentID = NULL
    SET @PartialPath = '';

    -- Split the full path into parts and loop through those
    DECLARE parts CURSOR FOR SELECT Value FROM dbo.splitstring(@Path+'/'+@LastPart, '/');
    OPEN parts;

    FETCH NEXT FROM parts INTO @Name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Build the path for this iteration
        SET @PartialPath = (CASE WHEN @PartialPath = '/' THEN '/'+@Name ELSE @PartialPath + '/' + @Name END);

        IF @Name = '' BEGIN 
        SET 
            @Name = '/' 
        END

        -- Insert the new path when it doesn't exist yet
        INSERT INTO DirTreeEntry (name, full_path, parent_id, is_file, is_root)
        SELECT @Name, @PartialPath, @ParentID, 0, 0
        WHERE NOT EXISTS (SELECT ID FROM DirTreeEntry WHERE full_path = @PartialPath)

        -- Store the id for the next level
        SELECT @ParentID = ID FROM DirTreeEntry WHERE full_path = @PartialPath;

        FETCH NEXT FROM parts INTO @Name;
    END
    CLOSE parts;
    DEALLOCATE parts;

    FETCH NEXT FROM paths INTO @Path,@LastPart;
END

CLOSE paths;
DEALLOCATE paths;

-- Update missing values in target table
UPDATE DirTreeEntry SET source_id = S.dir_path_id, is_file = S.file_flag
FROM DirTreeEntry T 
INNER JOIN dir_path S ON S.directory_path+'/'+S.name = T.full_path

UPDATE DirTreeEntry SET is_root = 1 WHERE full_path = '/'

EDIT: It can be done faster. Using an adaptation of this SplitString function:

CREATE FUNCTION [dbo].[SplitPath]
    (
        @List NVARCHAR(MAX)
    )
    RETURNS TABLE
    AS
        RETURN ( SELECT [Path], [Name] FROM 
          ( 
            SELECT 
              [Path] = LTRIM(RTRIM(SUBSTRING(@List, 0,
              (CASE WHEN CHARINDEX('/', @List + '/', [Number]) > 1 THEN CHARINDEX('/', @List + '/', [Number]) ELSE 2 END)))),
              [Name] = LTRIM(RTRIM(SUBSTRING(@List, [Number],
              (CASE WHEN CHARINDEX('/', @List + '/', [Number]) - [Number] > 0 THEN CHARINDEX('/', @List + '/', [Number])  - [Number] ELSE 1 END))))
            FROM (SELECT Number = ROW_NUMBER() OVER (ORDER BY name)
              FROM sys.all_objects) AS x
              WHERE Number <= LEN(@List)
              AND SUBSTRING('/' + @List, [Number], LEN('/')) = '/'
          ) AS y
        );

This spits out both the partial path and the name and fixes the '/' special case as well. This output can be cross-applied with the dir_path table creating the basic entries in DirTreeEntry. DirTreeEntry can then be enriched with the missing information.

INSERT DirTreeEntry (full_path, name, is_file, is_root)
SELECT DISTINCT d.Path, d.Name, 0, 0 FROM dir_path p 
CROSS APPLY dbo.SplitPath(directory_path+'/'+name) as d
ORDER BY d.Path

UPDATE DirTreeEntry SET source_id = S.dir_path_id, is_file = S.file_flag
FROM DirTreeEntry T 
INNER JOIN dir_path S ON S.directory_path+'/'+S.name = T.full_path

UPDATE DirTreeEntry SET is_root = 1 WHERE full_path = '/'

UPDATE A SET parent_id = B.id
FROM DirTreeEntry A INNER JOIN DirTreeEntry B ON CASE WHEN B.full_path = '/' THEN B.full_path + A.name ELSE B.full_path + '/' + A.name END = A.full_path

There may be room for further optimization, but this should be significantly faster.

This appears to be working so far and will probably be the accepted answer but I'm leaving question open a little longer to see if there's any clever set based approaches. Right now this is running soooo slow and I even have the final updates commented out to speed it up for now. Any optimization suggestions? — Tom, Sep 10 '14 at 10:19
I didn't look at performance at all, and I don't have a big set to test against. At least you should make the cursors FAST_FORWARD. The loops and the final update would benefit from an index on full_path, that might make quite a difference already. — AVee, Sep 10 '14 at 10:46
I have already added FAST_FORWARD to the cursors but I could not make an index on full_path because it exceeds 900bytes which is a SQL Server index limitation. I did however make one on "Name" and added that to the where clauses to narrow down the scan. This was a noticeable performance increase on a small set but my current run against a potential data set of around 150,000 is at 42min. and counting. — Tom, Sep 10 '14 at 11:23
I've added an optimized version, half a minute for 10k records on my system. Perhaps there's still room for improvement, but I'm done here :-) — AVee, Sep 10 '14 at 16:44
The optimized version is actually running a lot slower for me than the original. Not sure why but I will continue to look into it. Either way, thanks very much for your help! — Tom, Sep 11 '14 at 12:40

Convert C# code to TSQL Stored Procedure

1 Answers1