0

I will be receiving a bunch of full paths to directories and files and need to build up a directory structure. This structure will be stored in SQL Server using an adjacency list in a table defined as:

CREATE TABLE [dbo].[DirTreeEntry] 
(
    [id] [int] IDENTITY(1,1) NOT NULL,
    [full_path] [nvarchar](2048) NOT NULL,
    [name] [nvarchar](255) NOT NULL,
    [is_file] [bit] NOT NULL,
    [is_root] [bit] NOT NULL,
    [parent_id] [int] NULL,
    [source_id] [int] NULL
)

With id being the primary key, name in this case being the immediate name, not the full path, and source_id referencing the source table entry if it explicitly exists there.

The source data will be in this table:

CREATE TABLE [dbo].[dir_path] 
(
     [dir_path_id] [int] IDENTITY(1,1) NOT NULL,
     [directory_path] [nvarchar](2048) NOT NULL,
     [name] [nvarchar](255) NOT NULL,
     [file_flag] [bit] NOT NULL,
     [filesize] [bigint] NULL,
     [create_date] [datetime] NOT NULL
)

Where the full path is actually the combination of directory_path and name.

Given the following entries in [dbo].[dir_path]:

1, '/root/subdir1/subdir2', 'subdir3', 0, NULL, '9/9/2014'
2, '/root/subdir1/subdir2/subdir3', 'somefile.txt', 1, 25, '9/9/2014'
3, '/etc/rc.d', 'rc.local', 1, 10, '9/9/2014'

I need to end up with this in [dbo].[DirTreeEntry]:

1,'/','/',0,1, NULL,NULL
2,'/root','root',0,0,1,NULL
3,'/root/subdir1','subdir1',0,0,2,NULL
4,'/root/subdir1/subdir2','subdir2',0,0,3,NULL
5,'/root/subdir1/subdir2/subdir3','subdir3',0,0,4,1
6,'/root/subdir1/subdir2/subdir3/somefile.txt','somefile.txt',1,0,5,2
7,'/etc','etc',0,0,1,NULL
8,'/etc/rc.d','rc.d',0,0,7,NULL
9,'/etc/rc.d/rc.local','rc.local',1,0,8,3

The following code from How to create hierarchical structure with list of path? does exactly what I'm looking for as far as building the hierarchy in C#:

public class Node
{
   private readonly IDictionary<string, Node> _nodes = 
      new Dictionary<string, Node>();

   public string Path { get; set; }
}

public void AddPath(string path)
{
   char[] charSeparators = new char[] {'\\'};

   // Parse into a sequence of parts.
   string[] parts = path.Split(charSeparators, 
       StringSplitOptions.RemoveEmptyEntries);

   // The current node.  Start with this.
   Node current = this;

   // Iterate through the parts.
   foreach (string part in parts)
   {
       // The child node.
       Node child;

       // Does the part exist in the current node?  If
       // not, then add.
       if (!current._nodes.TryGetValue(part, out child))
       {
           // Add the child.
           child = new Node {
               Path = part
           };

           // Add to the dictionary.
           current._nodes[part] = child;
       }

       // Set the current to the child.
       current = child;
   }
}

However, I could be getting source data with 100,000+ entries and I don't want to have to build up that structure in memory on the C# side and then have to send all of that data to SQL. I already have a quick way to get the source data into the database and I'm now in need of a stored procedure to build up the [dbo].[DirTreeEntry] table based on the source data.

Any guidance would be appreciated!

Community
  • 1
  • 1
Tom
  • 1,179
  • 12
  • 28
  • `current._nodes.TryGetValue(part, out child)` is a `SELECT`, `current._nodes[part] = child` is an `INSERT`, and `current = child` is an UPDATE. So, you can leave the logic, including the recursion, in c#, and work with an SQL table of nodes instead of a dictionary of nodes. Then they wouldn't all be loaded in memory and you wouldn't have to send it *en-masse*. – Keith Payne Sep 09 '14 at 12:24
  • By the way, where is the recursion in the c# code? I don't see it. – Keith Payne Sep 09 '14 at 12:26
  • You are right, it is not recursive. I confused what that method did with something else I was working on. I will update the question. – Tom Sep 09 '14 at 12:29

1 Answers1

1

This should do the trick. It can be run more then once, because it doesn't add anything which already exists, but it doesn't deal with concurrent updates.

DECLARE @ParentID as int
DECLARE @Path as nvarchar(2048)
DECLARE @Name as nvarchar(255)
DECLARE @LastPart as nvarchar(255)
DECLARE @PartialPath as nvarchar(2048)

-- Loop through the input table
DECLARE paths CURSOR FOR SELECT p.directory_path, p.name FROM dir_path p;
OPEN paths;

FETCH NEXT FROM paths INTO @Path, @LastPart;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Reset loop variables
    SET @ParentID = NULL
    SET @PartialPath = '';

    -- Split the full path into parts and loop through those
    DECLARE parts CURSOR FOR SELECT Value FROM dbo.splitstring(@Path+'/'+@LastPart, '/');
    OPEN parts;

    FETCH NEXT FROM parts INTO @Name;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Build the path for this iteration
        SET @PartialPath = (CASE WHEN @PartialPath = '/' THEN '/'+@Name ELSE @PartialPath + '/' + @Name END);

        IF @Name = '' BEGIN 
        SET 
            @Name = '/' 
        END

        -- Insert the new path when it doesn't exist yet
        INSERT INTO DirTreeEntry (name, full_path, parent_id, is_file, is_root)
        SELECT @Name, @PartialPath, @ParentID, 0, 0
        WHERE NOT EXISTS (SELECT ID FROM DirTreeEntry WHERE full_path = @PartialPath)

        -- Store the id for the next level
        SELECT @ParentID = ID FROM DirTreeEntry WHERE full_path = @PartialPath;

        FETCH NEXT FROM parts INTO @Name;
    END
    CLOSE parts;
    DEALLOCATE parts;

    FETCH NEXT FROM paths INTO @Path,@LastPart;
END

CLOSE paths;
DEALLOCATE paths;

-- Update missing values in target table
UPDATE DirTreeEntry SET source_id = S.dir_path_id, is_file = S.file_flag
FROM DirTreeEntry T 
INNER JOIN dir_path S ON S.directory_path+'/'+S.name = T.full_path

UPDATE DirTreeEntry SET is_root = 1 WHERE full_path = '/'

EDIT: It can be done faster. Using an adaptation of this SplitString function:

CREATE FUNCTION [dbo].[SplitPath]
    (
        @List NVARCHAR(MAX)
    )
    RETURNS TABLE
    AS
        RETURN ( SELECT [Path], [Name] FROM 
          ( 
            SELECT 
              [Path] = LTRIM(RTRIM(SUBSTRING(@List, 0,
              (CASE WHEN CHARINDEX('/', @List + '/', [Number]) > 1 THEN CHARINDEX('/', @List + '/', [Number]) ELSE 2 END)))),
              [Name] = LTRIM(RTRIM(SUBSTRING(@List, [Number],
              (CASE WHEN CHARINDEX('/', @List + '/', [Number]) - [Number] > 0 THEN CHARINDEX('/', @List + '/', [Number])  - [Number] ELSE 1 END))))
            FROM (SELECT Number = ROW_NUMBER() OVER (ORDER BY name)
              FROM sys.all_objects) AS x
              WHERE Number <= LEN(@List)
              AND SUBSTRING('/' + @List, [Number], LEN('/')) = '/'
          ) AS y
        );

This spits out both the partial path and the name and fixes the '/' special case as well. This output can be cross-applied with the dir_path table creating the basic entries in DirTreeEntry. DirTreeEntry can then be enriched with the missing information.

INSERT DirTreeEntry (full_path, name, is_file, is_root)
SELECT DISTINCT d.Path, d.Name, 0, 0 FROM dir_path p 
CROSS APPLY dbo.SplitPath(directory_path+'/'+name) as d
ORDER BY d.Path

UPDATE DirTreeEntry SET source_id = S.dir_path_id, is_file = S.file_flag
FROM DirTreeEntry T 
INNER JOIN dir_path S ON S.directory_path+'/'+S.name = T.full_path

UPDATE DirTreeEntry SET is_root = 1 WHERE full_path = '/'

UPDATE A SET parent_id = B.id
FROM DirTreeEntry A INNER JOIN DirTreeEntry B ON CASE WHEN B.full_path = '/' THEN B.full_path + A.name ELSE B.full_path + '/' + A.name END = A.full_path

There may be room for further optimization, but this should be significantly faster.

Community
  • 1
  • 1
AVee
  • 3,348
  • 17
  • 17
  • This appears to be working so far and will probably be the accepted answer but I'm leaving question open a little longer to see if there's any clever set based approaches. Right now this is running soooo slow and I even have the final updates commented out to speed it up for now. Any optimization suggestions? – Tom Sep 10 '14 at 10:19
  • I didn't look at performance at all, and I don't have a big set to test against. At least you should make the cursors FAST_FORWARD. The loops and the final update would benefit from an index on full_path, that might make quite a difference already. – AVee Sep 10 '14 at 10:46
  • I have already added FAST_FORWARD to the cursors but I could not make an index on full_path because it exceeds 900bytes which is a SQL Server index limitation. I did however make one on "Name" and added that to the where clauses to narrow down the scan. This was a noticeable performance increase on a small set but my current run against a potential data set of around 150,000 is at 42min. and counting. – Tom Sep 10 '14 at 11:23
  • That is the same SplitString function that I am using. – Tom Sep 10 '14 at 16:03
  • I've added an optimized version, half a minute for 10k records on my system. Perhaps there's still room for improvement, but I'm done here :-) – AVee Sep 10 '14 at 16:44
  • The optimized version is actually running a lot slower for me than the original. Not sure why but I will continue to look into it. Either way, thanks very much for your help! – Tom Sep 11 '14 at 12:40