Recursive self join over file data

Question

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:

 ID   |   Value    | Parent id
 -----------------------------

But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.

 ID   |  Line            | 
 -------------------------
 1    | 3,Formula,1,2,3,4,...
 2    | *,record,abc,efg,hij,...
 3    | ,,1,x,y,z,...
 4    | ,,2,q,r,s,...
 5    | 3,Formula,5,6,7,8,...
 6    | *,record,lmn,opq,rst,...
 7    | ,,1,t,u,v,...
 8    | ,,2,l,m,n,...

Essentially, its a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes which is why the object is broken up across multiple sections in the CSV file.

What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:

 ID   |  Line            | 
 -------------------------
 1    | 3,Formula,1,2,3,4,5,6,7,8,...
 2    | *,record,abc,efg,hij,lmn,opq,rst
 3    | ,,1,x,y,z,t,u,v,...
 4    | ,,2,q,r,s,l,m,n,...

I know its probably possible, I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns in a view, and use the view as a way of joining in a repeated fashion on those two columns. However, I have some problems:

I don't know how many sections will occur in the file for the same object
The file can contain other objects as well so joining on the first two columns would be problematic if you have something like

 ID   |  Line            | 
 -------------------------
 1    | 3,Formula,1,2,3,4,...
 2    | *,record,abc,efg,hij,...
 3    | ,,1,x,y,z,...
 4    | ,,2,q,r,s,...
 5    | 3,Formula,5,6,7,8,...
 6    | *,record,lmn,opq,rst,...
 7    | ,,1,t,u,v,...
 8    | ,,2,l,m,n,...
 9    | ,4,Data,1,2,3,4,...
 10   | *,record,lmn,opq,rst,...
 11   | ,,1,t,u,v,...

In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.

UPDATE

I know this is somewhat confusing. I tried doing this with C# a while back, but I had to basically write a recursive decent parser to parse the specific file format and it simply took to long because I had to get it in the database afterwards and it was too much for entity framework. It was taking hours just to convert one file since these files are excessively large.

Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):

+----+------------+------------------------------------------+-----------------------+
| ID | header     | x                                        | value                                         
|
+----+------------+------------------------------------------+-----------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8                         |3,Formula,1,2,3,4,5,6,7,8                     |
| 2  | ,,         | ,1,x,y,z,t,u,v                           | ,1,x,y,z,t,u,v                    |
| 3  | ,,         | ,2,q,r,s,l,m,n                           | ,2,q,r,s,l,m,n                    | 
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst             |*,record,abc,efg,hij,lmn,opq,rst           |
| 5  | ,4,        | ,Data,1,2,3,4                            |,4,Data,1,2,3,4                               |
| 6  | *,record,  | ,lmn,opq,rst                             | ,lmn,opq,rst                                  |
| 7  | ,,         | ,1,t,u,v                                 | ,1,t,u,v                      |
+----+------------+------------------------------------------+-----------------------------------------------+

I'm sure someone will post *a way* to do this, but I would consider handling this with a scripting language outside of SQL. — S3S, Apr 26 '18 at 14:31
I cant' follow the logic. For instance, you have "separate out the record # column,". There is no `record #` column in the data as far as I can see. — Gordon Linoff, Apr 26 '18 at 14:35
@GordonLinoff There is. The record number column is in the CSV list in the file line. In line 6, you see *,record,lmn... which means the record number placement is in the second column. Line 7 and 8 are the rows for object in line 6. So in line 7 and 8, the record # would be 1 and 2 respectively. — Jake, Apr 26 '18 at 14:53
That is a lot of trust in commas, tell me, is there never the case where I could have data that includes an oxford comma? You know in case someone is helping their uncle Jack, off his horse... — Random_User, Apr 26 '18 at 14:55

Tab Alleman · Answer 1 · 2018-04-26T15:23:15.023

I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.

You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.

One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:

3,Formula,
*,record,
,,1,
,,2,
,4,Data,

Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.

Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.

How you get the ID column depends on what you want, for instance in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrance of that sub-string? Code accordingly. If you want 5 it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.

BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.

Nolan Shang · Answer 2 · 2018-04-26T15:36:06.357

Here is a sample, but has some different with you need. It is because I use the value the second comma as group header, so the ,,1 and ,,2 will be treated as same group, if you can use a parent id to indicated a group will be better

    DECLARE  @testdata TABLE(ID int,Line varchar(8000))
    INSERT INTO @testdata
        SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL 
        SELECT 2,'*,record,abc,efg,hij,...'  UNION ALL 
        SELECT 3,',,1,x,y,z,...'  UNION ALL 
        SELECT 4,',,2,q,r,s,...'  UNION ALL 
        SELECT 5,'3,Formula,5,6,7,8,...'  UNION ALL 
        SELECT 6,'*,record,lmn,opq,rst,...'  UNION ALL 
        SELECT 7,',,1,t,u,v,...'  UNION ALL 
        SELECT 8,',,2,l,m,n,...'  UNION ALL 
        SELECT 9,',4,Data,1,2,3,4,...'  UNION ALL 
        SELECT 10,'*,record,lmn,opq,rst,...'  UNION ALL 
        SELECT 11,',,1,t,u,v,...'
    ;WITH t AS(
    SELECT *,REPLACE(SUBSTRING(t.Line,LEN(c.header)+1,LEN(t.Line)),',...','')  AS data
    FROM @testdata AS t
    CROSS APPLY(VALUES(LEFT(t.Line,CHARINDEX(',',t.Line, CHARINDEX(',',t.Line)+1 )))) c(header)
    )
    SELECT MIN(ID) AS ID,t.header,c.x,t.header+STUFF(c.x,1,1,'') AS value
    FROM t
    OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('') ) c(x)
    GROUP BY t.header,c.x

+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header     | x                                        | value                                         |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8                         | 3,Formula,1,2,3,4,5,6,7,8                    |
| 3  | ,,         | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v    |
| 2  | *,record,  | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst     | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9  | ,4,        | ,Data,1,2,3,4                            | ,4,Data,1,2,3,4                              |
+----+------------+------------------------------------------+-----------------------------------------------+

Recursive self join over file data

2 Answers2