0

I'm having a little problem getting SQL Server returning the same results as Oracle for our SQL tree query.

We have a link table with attributes

CHILD_ID, LINK_ID, PARENT_ID

and a element table with ELEMENT_ID

In Oracle we use this

SELECT * FROM LINKS
JOIN ELEMENTS ON ELEMENT_ID = CHILD_ID
WHERE LINK_ID IN 
(SELECT LINK_ID FROM LINKS CONNECT BY PRIOR CHILD_ID = PARENT_ID START WITH PARENT_ID = 'startid')

In SQL Server we use this

WITH TREE_LINKS AS
(SELECT CHILD_ID, LINK_ID FROM LINKS WHERE PARENT_ID = 'startid'
UNION ALL
SELECT CURRENT_LINKS.CHILD_ID, CURRENT_LINKS.LINK_ID
FROM LINKS CURRENT_LINKS
INNER JOIN TREE_LINKS t1 ON CURRENT_LINKS.PARENT_ID = t1.CHILD_ID)

SELECT * FROM TREE_LINKS
INNER JOIN LINKS ON TREE_LINKS.LINK_ID = LINKS.LINK_ID 
INNER JOIN ELEMENTS ON ELEMENTS.ELEMENT_ID = TREE_LINKS.CHILD_ID

This works perfectly fine apart from 1 issue.

In Oracle we only get each unique link, based on LINK_ID. In SQL Server we get all the links describing the full tree which can include duplicates when 1 element exists below multiple other elements in different branches of the structure.

This means we get a large amounts of duplicate rows from SQL Server. I tested adding

SELECT DISTINCT TREE_LINKS.LINK_ID AS TREE_LINK_ID, * from TREE_LINKS 

in the last select statement but it just takes as long as the server has more work to remove the duplicates if found in different branches.

In 1 test case at the moment we have Oracle returning 20,000 rows and SQL Server returning 1.6 million rows. So far I have found no way to get SQL Server returning the same results as fast.

FYI : Adding DISTINCT in the recursion causes

DISTINCT operator is not allowed in the recursive part of a recursive common table expression 'TREE_LINKS'.

Edit: - An example

If we have links like this

PARENT_ID, LINK_ID, CHILD_ID
1          1        2
2          2        3 
3          3        4
1          4        3

There are 4 unique elements, but there are 6 elements in the complete tree. This is because there are 2 paths to element 3

Neil Wightman
  • 1,103
  • 2
  • 9
  • 29
  • You might be getting dup children from the innerjoins, try the code I gave before adding distinct to see the difference first. Plus, look up "inner join". If two children have the same parent, then you'll get multiple records of the same parent with each child. – craniumonempty Dec 07 '11 at 15:21
  • I accidentally put a PARENT_ID where there should have been LINK_ID in one of the statements. Just in case you are testing it. – craniumonempty Dec 07 '11 at 15:49
  • Do you have any link nodes which have child_id == parent_id? Because I can see that as creating a problem for the sql-server syntax as there is no stop statement for the recursion for that. Maybe add after "WHERE TREE_LINKS.CHILD_ID = NLINKS.PARENT_ID" this "AND TREE_LINKS.LINK_ID <> NLINKS.LINK_ID" just to make sure, because as I think of it, there shouldn't be duplicates in that part. If there were, that would create them in the joins you have as well... I'm going to add that to my code just in case. – craniumonempty Dec 07 '11 at 18:04
  • CHILD_ID can never match PARENT_ID – Neil Wightman Dec 08 '11 at 08:51
  • Hmm, when 1 is startid, you have from 1 >-2-3-4 >-3-4. Ok, from that the syntax for sqlserver gives you two parents right away (parent_id = 1 gives 2 elements), so that's might be why it's giving you something different... I'm going to look it over again in a bit. – craniumonempty Dec 08 '11 at 10:01
  • I don't suppose you could set up a small table like that and test it on the first part (just getting link_id) to see what sqlserver is doing. After looking it over, it should return 5 links because the parent isn't included. The parent not being included shouldn't be a problem though, because you have more data coming back not less. The only thing I can think is happening (without further info) is that there is a loop in the chain that Oracle tests for. – craniumonempty Dec 08 '11 at 10:18
  • Hmm, well looked on the oracle site and they said "If the CONNECT BY condition results in a loop in the hierarchy, then Oracle returns an error. A loop occurs if one row is both the parent (or grandparent or direct ancestor) and a child (or a grandchild or a direct descendent) of another row." http://docs.oracle.com/cd/B13789_01/server.101/b10759/queries003.htm . So Oracle would throw an error in that case. The problem is most likely on sql-server side. Just not sure where yet. – craniumonempty Dec 08 '11 at 10:46
  • Oh, it worked... I was racking my brain with this. Not being able to test doesn't help much. Can you match the data to oracle to make sure? – craniumonempty Dec 08 '11 at 11:14
  • Good. I'm going to remove that statement from the answer since it wasn't needed. It's more of a precaution, but if you restricted the data coming in, then it's really doing nothing. Glad it worked out for you. – craniumonempty Dec 08 '11 at 11:41
  • You probably already have indexes, but just asking to make sure: do you have your fields indexed. Even with that high of a return, it shouldn't take a minute (well, unless you on on the computer I'm using). Just to check can you run "SELECT COUNT(*) FROM LINKS CONNECT BY PRIOR CHILD_ID = PARENT_ID START WITH PARENT_ID = 'startid'" to see the number that oracle is returning on the same data? I have a feeling they are getting the same amount as well in that part, but don't know for certain. – craniumonempty Dec 08 '11 at 12:09

1 Answers1

1
SELECT LINK_ID
  FROM LINKS
  START WITH PARENT_ID = 'startid'
  CONNECT BY PRIOR CHILD_ID = PARENT_ID

should be equivalent to

EDIT BELOW (AND in first edit): changed a PARENT_ID to LINK_ID in the second select

WITH TREE_LINKS(CHILD_ID, LINK_ID) AS 
   (SELECT CHILD_ID, LINK_ID
    FROM LINKS
    WHERE PARENT_ID = 'startid'
        UNION ALL
    SELECT NLINKS.CHILD_ID, NLINKS.LINK_ID
    FROM LINKS as NLINKS, TREE_LINKS
    WHERE TREE_LINKS.CHILD_ID = NLINKS.PARENT_ID)

SELECT LINK_ID FROM TREE_LINKS

according to Simulation of CONNECT BY PRIOR of ORACLE in SQL SERVER

The rest should be similar from what I can tell.

EDIT: maybe something like:

WITH TREE_LINKS(CHILD_ID, LINK_ID) AS 
   (SELECT CHILD_ID, LINK_ID
    FROM LINKS
    WHERE PARENT_ID = 'startid'
        UNION ALL
    SELECT NLINKS.CHILD_ID, NLINKS.LINK_ID
    FROM LINKS as NLINKS, TREE_LINKS
    WHERE TREE_LINKS.CHILD_ID = NLINKS.PARENT_ID)

SELECT * FROM LINKS as TLINKS
JOIN ELEMENTS ON ELEMENT_ID = TLINKS.CHILD_ID
WHERE TLINKS.LINK_ID IN
(SELECT TREE_LINKS.LINK_ID FROM TREE_LINKS)

but I don't have mssql or oracle to test it on currently

EDIT: removed "AND TREE_LINKS.LINK_ID <> NLINKS.LINK_ID" from query as it was not needed.

Community
  • 1
  • 1
craniumonempty
  • 3,525
  • 2
  • 20
  • 18
  • This returns duplicates in both cases. I.e if 2 elements in different branches of the stricture have the same child and this child has many children, you get these children links multiple times. Maybe doing an DISTINCT on this query will be faster than what I have at the moment though. – Neil Wightman Dec 07 '11 at 15:15
  • Your SQL statement appears to work well. It gets the 20,000 results in around 1 minute. 5 times faster that getting the full 1.6 million and then filtering with distinct. I need to do a little more testing. – Neil Wightman Dec 08 '11 at 10:01
  • Ok. The last query is fast, in fact the "AND TREE_LINKS.LINK_ID <> NLINKS.LINK_ID" isnt needed and makes it slightly slower. This SQL is virtually the same as mine though and I couldnt understand why yours was so much faster and returned the 21000 and not 1.6 milltion entries, then I figured it out. In your SQL the very last IN clause has the side effect of filtering duplicates very efficiently. The TREE_LINKs still returns 1.6 million rows, but the IN clause filters the results to just the 21,000 in just 58 seconds. – Neil Wightman Dec 08 '11 at 11:27
  • I just confirmed with the query planner that the tree_links still contains 1.6 million rows and is the 98% of the 1 minute now. – Neil Wightman Dec 08 '11 at 11:56