1

I am trying to get the path from the root node to its children using SQL server.

The source data looks like :

Source Data

The target should look like :

Target Data

As I am going to implement this in an ETL tool exclusively using the ETL transformations, I would like to achieve this output without using CONNECT BY equivalent approach. Below query gets me the result and a few more records :

select case when level02.geography_02 is not NULL
    then '3'
    else case when level01.geography_02 is not null
                then '2'
                else case when root.geography_02 is not null
                    then '1'
                    end 
        end
end as levels,
root.geography_01 as root, root.geography_02 as super_parent,
case when level01.geography_02 is not null
        then level01.geography_02
        else ''
        end as parent,
case when level02.geography_02 is not null
        then level02.geography_02
        else ''
        end as child
from geo_table root
left join geo_table level01
on root.geography_02 = level01.geography_01
left join geo_table level02
on level01.geography_02 = level02.geography_01

Can you please on how to get the desired output?

Techblink
  • 15
  • 5
  • You say you want to achieve this with an ETL tool, however, the solution, or attempt, you have above is T-SQL based. if you want an ETL Tool solution, you should be tagging the ETL tool you're using. What's wrong with what you have above? Are the number of level known, or could there be more? – Thom A Mar 02 '20 at 12:47
  • Thanks for your reply. Once, I have the SQL ready, I am going to design this in Informatica. I have tagged this question with SQL server, because I know how to get this done using simple joins. Now, CONNECT BY in Oracle provides a way to achieve this quite simply. And I assume there would be a SQL server equivalent. However, there is no direct equivalent transformation for CONNECT by in Informatica, due to which I would have to use SQL overrides which I am hoping to avoid. – Techblink Mar 02 '20 at 12:55
  • Does this answer your question? https://stackoverflow.com/questions/959804/simulation-of-connect-by-prior-of-oracle-in-sql-server – Neville Kuyt Mar 02 '20 at 12:58
  • Above SQL is giving me a few extra records with incorrect levels : Levels Root Super Parent Parent Child 2 England London Greenwich 2 England London Mayfair 1 England Manchester 1 England Birmingham 1 London Greenwich 1 London Mayfair – Techblink Mar 02 '20 at 13:01
  • @techblink As you're recursing down a tree, you need to identify where to start from (your query starts from everywhere). You can hardcode the top level in to your query *(`WHERE root.geography_01 = 'United Kingdom'`)*, you can write a query to find which record(s) to start from *(Gordon's answer)*, or you can make it easier for yourself with an explicit top node *`(NULL, 'United Kingdom')`*, then just use *`WHERE root.geography_01 IS NULL`*. – MatBailie Mar 02 '20 at 13:29
  • As for CONNECT BY, you can use SQL Server to do recursive Common Table Expressions. https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-ver15#guidelines-for-defining-and-using-recursive-common-table-expressions – MatBailie Mar 02 '20 at 13:34
  • Finally, depending on what you need this for, you might find that Nested Sets are a better model for exploring down a tree (much faster, but awful for climbing back up a tree). *(Google nested sets as a personal learning exercise, it's very helpful for gaining a different perspective on sql structures, performance, etc)* – MatBailie Mar 02 '20 at 13:40
  • @MatBailie Hardcoding is not an option for me as there are going to be multiple root nodes each with their own tree structure coming from the source data. It will like one tree per country, just don't know how many countries will in source. Do you think I should still be able to use the explicit top node method you mentioned? – Techblink Mar 02 '20 at 13:42
  • 1
    Yes, you just have multiple top nodes, by starting from those (with NULL for their parent geography) you explore all the trees together. Either with hard coded joins or recursive CTEs. Nested sets also handles this perfectly well – MatBailie Mar 02 '20 at 14:23

1 Answers1

0

I think you just need some filtering. That said, the rest of your query can also be simplified a bit -- particularly by using COALESCE():

select (case when level02.geoghraphy_02 is not NULL then '3'
             when level01.geoghraphy_02 is not null then '2'
             when root.geoghraphy_02 is not null then '1'
        end) as levels,
       root.geoghraphy_01 as root,
       root.geoghraphy_02 as super_parent,
       coalesce(level01.geography_02, '') as parent,
       coalesce(level02.geography_02, '') as child
from geo_table root left join
     geo_table level01
     on root.geography_02 = level01.geography_01 left join
     geo_table level02
     on level01.geography_02 = level02.geography_01
where not exists (select 1
                  from geo_table gt
                  where gt.geography_02 = root.geography_01
                 );

Basically, you just need to limit the "root" to actual root records. You've actually handled the trickier part of the logic (in my opinion).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks @Gordon for the quick reply. It seems like this filter will do the job just fine. I'll check and get back to you. – Techblink Mar 02 '20 at 13:12