I need to get attribute/column lineage from sql query.
Basically I meant to lineage is relation in Output to Input column. Let's see an example below.
I have two tables
Table Order
OrderID CustomerID OrderDate
10308 2 9/18/1996
10309 37 9/19/1996
10310 77 9/20/1996
Table Customer
CustomerID CustomerName ContactName Country
1 Riya Maria Anders Germany
2 John Ana Trujillo Mexico
3 Anthany Antonio Moreno Mexico
I read both the table as DataFrame using sparksession and converted to tempView
var oDF = sparkSession.
read.
format("csv").
option("header", EdlConstants.TRUE).
option("inferschema", EdlConstants.TRUE).
option("delimiter", ",").
option("decoding", EdlConstants.UTF8).
option("multiline", true).
load("C:\\Users\\tneja\\Trunk\\scripts\\el\\el-dt\\src\\test\\resources\\files\\order.csv")
println("smaple data in oDF====>")
oDF.show()
var cusDF = sparkSession.
read.
format("csv").
option("header", EdlConstants.TRUE).
option("inferschema", EdlConstants.TRUE).
option("delimiter", ",").
option("decoding", EdlConstants.UTF8).
option("multiline", true).
load("C:\\Users\\tneja\\Trunk\\scripts\\el\\el-dt\\src\\test\\resources\\files\\customer.csv")
println("smaple data in cusDF====>")
cusDF.show()
oDF.createOrReplaceTempView("orderTempView")
cusDF.createOrReplaceTempView("customerTempView")
Now, I am joining both the table and writing query on top of that and giving alias name in select, as shown below
val res = sqlContext.sql("select OID as OID_new, CID as CID_new from (select ot.OrderID as OID,ct.CustomerID as CID,ot.OrderID+ct.CustomerName as MID from orderTempView as ot inner join customerTempView as ct on ot.CustomerID = ct.CustomerID)")
val analyzedPlan = res.queryExecution.analyzed
The analyzedPlan looks like below.
Project [OID#36 AS OID_new#39, CID#37 AS CID_new#40]
+- Project [OrderID#0 AS OID#36, CustomerID#15 AS CID#37, (cast(OrderID#0 as
double) + cast(CustomerName#16 as double)) AS MID#38]
+- Join Inner, (CustomerID#1 = CustomerID#15)
:- SubqueryAlias ot
: +- SubqueryAlias ordertempview
: +- Relation[OrderID#0,CustomerID#1,OrderDate#2] csv
+- SubqueryAlias ct
+- SubqueryAlias customertempview
+- Relation[CustomerID#15,CustomerName#16,ContactName#17,Country#18] csv
Seems like this logical plan is tree structure.
So I need to extract attribute values for output and input.
So I need to fetch value something like below in Map to get the relation
Map("OID_NEW" -> "orderTempView.OrderID", "CID_NEW"->"customerTempView.CustomerID")
where OID_NEW is output column and orderTempView.OrderID is input column for it, same for others.
The problem is how to traverse this logical plan (analyzedPlan). How to extract data from it in useful way?
if somebody can help me to get an output like this from an analyzedPlan, its really appreciated. Thanks!