1

I need to get attribute/column lineage from sql query.

Basically I meant to lineage is relation in Output to Input column. Let's see an example below.

I have two tables

Table Order

OrderID  CustomerID  OrderDate
10308     2          9/18/1996
10309     37         9/19/1996
10310     77         9/20/1996

Table Customer

CustomerID  CustomerName        ContactName    Country
 1            Riya             Maria Anders    Germany
 2            John             Ana Trujillo    Mexico
 3            Anthany          Antonio Moreno  Mexico

I read both the table as DataFrame using sparksession and converted to tempView

var oDF = sparkSession.
  read.
  format("csv").
  option("header", EdlConstants.TRUE).
  option("inferschema", EdlConstants.TRUE).
  option("delimiter", ",").
  option("decoding", EdlConstants.UTF8).
  option("multiline", true).
  load("C:\\Users\\tneja\\Trunk\\scripts\\el\\el-dt\\src\\test\\resources\\files\\order.csv")

println("smaple data in oDF====>")
oDF.show()

var cusDF = sparkSession.
  read.
  format("csv").
  option("header", EdlConstants.TRUE).
  option("inferschema", EdlConstants.TRUE).
  option("delimiter", ",").
  option("decoding", EdlConstants.UTF8).
  option("multiline", true).
  load("C:\\Users\\tneja\\Trunk\\scripts\\el\\el-dt\\src\\test\\resources\\files\\customer.csv")

println("smaple data in cusDF====>")
cusDF.show()


oDF.createOrReplaceTempView("orderTempView")
cusDF.createOrReplaceTempView("customerTempView")

Now, I am joining both the table and writing query on top of that and giving alias name in select, as shown below

val res = sqlContext.sql("select OID as OID_new, CID as CID_new from (select ot.OrderID as OID,ct.CustomerID as CID,ot.OrderID+ct.CustomerName as MID  from orderTempView as ot inner join customerTempView as ct on ot.CustomerID = ct.CustomerID)")
val analyzedPlan = res.queryExecution.analyzed

The analyzedPlan looks like below.

Project [OID#36 AS OID_new#39, CID#37 AS CID_new#40]
+- Project [OrderID#0 AS OID#36, CustomerID#15 AS CID#37, (cast(OrderID#0 as 
double) + cast(CustomerName#16 as double)) AS MID#38]
+- Join Inner, (CustomerID#1 = CustomerID#15)
  :- SubqueryAlias ot
  :  +- SubqueryAlias ordertempview
  :     +- Relation[OrderID#0,CustomerID#1,OrderDate#2] csv
  +- SubqueryAlias ct
     +- SubqueryAlias customertempview
        +- Relation[CustomerID#15,CustomerName#16,ContactName#17,Country#18] csv

Seems like this logical plan is tree structure.

So I need to extract attribute values for output and input.

So I need to fetch value something like below in Map to get the relation

Map("OID_NEW" -> "orderTempView.OrderID", "CID_NEW"->"customerTempView.CustomerID")

where OID_NEW is output column and orderTempView.OrderID is input column for it, same for others.

The problem is how to traverse this logical plan (analyzedPlan). How to extract data from it in useful way?

if somebody can help me to get an output like this from an analyzedPlan, its really appreciated. Thanks!

Tarun Khaneja
  • 451
  • 11
  • 23

0 Answers0