
Objective:

Let's think of a large-scale enterprise with heterogeneous data stores, such as SQL Server, NoSQL stores, and big data stores like ADL, ADF, etc., spread across different business groups.

Our objective is to build an enterprise-level lineage service that provides insights into the data lineage between different data store instances. This will help us gain insights into data proliferation within the enterprise.

For phase 1, we want to start with SQL Server and build the lineage within and across SQL Server instances and their databases.

Is there a way to build or extract (if available) the data lineage (at table and column level) from a SQL Server by leveraging SQL profiling or transaction logs, within or across instances?

I looked at Determining Impact and Data Lineage, and it looks like the SQL Server instance needs Dependency Services enabled.

As an enterprise-level service, I may have access to the SQL Servers but may not have control over the instances to enable any services.

Could anyone share any insights or experiences with building or extracting data lineage within or across SQL Servers?

Thanks in advance!

Mydeen

  • You can't find out where data came from by querying the database that it's in. You need to explicitly store that somewhere – Nick.Mc Feb 16 '18 at 07:46
  • Those links look pretty old. I'd be surprised if that's still a thing – Nick.Mc Feb 16 '18 at 07:53
  • I completely agree with your point that querying the db won't give any lineage info. But I am just looking for any OOB feature that I can leverage to build or extract the lineage information, without building a component that profiles the SQL executions and derives lineage information. I have updated my question to better explain myself. – Peer Mohamed Mydeen Feb 16 '18 at 12:55
  • I read some more on impact and data lineage and found an article dated 2010 so I guess it's gone by the wayside. – Nick.Mc Feb 16 '18 at 13:44
  • That's an interesting approach - analysing logs to sniff out data flow. If you proceeded further with this I would suggest something that has worked for me in the past: 1. Get a list of database tables, views, columns etc from system tables; 2. Search logs, ssis packages, anything that is text based, using these search terms. This works much better than picking up SQL and trying to decipher it. I used this approach in the past to build up a data dictionary for all of my packages and stored procedures. – Nick.Mc Feb 16 '18 at 13:47
  • I guess you could create a server side trace to capture the logs for analysis. Can I also suggest that you get in the habit of populating the _application_ property of any and all connection strings so you can identify what is connecting and doing stuff. It seems like you may not have control or know about some integrations, but if you do it helps a lot. For example you can identify which SSIS package is doing what. – Nick.Mc Feb 16 '18 at 13:49
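The text-search approach described in the comments above can be sketched roughly as follows. This is a hedged Python illustration, not a production tool: it assumes you have already pulled object names from the catalog views (e.g. `sys.tables` and `sys.columns`) and saved your SSIS packages, logs, and stored procedure scripts as plain text; the sample names and `find_references` helper are illustrative assumptions.

```python
import re

# Assume this list was produced by querying the catalog views, e.g.
#   SELECT name FROM sys.tables;
# (hypothetical sample data below)
objects = ["dept", "emp", "deptsal"]

def find_references(text, names):
    """Return the object names that appear as whole words in `text`."""
    hits = set()
    for name in names:
        # \b keeps 'emp' from matching inside identifiers like 'employee_temp'
        if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
            hits.add(name)
    return sorted(hits)

# Example: scan the text of an SSIS package or stored procedure script.
package_text = "SELECT d.deptno FROM dept d JOIN emp e ON e.deptno = d.deptno"
print(find_references(package_text, objects))  # ['dept', 'emp']
```

Searching extracted text for known object names, as suggested above, sidesteps full SQL parsing and also catches references inside SSIS package XML and log files.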

1 Answer


You can export the database structure into a SQL file that includes CREATE TABLE, CREATE VIEW, CREATE PROCEDURE, and other SQL statements, then use a SQL parser to analyze those scripts, extract the metadata, and build the data lineage of the tables/columns.

Take this SQL for example:

create view deptsal (dept_no, dept_name, salary)
as
SELECT d.deptno, 
       d.dname, 
       SUM(e.sal + Nvl(e.comm, 0)) AS sal 
FROM   dept d 
       left join (SELECT * 
                  FROM   emp 
                  WHERE  hiredate > DATE '1980-01-01') e 
              ON e.deptno = d.deptno 
GROUP  BY d.deptno, 
          d.dname;

You will get a data flow like: deptsal.salary depends on emp.sal and emp.comm.
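For real-world scripts you would want a proper SQL parser, but the idea can be illustrated with a deliberately naive Python sketch that handles only the CREATE VIEW shape above. The regexes and function names here are illustrative assumptions and will break on more complex SQL:

```python
import re

DDL = """
create view deptsal (dept_no, dept_name, salary)
as
SELECT d.deptno, d.dname, SUM(e.sal + Nvl(e.comm, 0)) AS sal
FROM dept d
     left join (SELECT * FROM emp
                WHERE hiredate > DATE '1980-01-01') e
            ON e.deptno = d.deptno
GROUP BY d.deptno, d.dname;
"""

def view_columns(ddl):
    """Target view name and columns from 'create view <name> (col, ...)'."""
    m = re.search(r"create\s+view\s+(\w+)\s*\(([^)]*)\)", ddl, re.IGNORECASE)
    name, cols = m.group(1), [c.strip() for c in m.group(2).split(",")]
    return name, cols

def source_tables(ddl):
    """Object names that directly follow FROM or JOIN (very naive)."""
    return sorted(set(re.findall(r"\b(?:from|join)\s+(\w+)", ddl, re.IGNORECASE)))

name, cols = view_columns(DDL)
print(name, cols)          # deptsal ['dept_no', 'dept_name', 'salary']
print(source_tables(DDL))  # ['dept', 'emp']
```

From the target columns and source tables you can then record edges such as deptsal.salary ← emp.sal, emp.comm; mapping each output column back to its exact input columns is the part that genuinely needs a full parser.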

James