Questions tagged [data-lineage]
62 questions
14
votes
4 answers
What is Lineage In Spark?
How does lineage help to recompute data?
For example, I have several nodes, each computing data for 30 minutes. If one fails after 15 minutes, can lineage be used to recover the data processed in those 15 minutes without spending another 15 minutes recomputing it?

Gaurav Dubey
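
The recomputation idea raised in the question above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: `ToyDataset`, `compute`, and `lose` are hypothetical names invented for illustration. It assumes that lineage is simply a record of the parent dataset and the function applied to it, so a lost partition can be rebuilt by replaying that record.

```python
class ToyDataset:
    """A toy stand-in for an RDD: partitions are derived lazily from a parent."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # list of partitions (root dataset only)
        self.parent = parent      # lineage: the dataset this one derives from
        self.fn = fn              # lineage: how each element is transformed
        self.materialized = {}    # partition index -> computed results
        self.compute_calls = []   # log of partition indexes computed here

    def map(self, fn):
        # Record the transformation instead of running it immediately.
        return ToyDataset(parent=self, fn=fn)

    def compute(self, index):
        """Return one partition, replaying the lineage if it is not materialized."""
        if index in self.materialized:
            return self.materialized[index]
        if self.parent is None:
            data = list(self.source[index])
        else:
            data = [self.fn(x) for x in self.parent.compute(index)]
        self.compute_calls.append(index)
        self.materialized[index] = data
        return data

    def lose(self, index):
        """Simulate a node failure that drops one partition's results."""
        self.materialized.pop(index, None)


root = ToyDataset(source=[[1, 2], [3, 4], [5, 6]])
doubled = root.map(lambda x: x * 2)
first_run = [doubled.compute(i) for i in range(3)]

doubled.lose(1)                 # pretend the node holding partition 1 failed
recovered = doubled.compute(1)  # only partition 1 is rebuilt from its lineage
```

In this toy model only the lost partition is rebuilt, and it is rebuilt from the start of its transformation rather than resumed halfway. Partitions that survived, including the parent's, are reused as they are.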
8
votes
3 answers
How to Monitor/inspect data/attribute flow in Java code
I have a use case where I need to capture the data flow from one API to another. For example, my code reads data from a database using Hibernate, and during the data processing I convert one POJO to another and perform some more processing and then…

M.J.
7
votes
0 answers
Enabling Hive Lineage
I thought Hive lineage was not available, but after some research I have found that it can be enabled. Some of the things I found while searching were ways of enabling its lineage via either Cloudera Manager or IBM InfoSphere, which I am not interested in.…

Pablo Ochoa
6
votes
2 answers
How can I perform data lineage in GCP?
When we build a data lake on GCP Cloud Storage and process data with Cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP?

Raghavendra Prakash
3
votes
0 answers
AWS Glue - Data Lineage and Job Tracking
Is there a way to track what each job we create in AWS Glue is doing? For example, can we tell whether jobs doing the same action have been created twice, or see the lineage of data as it goes through each transformation?

Shilpa Majumdar
2
votes
0 answers
Create data lineage on YugabyteDB through Apache Atlas
Not many resources are available online, but I want to create a data lineage system for data sourced from YugabyteDB through Apache Atlas. Any pointers are appreciated.
For example, below is the process that I have:
[TABLE A] --python function--> [TABLE…

Tapas Kumar Pradhan
2
votes
1 answer
Apache Spark dataframe lineage trimming via RDD and role of cache
The following trick trims Apache Spark DataFrame lineage, especially for iterative computations:
def getCachedDataFrame(df: DataFrame): DataFrame = {
  val rdd = df.rdd.cache()
  df.sqlContext.createDataFrame(rdd, df.schema)
}
It…

alexanoid
2
votes
2 answers
Python Recursive Function from a 2 column Dataframe
I have the table below that I read into a DataFrame:
n,next_n
1,2
1,3
1,6
2,4
2,8
3,5
3,9
4,7
9,10
My recursive function should return multiple lists of numbers through the end.
For example if I select to see all the values associated with 9, I…

Jose
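
One way to read the question above is as path enumeration over the `n -> next_n` pairs. The excerpt is cut off before it states the expected output, so the following is a minimal sketch under that assumption: it returns every chain of values from a chosen start until a value with no `next_n`. The names `EDGES`, `build_children`, and `paths_from` are hypothetical, and the sketch assumes the pairs contain no cycles.

```python
from collections import defaultdict

# Edge list copied from the question's table (n -> next_n).
EDGES = [(1, 2), (1, 3), (1, 6), (2, 4), (2, 8), (3, 5), (3, 9), (4, 7), (9, 10)]


def build_children(edges):
    """Map each n to the list of its next_n values, in input order."""
    children = defaultdict(list)
    for n, next_n in edges:
        children[n].append(next_n)
    return children


def paths_from(start, children):
    """Recursively collect every path from start down to a value with no next_n."""
    if not children.get(start):
        return [[start]]  # base case: nothing follows this value
    paths = []
    for child in children[start]:
        for tail in paths_from(child, children):
            paths.append([start] + tail)
    return paths


children = build_children(EDGES)
print(paths_from(9, children))
print(paths_from(1, children))
```

If the pairs live in a pandas DataFrame with exactly these two columns, `list(df.itertuples(index=False, name=None))` should yield an equivalent edge list.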
2
votes
2 answers
What are the options when it comes to handling Data Lineage in Snowflake?
Any ideas/options about handling Data Lineage in Snowflake? We are following a microservice architecture in which we are running a set of stored procedures that contain quite a few SQL queries as soon as certain events are triggered.
Example: When…

Pantelis Parastatidis
2
votes
0 answers
SQL Data Lineage Viewer
I was wondering if anyone has seen an open source tool that does something like this:
https://gudusoft.com/sqlflow/#/
I am looking for something that does field-level lineage. I tried looking around Google and didn't see anything.

jowparks
2
votes
1 answer
Checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an RDD as detailed in the 'Learning Spark' book
In Learning Spark, I read the following:
In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this…

Chris Bedford
2
votes
1 answer
Modelling graph in Neo4j showing workflow and impact
I am new to Neo4j but can see so many possibilities in graph databases, in particular for IT data workflow and system impact. However, I am unsure of the correct design for maximum efficiency.
Consider a system that takes in files, processes them, stores them in…

Mark L
votes
1 answer
Data Lineage in SQL Server
Objective:
Let's think of a large-scale enterprise where we have heterogeneous data stores such as SQL Servers, NoSQL stores, and big data stores like ADL, ADF, etc. spread across different business groups.
Our objective is to build a lineage…

Peer Mohamed Mydeen
2
votes
2 answers
How can I see the metadata and lineage of data stored in AWS Redshift?
I am using solutions like Cloudera Navigator, Atlas, and WhereHows to get Hadoop, HDFS, Hive, Sqoop, and MapReduce metadata and lineage.
Now we have a data warehouse in AWS Redshift as well. Is there a way to extract metadata or lineage or both…

Nik
1
vote
1 answer
How to inject inlets and outlets parameters in Airflow PythonOperator executable function
I'd like to automatically set the inlets and outlets parameters in the executable function inside a PythonOperator.
However, it doesn't seem to work, although it should. You can find the code snippet below:
from datahub_provider import entities
def…

Rodion Proskuriakov