Questions tagged [data-lineage]

62 questions
14 votes, 4 answers

What is Lineage In Spark?

How does lineage help to recompute data? For example, I have several nodes, each computing data for 30 minutes. If one fails after 15 minutes, can we use lineage to recover the data it had already processed without spending those 15 minutes again?
Gaurav Dubey
  • 305
  • 1
  • 2
  • 10
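A toy sketch of the idea behind this question, in plain Python rather than Spark. The class `LineageDataset` and all of its methods are hypothetical names made up for illustration. It only shows the general principle: each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on its own instead of rerunning every partition. Note the simplifying assumption that every dataset here keeps its computed partitions, which Spark only does when asked to persist or cache. Also note that a failed Spark task is retried from the start of its partition, so the 15 minutes of partial work on the failed node are not resumed.

```python
class LineageDataset:
    """Toy stand-in for an RDD: partitions plus the recipe that produced them."""

    def __init__(self, num_partitions, source=None, parent=None, fn=None):
        self.num_partitions = num_partitions
        self.source = source        # root only: partition index -> list of values
        self.parent = parent        # lineage: the dataset this one was derived from
        self.fn = fn                # lineage: the function applied to each parent value
        self.materialized = {}      # partition index -> computed values
        self.compute_calls = 0      # how many partitions were actually computed

    def map(self, fn):
        # Nothing is computed here; only the lineage is recorded.
        return LineageDataset(self.num_partitions, parent=self, fn=fn)

    def compute(self, i):
        # Reuse the partition if it still exists, otherwise rebuild it from lineage.
        if i in self.materialized:
            return self.materialized[i]
        self.compute_calls += 1
        if self.parent is None:
            data = self.source(i)
        else:
            data = [self.fn(x) for x in self.parent.compute(i)]
        self.materialized[i] = data
        return data

    def lose(self, i):
        # Simulate a node failure that destroys one partition.
        self.materialized.pop(i, None)


base = LineageDataset(3, source=lambda i: list(range(i * 3, i * 3 + 3)))
squared = base.map(lambda x: x * x)

first_run = [squared.compute(i) for i in range(3)]
squared.lose(1)                   # partition 1 is lost
recovered = squared.compute(1)    # only partition 1 is recomputed
```

After the simulated failure, `squared.compute_calls` goes from 3 to 4: only the lost partition is recomputed, and the other two are left alone.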
8 votes, 3 answers

How to Monitor/inspect data/attribute flow in Java code

I have a use case where I need to capture the data flow from one API to another. For example, my code reads data from a database using Hibernate, and during data processing I convert one POJO to another, perform some more processing, and then…
M.J.
  • 16,266
  • 28
  • 75
  • 97
7 votes, 0 answers

Enabling Hive Lineage

I thought Hive lineage was not available, but after some research I have found that it can be enabled. Some of the things I found while searching were ways to enable its lineage via either Cloudera Manager or IBM InfoSphere, which I am not interested in.…
Pablo Ochoa
  • 77
  • 1
  • 12
6 votes, 2 answers

How can I perform data lineage in GCP?

When we build a data lake on GCP Cloud Storage and process the data with cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP?
3 votes, 0 answers

AWS Glue: Data Lineage and Job Tracking

Is there a way to track what each job we create in AWS Glue is doing? For example, can we tell when two jobs that perform the same action have been created, or see the lineage of the data as it goes through each transformation?
2 votes, 0 answers

Create data lineage on YugabyteDB through Apache Atlas

Not many resources are available online, but I want to create a data lineage system on data sourced from YugabyteDB through Apache Atlas. Any pointers are appreciated. For example, below is the process that I have: [TABLE A] --python function--> [TABLE…
2 votes, 1 answer

Apache Spark dataframe lineage trimming via RDD and role of cache

There is the following trick for trimming Apache Spark dataframe lineage, especially for iterative computations: def getCachedDataFrame(df: DataFrame): DataFrame = { val rdd = df.rdd.cache(); df.sqlContext.createDataFrame(rdd, df.schema) } It…
alexanoid
  • 24,051
  • 54
  • 210
  • 410
2 votes, 2 answers

Python Recursive Function from a 2 column Dataframe

I have the table below that I read into a DataFrame: n,next_n 1,2 1,3 1,6 2,4 2,8 3,5 3,9 4,7 9,10. My recursive function should return multiple lists of numbers, each followed through to the end. For example, if I select to see all the values associated with 9, I…
Jose
  • 21
  • 2
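One possible reading of this question, sketched in plain Python: treat each `n,next_n` row as an edge and return every chain that starts at the chosen value and runs until a value with no `next_n`. The excerpt is cut off before the expected output, so this interpretation is an assumption, and the names `EDGES`, `build_children`, and `paths_from` are made up for illustration. With pandas, the edge list could be built with `list(zip(df["n"], df["next_n"]))`. The sketch also assumes the data contains no cycles; a cycle would make the recursion run forever.

```python
from collections import defaultdict

# The rows from the question, as (n, next_n) pairs.
EDGES = [(1, 2), (1, 3), (1, 6), (2, 4), (2, 8), (3, 5), (3, 9), (4, 7), (9, 10)]


def build_children(edges):
    """Map each n to the list of its next_n values, in input order."""
    children = defaultdict(list)
    for n, next_n in edges:
        children[n].append(next_n)
    return dict(children)


def paths_from(node, children):
    """Return every chain from node down to a value that has no next_n."""
    if node not in children:
        # Base case: nothing follows this value, so the chain ends here.
        return [[node]]
    paths = []
    for child in children[node]:
        # Recursive case: prepend this node to every chain starting at the child.
        for tail in paths_from(child, children):
            paths.append([node] + tail)
    return paths


children = build_children(EDGES)
from_nine = paths_from(9, children)    # [[9, 10]]
from_three = paths_from(3, children)   # [[3, 5], [3, 9, 10]]
from_one = paths_from(1, children)
```

Returning a list of lists keeps each branch separate, which seems to match "multiple lists of numbers" in the question.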
2 votes, 2 answers

What are the options when it comes to handling Data Lineage in Snowflake?

Any ideas or options for handling data lineage in Snowflake? We follow a microservice architecture in which we run a set of stored procedures, each containing quite a few SQL queries, as soon as certain events are triggered. Example: When…
2 votes, 0 answers

SQL Data Lineage Viewer

I was wondering if anyone has seen an open source tool that does something like this: https://gudusoft.com/sqlflow/#/ I am looking for something that does field-level lineage. I tried searching Google and didn't find anything.
jowparks
  • 33
  • 5
2 votes, 1 answer

Checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an RDD as detailed in the 'Learning Spark' book

In Learning Spark, I read the following: In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this…
Chris Bedford
  • 2,560
  • 3
  • 28
  • 60
2 votes, 1 answer

Modelling graph in Neo4j showing workflow and impact

I'm new to Neo4j but can see so many possibilities in graph databases, in particular for IT data workflow and system impact. However, I'm unsure of the correct design for maximum efficiency. Consider a system that takes in files, processes them, stores them in…
2 votes, 1 answer

Data Lineage in SQL Server

Objective: Consider a large-scale enterprise with heterogeneous data stores such as SQL Servers, NoSQL stores, and big data stores like ADL, ADF, etc., spread across different business groups. Our objective is to build a lineage…
2 votes, 2 answers

How can I see metadata and lineage of data stored in AWS Redshift?

I am using solutions like Cloudera Navigator, Atlas, and WhereHows to get Hadoop, HDFS, Hive, Sqoop, and MapReduce metadata and lineage. Now we have a data warehouse in AWS Redshift as well. Is there a way to extract metadata or lineage or both…
1 vote, 1 answer

How to inject inlets and outlets parameters in Airflow PythonOperator executable function

I'd like to automatically set the inlets and outlets parameters in the executable function inside a PythonOperator, but it seems that this doesn't work even though it should. You can find the code snippet below: from datahub_provider import entities def…