Apache Spark claims it stores lineages instead of the RDDs themselves so that it can recompute them in case of a failure. I am wondering how it stores the lineages. For example, an RDD can be built from a bunch of user-provided transformation functions, so does it store the "source code of those user-provided functions"?
Viewed 591 times
1 Answer
4
Simplifying things a little bit, RDDs are recursive data structures which describe lineages. Each RDD has a set of dependencies and is computed in a specific context. Functions passed to Spark actions and transformations are first-class objects; they can be stored, assigned, passed around and captured as part of a closure, so there is no reason (not to mention means) to store their source code.
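To make that concrete, here is a minimal sketch (hypothetical class names, not Spark's actual implementation) of a lineage as a recursive structure: each node holds a reference to its parent and to the user-supplied function object, and recomputation just walks that chain.

```python
# Toy model of a lineage (hypothetical names, not Spark's real classes):
# each node keeps its parent (dependency) and the user function as an object.
class SourceRDD:
    def __init__(self, data):
        self.data = list(data)

    def compute(self):
        return list(self.data)


class MappedRDD:
    def __init__(self, parent, f):
        self.parent = parent   # dependency: the previous node in the lineage
        self.f = f             # user-provided function, stored as a plain object

    def compute(self):
        # Recomputation walks the lineage and re-applies the stored
        # function objects; no source code is needed anywhere.
        return [self.f(x) for x in self.parent.compute()]


rdd = MappedRDD(MappedRDD(SourceRDD(range(5)), lambda x: x + 1), lambda x: x * 2)
print(rdd.compute())  # [2, 4, 6, 8, 10]
```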
RDDs belong to the Driver and are not equivalent to the data. When data is accessed on the workers, RDDs are long gone and the only thing that matters is a given task.

zero323
-
This is a good answer, but can you provide a simple example of how those closures are stored and passed around? I am using the word stored here because when I read the Spark paper it says this is stored somewhere so that if a node fails it will recompute itself after retrieval. – user1870400 Feb 15 '16 at 16:58
-
Well, AFAIK there is nothing really specific to Spark here. These are standard language tools. If you're looking for something easy to understand, take a look at Python [`globals` and `locals`](http://stackoverflow.com/q/7969949/1560062), add a robust serialization mechanism (see Spark's `cloud_pickle`), and you're ready to go. – zero323 Feb 15 '16 at 23:00
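As an illustration of that comment, here is a hedged sketch of serializing a closure with the standalone `cloudpickle` package (which PySpark's bundled `cloudpickle` module is based on); the resulting bytes carry the function's byte code and the values it references, and can be shipped to another process.

```python
# Hedged sketch: requires `pip install cloudpickle`; names are illustrative.
import cloudpickle

factor = 3
f = lambda x: x * factor        # closure referencing `factor`

blob = cloudpickle.dumps(f)     # bytes describing the function (byte code + captured refs)
g = cloudpickle.loads(blob)     # reconstruct the function, e.g. in another process
print(g(7))                     # 21
```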
-
OK, Cloud_pickle is more or less what I am looking for. When I read the code from the link below, it looks like it is indeed storing the byte code of functions. Since in Spark you can specify the storage level, say I specify the storage level as disk, I am assuming the byte code will be stored on disk and can later be used to recompute the RDD. Do you agree so far? https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L282 – user1870400 Feb 16 '16 at 07:09
-
Partially. There is no reason to persist. All of that belongs to the driver process. – zero323 Feb 16 '16 at 19:06
-
The reason I said persist is: what if there is a node failure and the RDD needs to be recomputed? If you look at the Spark paper, it clearly says it will have to go through all the transformations (the lineage) to recreate it. Do we agree now? – user1870400 Feb 16 '16 at 20:45
-
No. The RDD is stored only on the `Driver`. It is not affected by worker node failures. If the driver is lost then the whole application is lost anyway. – zero323 Feb 16 '16 at 20:52
-
OK, so in that case you are essentially saying all the transformations are stored in memory. But what if you run out of memory, even with today's storage being cheap? My biggest question is: is there any possible way the transformations are stored on disk, or would you say that will never happen? – user1870400 Feb 16 '16 at 22:03
-
Yes, this is exactly what I am saying. These are local objects. They are no different from any other object you use. They can land on disk via standard OS mechanisms like virtual memory, but that is completely transparent to your app. Once again, the RDD is not the data. Data is handled somewhere else (on the workers), can be serialized, dumped to disk and so on. – zero323 Feb 16 '16 at 22:08
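To illustrate the distinction in that comment, here is a hedged PySpark sketch (assuming a local Spark installation): `persist` with a disk storage level affects the blocks of data computed on the workers, while the RDD object describing the lineage stays in the driver process.

```python
# Hedged PySpark sketch: persisting controls where the computed data lives on
# the workers; the lineage itself is just an object graph in the driver.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "lineage-vs-data")       # assumes local Spark
rdd = sc.parallelize(range(10)).map(lambda x: x * 2)   # lineage: parallelize -> map

rdd.persist(StorageLevel.DISK_ONLY)  # worker-side data blocks go to disk
print(rdd.count())                   # triggers computation; lineage stays in the driver
sc.stop()
```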
-
So you are indeed saying transformations will never be explicitly persisted to disk (say, written out to a file), but they can land on disk through standard OS mechanisms/virtual memory like you mentioned. – user1870400 Feb 16 '16 at 22:13
-
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/103653/discussion-between-zero323-and-user1870400). – zero323 Feb 16 '16 at 22:17