
If we have, say:

val rdd1 = rdd0.map(...)

followed by

val rdd2 = rdd1.filter(...)

Then, when this actually runs due to an action, can rdd2 start computing on the rdd1 results that are already known, or must it wait until all of the rdd1 work is complete? It is not apparent to me from reading the Spark documentation. Informatica pipelining does do this, so I assume Spark probably does as well.

thebluephantom

1 Answer

  • Spark transformations are lazy, so neither call does anything beyond building the dependency DAG. Your code doesn't even touch the data.

    For anything to be computed you have to execute an action on rdd2 or one of its descendants.

  • By default RDDs are also forgetful, so unless you cache rdd1 it will be evaluated all over again, every time rdd2 is evaluated.

  • Finally, because evaluation is lazy, multiple narrow transformations are combined into a single stage, and your code will interleave calls to the map and filter functions.
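
The three points above can be sketched in plain Scala, with no Spark required, by using lazy Iterators as a stand-in for the per-partition iterators a Spark stage drives. This is an analogy, not Spark's actual execution engine: building the pipeline records nothing but the transformations (like an RDD lineage), an "action" (`toList`) interleaves map and filter calls per element, and re-running the action recomputes everything, mirroring the no-cache case.

```scala
// Hedged sketch: lazy Iterators as a stand-in for one Spark stage.
object PipeliningSketch {
  val log = scala.collection.mutable.ArrayBuffer.empty[String]

  // Building the pipeline computes nothing, like defining an RDD lineage.
  def pipeline(): Iterator[Int] =
    Iterator(1, 2, 3)
      .map    { x => log += s"map($x)";    x * 10 }
      .filter { x => log += s"filter($x)"; x > 10 }

  def main(args: Array[String]): Unit = {
    val p = pipeline()
    assert(log.isEmpty)          // lazy: nothing has run yet

    val result = p.toList        // the "action" drives the pipeline
    // Calls interleave per element rather than running in bulk phases:
    println(log.mkString(", ")) // map(1), filter(10), map(2), filter(20), map(3), filter(30)
    println(result)              // List(20, 30)

    // Without caching, a second "action" recomputes the whole lineage:
    pipeline().toList
    assert(log.size == 12)
  }
}
```

So the interleaving the answer describes does mean rdd2's filter runs on each element as soon as rdd1's map produces it, within a single stage; only a shuffle (wide) dependency forces the upstream work to finish first.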

  • I know that, obviously. My question states "when actually running". I will edit it. – thebluephantom Jun 11 '18 at 10:54
  • Your last point seems most relevant to my question. The interleaving you mention seems to me to mean that pipelining as in Informatica occurs. Correct? – thebluephantom Jun 11 '18 at 10:59
  • Can you clarify the 2nd point pls? Are you saying that if rdd1 has 1000 elements, then without caching, rdd0 will be evaluated 1000 times when rdd2 is computed? I cache as standard anyway. – thebluephantom Jun 11 '18 at 11:10
  • @thebluephantom These can be useful for you: [Spark + Scala transformations, immutability & memory consumption overheads](https://stackoverflow.com/a/35146900/8371915), [(Why) do we need to call cache or persist on a RDD](https://stackoverflow.com/q/28981359/8371915) – Alper t. Turker Jun 11 '18 at 11:10
  • The point directly above clarifies my comment on "can you clarify the 2nd point pls...". SO 28981359 from user8371915: the per-element recalculation is not as bad as I thought was being stated. thx and thx to all – thebluephantom Jun 11 '18 at 11:31
  • @user9924728 "... By default RDDs are also forgetful, so unless you cache rdd1 it will be evaluated all over again, every time rdd2 is evaluated. ..." Can you clarify "every time" please in context, as I am assuming that you do not mean, worst case, what I wrote as a comment ... thx – thebluephantom Jun 11 '18 at 11:51