I am exploring the operation fusion capabilities of Spark and am curious whether Spark can fuse a filter followed by a map into a single operation, e.g.:

val names = sc.parallelize(List("Subhrajit Bhattacharya", "John Doe"))

val longNames = names.filter( x => x.length > 10)
val splitLongNames = longNames.map(x => x.split(" ").toList)

If so, what would the code for the fused function look like? Also, is there any way of knowing which operations Spark is fusing?

Thanks.

  • Possible duplicate of [When does a RDD lineage is created? How to find lineage graph?](https://stackoverflow.com/questions/47693355/when-does-a-rdd-lineage-is-created-how-to-find-lineage-graph) – philantrovert Oct 05 '18 at 16:03

1 Answer

Yes, it can: Spark will "fuse" (pipeline) as many operations as possible.

This applies to operations within a single stage, i.e. narrow transformations such as filter and map that need no shuffle (no data moving between partitions). An operation such as groupByKey does need a shuffle to produce its result, so it starts a new stage.
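One way to see which operations share a stage is to print the RDD lineage with `toDebugString`. The following is a minimal sketch, not a definitive recipe: the object name `LineageSketch`, the helper `pipeline`, and the local master setting are assumptions made for illustration. As far as I know, RDDs that are pipelined together are printed at the same indentation level, and a shuffle shows up as a new indented block.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // The filter + map pipeline from the question (narrow transformations only).
  def pipeline(sc: SparkContext): RDD[List[String]] = {
    val names = sc.parallelize(List("Subhrajit Bhattacharya", "John Doe"))
    names.filter(_.length > 10).map(_.split(" ").toList)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, used here only to inspect the lineage.
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val splitLongNames = pipeline(sc)
    // No shuffle involved: filter and map belong to the same stage.
    println(splitLongNames.toDebugString)

    // groupByKey needs a shuffle, so the lineage gains a stage boundary.
    val grouped = splitLongNames.map(parts => (parts.head, parts)).groupByKey()
    println(grouped.toDebugString)

    sc.stop()
  }
}
```

The DAG visualization on the job page of the Spark web UI shows the same grouping graphically, with one box per stage.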

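In these cases each partition of the new RDD is computed directly from the corresponding partition of its parent RDD, so the operations can be chained without materializing the intermediate RDD. Fusing is therefore possible, and it is always the intent.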
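As for what the "fused" code looks like: for RDDs, Spark does not generate one combined function; each task chains lazy iterators over its partition, so every element passes through the filter and then the map before the next element is read. The sketch below imitates that with plain Scala iterators and no Spark dependency. The names `FusionSketch` and `fusedPartition` are hypothetical and only stand in for what a task does per partition.

```scala
object FusionSketch {
  // The two user functions from the question.
  val isLong: String => Boolean = _.length > 10
  val splitName: String => List[String] = _.split(" ").toList

  // Hypothetical stand-in for the per-partition work of one task:
  // the iterators are lazy, so no intermediate collection is built
  // between the filter and the map.
  def fusedPartition(rows: Iterator[String]): Iterator[List[String]] =
    rows.filter(isLong).map(splitName)

  def main(args: Array[String]): Unit = {
    val partition = Iterator("Subhrajit Bhattacharya", "John Doe")
    // prints List(Subhrajit, Bhattacharya)
    fusedPartition(partition).foreach(println)
  }
}
```

If you wanted to write the fusion by hand in Spark, `names.mapPartitions(FusionSketch.fusedPartition)` should give the same result as the filter followed by the map, although there is normally no need to, since the pipelining happens anyway.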

thebluephantom