
I'm currently having a discussion with a colleague about whether caching would be of any benefit in the following scenario:

  val dataset1 = sparkSession.read.json("...") // very expensive read
  val dataset2 = sparkSession.read.json("...") // very expensive read

  val joinedDataset = dataset1.join(dataset2)

  val reducedDataset = joinedDataset
    .mapPartitions {
      ???
    }
    .groupByKey("key")
    .reduceGroups {
      ???
    }

  reducedDataset.write.json("...")

Would caching joinedDataset increase the performance of the reduce operation? If yes, please explain why.

With caching it would be:

  val dataset1 = sparkSession.read.json("...") // very expensive read
  val dataset2 = sparkSession.read.json("...") // very expensive read

  val joinedDataset = dataset1.join(dataset2).cache

  val reducedDataset = joinedDataset
    .mapPartitions {
      ???
    }
    .groupByKey("key")
    .reduceGroups {
      ???
    }

  reducedDataset.write.json("...")
OneCricketeer

1 Answer


You should benchmark it, but it will either have no effect at all or even degrade performance:

  • No effect, because the cached data is never reused: joinedDataset feeds exactly one downstream computation. Even if it were reused, the shuffle introduced by the join would already act as a recomputation barrier, since shuffle files can be read back without recomputing the inputs.
  • Possible degradation, because caching is in general expensive: it materializes and (de)serializes the full join result.
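One way to settle it empirically is to time the job with and without the cache. A rough sketch follows; the paths and the timed helper are placeholders, not from the question, and the exact numbers will depend on your cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-benchmark").getOrCreate()

// Hypothetical input paths; substitute your own.
val d1 = spark.read.json("/data/in1")
val d2 = spark.read.json("/data/in2")

// Tiny timing helper (wall-clock only, so run each variant more than once).
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

// Uncached: the join output is pipelined straight into the write.
timed("uncached") {
  d1.join(d2).write.mode("overwrite").json("/data/out-uncached")
}

// Cached: additionally pays to materialize the join result in memory/disk,
// even though nothing ever reads it back.
timed("cached") {
  val joined = d1.join(d2).persist(StorageLevel.MEMORY_AND_DISK)
  joined.write.mode("overwrite").json("/data/out-cached")
  joined.unpersist()
}
```

Since the cached variant does strictly more work for a single action, you would expect it to be at best equal and typically slower.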
Alper t. Turker
  • Thank you. Even if the reduce operation is applied to non-joined data, i.e. dataset1, no recomputation is needed for the reduce operation, right? – Enrique Molina Feb 01 '18 at 16:16
  • Yes, all computations are pipelined and there is no branching here. Also see https://stackoverflow.com/q/34580662/8371915 – Alper t. Turker Feb 02 '18 at 08:57
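You can confirm the pipelining yourself by inspecting the physical plan. A minimal, self-contained sketch with a toy dataset standing in for the question's pipeline (the data and key are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lineage")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the question's pipeline: one linear chain, no branching.
val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

val reduced = ds
  .mapPartitions(iter => iter.map { case (k, v) => (k, v * 2) })
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))

// The plan shows the map-side work fused into a single stage before the
// shuffle introduced by groupByKey; nothing would be recomputed twice.
reduced.explain()
```

Because the lineage is a straight line with a single action at the end, there is nothing for a cache to save.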