I'm currently having a discussion with a colleague about whether caching would be beneficial in the following scenario:
val dataset1 = sparkSession.read.json("...") // very expensive read
val dataset2 = sparkSession.read.json("...") // very expensive read

val joinedDataset = dataset1.join(dataset2)

val reducedDataset = joinedDataset
  .mapPartitions {
    ???
  }
  .groupByKey(_.getAs[String]("key"))
  .reduceGroups {
    ???
  }

reducedDataset.write.json("...")
Would caching joinedDataset help improve the performance of the reduce operation (and if yes, why)?
The version with caching would be:
val dataset1 = sparkSession.read.json("...") // very expensive read
val dataset2 = sparkSession.read.json("...") // very expensive read

val joinedDataset = dataset1.join(dataset2).cache

val reducedDataset = joinedDataset
  .mapPartitions {
    ???
  }
  .groupByKey(_.getAs[String]("key"))
  .reduceGroups {
    ???
  }

reducedDataset.write.json("...")
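
For reference, here is the variant I would use to check whether the cached join is actually reused. This is only a sketch: the explicit StorageLevel.MEMORY_AND_DISK, the explain() call, and the unpersist() at the end are my additions, and the function bodies stay as placeholders like in the snippets above.

import org.apache.spark.storage.StorageLevel
import sparkSession.implicits._

val dataset1 = sparkSession.read.json("...") // very expensive read
val dataset2 = sparkSession.read.json("...") // very expensive read

// Explicit persist instead of the default cache(): MEMORY_AND_DISK spills to
// disk if the joined data does not fit in memory.
val joinedDataset = dataset1.join(dataset2)
  .persist(StorageLevel.MEMORY_AND_DISK)

val reducedDataset = joinedDataset
  .mapPartitions {
    ???
  }
  .groupByKey(_.getAs[String]("key"))
  .reduceGroups {
    ???
  }

// An InMemoryRelation / InMemoryTableScan node in the physical plan would
// indicate that the persisted join is picked up by this query.
reducedDataset.explain()

reducedDataset.write.json("...")

// Free the cached blocks once the output has been written.
joinedDataset.unpersist()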