
I'm currently having a discussion with a colleague about whether caching would be of any benefit in the following scenario:

  val dataset1 = sparkSession.read.json("...") // very expensive read
  val dataset2 = sparkSession.read.json("...") // very expensive read

  val joinedDataset = dataset1.join(dataset2)

  val reducedDataset = joinedDataset
    .mapPartitions {
      ???
    }
    .groupByKey("key")
    .reduceGroups {
      ???
    }

  reducedDataset.write.json("...")

Would caching joinedDataset increase the performance of the reduce operation? If yes, please explain why.

With caching it would be:

  val dataset1 = sparkSession.read.json("...") // very expensive read
  val dataset2 = sparkSession.read.json("...") // very expensive read

  val joinedDataset = dataset1.join(dataset2).cache

  val reducedDataset = joinedDataset
    .mapPartitions {
      ???
    }
    .groupByKey("key")
    .reduceGroups {
      ???
    }

  reducedDataset.write.json("...")
OneCricketeer

1 Answer


You should benchmark it, but it will either have no effect at all or even degrade performance:

  • No effect, because the cached data is never reused: joinedDataset feeds exactly one downstream computation. Even if it were reused, the shuffle introduced by the join would already act as a recomputation barrier, since shuffle files can be read back without recomputing the inputs.
  • Possible degradation, because caching is in general expensive: it materializes and (de)serializes the full join result.
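One way to settle it empirically is to time the job with and without the cache. A rough sketch follows; the paths and the timed helper are placeholders, not from the question, and the exact numbers will depend on your cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-benchmark").getOrCreate()

// Hypothetical input paths; substitute your own.
val d1 = spark.read.json("/data/in1")
val d2 = spark.read.json("/data/in2")

// Tiny timing helper (wall-clock only, so run each variant more than once).
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

// Uncached: the join output is pipelined straight into the write.
timed("uncached") {
  d1.join(d2).write.mode("overwrite").json("/data/out-uncached")
}

// Cached: additionally pays to materialize the join result in memory/disk,
// even though nothing ever reads it back.
timed("cached") {
  val joined = d1.join(d2).persist(StorageLevel.MEMORY_AND_DISK)
  joined.write.mode("overwrite").json("/data/out-cached")
  joined.unpersist()
}
```

Since the cached variant does strictly more work for a single action, you would expect it to be at best equal and typically slower.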
Alper t. Turker
  • Thank you. Even if the reduce operation is applied to non-joined data, i.e. dataset1, no recomputation is needed for the reduce operation, right? – Enrique Molina Feb 01 '18 at 16:16
  • Yes, all computations are pipelined and there is no branching here. Also see https://stackoverflow.com/q/34580662/8371915 – Alper t. Turker Feb 02 '18 at 08:57
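You can confirm the pipelining yourself by inspecting the physical plan. A minimal, self-contained sketch with a toy dataset standing in for the question's pipeline (the data and key are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lineage")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the question's pipeline: one linear chain, no branching.
val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

val reduced = ds
  .mapPartitions(iter => iter.map { case (k, v) => (k, v * 2) })
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))

// The plan shows the map-side work fused into a single stage before the
// shuffle introduced by groupByKey; nothing would be recomputed twice.
reduced.explain()
```

Because the lineage is a straight line with a single action at the end, there is nothing for a cache to save.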