
We have a high-volume streaming job (Spark/Kafka), and the data (Avro) needs to be grouped by a timestamp field inside the payload. We are doing a groupBy on the RDD to achieve this: RDD[(Timestamp, Iterable[Record])]. This works well at moderate volumes, but at loads of around 150k records every 10 seconds the shuffle read time exceeds 10 seconds and slows everything down.
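For context, a minimal sketch of that RDD approach (the helper names `extractTimestamp` and `saveGroup` are placeholders, not actual code from the job):

```scala
// Hedged sketch of the described RDD-based grouping, assuming a DStream of
// deserialized Avro records. extractTimestamp and saveGroup are hypothetical.
stream.foreachRDD { rdd =>
  rdd
    .map(record => (extractTimestamp(record), record)) // key by payload timestamp
    .groupByKey()                                      // shuffles every record to its group
    .foreach { case (ts, records) =>
      saveGroup(ts, records)                           // persist each group as-is
    }
}
```

The cost here is that `groupByKey` moves every record across the network during the shuffle, which is what shows up as shuffle read time.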

So my question is: will switching to Structured Streaming and using groupBy on a DataFrame help here? I understand it has the Catalyst optimizer, which helps especially with SQL-like jobs. But will it help when all we do is group the data? Does anyone have a use case where a similar issue was solved this way, with a performance improvement?
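For comparison, the closest Dataset equivalent would look roughly like this (the `Record` case class and `eventTime` field are assumptions for illustration):

```scala
// Hedged sketch of the Dataset equivalent, not a recommendation.
// Record and eventTime are hypothetical names standing in for the Avro payload.
case class Record(eventTime: java.sql.Timestamp, payload: Array[Byte])

val ds: Dataset[Record] = ???  // the deserialized stream

// groupByKey on a Dataset returns a KeyValueGroupedDataset; mapGroups on it
// still shuffles all records to their group, much like RDD.groupByKey.
val grouped = ds.groupByKey(_.eventTime)
```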

Shawn
  • Possible duplicate of [DataFrame / Dataset groupBy behaviour/optimization](https://stackoverflow.com/q/32902982/8371915) – Alper t. Turker Feb 01 '18 at 16:03
  • I agree that for aggregation, a DF will be beneficial because it can optimize with both map- and reduce-side reduction. But if I only have to groupBy and save the results for each group, will using a DF give any performance gain (other than pre-selecting the column for grouping)? "However other methods of KeyValueGroupedDataset might work similarly to RDD.groupByKey." That quoted statement in the answer you mentioned sounds like it will be about the same. Am I missing anything? – Shawn Feb 01 '18 at 16:21

0 Answers