We have a high-volume streaming job (Spark/Kafka), and the data (Avro) needs to be grouped by a timestamp field inside the payload. We are doing a groupBy on the RDD to get an RDD[(Timestamp, Iterable[Record])]. This works fine at moderate volumes, but at loads like 150k records every 10 seconds the shuffle read time goes beyond 10 seconds and slows everything down.
So my question is: would switching to Structured Streaming and using groupBy on a DataFrame help here? I understand it has the Catalyst optimizer, which helps especially with SQL-like jobs. But for just grouping the data, will that make a difference? Has anyone run into a similar issue and seen a performance improvement from the switch?