We are facing poor performance using Spark.
I have 2 specific questions:
- When debugging we noticed that a few of the groupby operations done on Rdd are taking more time
- Also a few of the stages are appearing twice, some finishing very quickly, some taking more time
Here is a screenshot of .
Currently running locally, having shuffle partitions set to 2 and number of partitions set to 5, data is around 1,00,000 records.
Speaking of groupby operation, we are grouping a dataframe (which is a result of several joins) based on two columns, and then applying a function to get some result.
val groupedRows = rows.rdd.groupBy(row => (
row.getAs[Long](Column1),
row.getAs[Int](Column2)
))
val rdd = groupedRows.values.map(Criteria)
Where Criteria is some function acted on the grouped resultant rows. Can we optimize this group by in any way?
Here is a screenshot of the .