I have a huge dataset that accumulates every year, quarter by quarter. The data is somewhat skewed: when I try to load everything into one DataFrame and repartition it by ("year", "quarter"), it shuffles a lot of data and spills to disk, which makes the job slow. On top of that, only one executor is doing the work about 80% of the time.
Hence I decided to: 1) get the distinct (year, quarter) groups from the DataFrame, then 2) iterate over those groups, fetching the rows where year and quarter match the current group, saving that subset as a parquet file, and continuing with the next group.
In Java I could just use a for loop over the groups, but how do I do this in Spark with Scala?
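Here is a minimal sketch of the loop I have in mind, assuming the full dataset is in a DataFrame called `df` with integer `year` and `quarter` columns; the input and output paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("GroupedParquetWrite").getOrCreate()

// Placeholder source; in practice the DataFrame is loaded from the real input.
val df: DataFrame = spark.read.parquet("/path/to/input")

// 1) Collect the distinct (year, quarter) pairs to the driver.
//    This is small: only a handful of groups per year.
val groups = df.select("year", "quarter").distinct().collect()

// 2) For each group, filter the matching rows and write them
//    out as a separate parquet file.
groups.foreach { row =>
  // Assumes both columns are integers; adjust the types if they differ.
  val year = row.getAs[Int]("year")
  val quarter = row.getAs[Int]("quarter")

  df.filter(df("year") === year && df("quarter") === quarter)
    .write
    .mode("overwrite")
    .parquet(s"/path/to/output/year=$year/quarter=$quarter")
}
```

With this approach I would presumably want to cache `df` before the loop, since otherwise each iteration re-reads the source. Is this the right way to do it, or is there a more idiomatic alternative?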