Is there a reliable way to predict which Spark DataFrame operations will preserve partitioning and which won't?
Specifically, let's say my dataframes are all partitioned with .repartition(500, 'field1', 'field2'). Can I expect an output that keeps 500 partitions arranged by those same fields if I apply any of the following (a minimal sketch follows the list):
- select()
- filter()
- groupBy() followed by agg(), grouping on 'field1' and 'field2' (the same fields used for repartitioning)
- join() on 'field1' and 'field2' when both dataframes are partitioned as above
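
Here's a minimal sketch of what I mean (the toy data and the names df1/df2 are just stand-ins for my real dataframes):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for my real dataframes.
df1 = (spark.range(1_000_000)
       .withColumn("field1", F.col("id") % 100)
       .withColumn("field2", F.col("id") % 10))
df2 = (spark.range(1_000_000)
       .withColumn("field1", F.col("id") % 100)
       .withColumn("field2", F.col("id") % 10))

# Both hash-partitioned into 500 partitions on the same two fields.
df1 = df1.repartition(500, "field1", "field2")
df2 = df2.repartition(500, "field1", "field2")

# The operations in question:
selected = df1.select("field1", "field2", "id")    # narrow, no shuffle expected
filtered = df1.filter(F.col("field1") > 50)        # narrow, no shuffle expected
grouped = df1.groupBy("field1", "field2").agg(F.count("*").alias("cnt"))
joined = df1.join(df2, on=["field1", "field2"])
```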
Given the special way my data is pre-partitioned, I'd expect no extra shuffling to take place. However, I always seem to end up with at least a few stages whose task count equals spark.sql.shuffle.partitions (200 by default). Is there any way to avoid that extra shuffle?
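
For reference, this is roughly how I'm checking, building on the sketch above (Exchange nodes in the physical plan mark shuffle boundaries):

```python
# Verify the resulting partition counts; I expected 500 in each case.
print(grouped.rdd.getNumPartitions())
print(joined.rdd.getNumPartitions())

# "Exchange hashpartitioning(...)" entries in the plan indicate shuffles.
joined.explain()
```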
Thanks