I'm trying to optimize a join between two Spark DataFrames, let's call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it to the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".
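For context, the broadcast join itself looks roughly like this (a minimal sketch; the read paths are placeholders, and I use the broadcast hint from pyspark.sql.functions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("sale-join").getOrCreate()

# df1 is the small DataFrame, df2 the large one; both contain a "SaleId" column
df1 = spark.read.parquet("/path/to/df1")  # placeholder path
df2 = spark.read.parquet("/path/to/df2")  # placeholder path

# hint Spark to broadcast the small side of the join
joined = df2.join(broadcast(df1), on="SaleId")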
In Spark, what is the difference between partitioning the data by column and bucketing the data by column?
For example:
partition:
df2 = df2.repartition(10, "SaleId")
bucket:
df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')
After each of those techniques, I simply joined df2 with df1, as sketched below.
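Concretely, the join after each approach looks roughly like this (a sketch; for the bucketed case I read the table back through the catalog, which I assume is needed for the bucketing metadata to be picked up):

from pyspark.sql.functions import broadcast

# after repartitioning: join the repartitioned DataFrame directly
joined_repartitioned = df2.join(broadcast(df1), on="SaleId")

# after bucketing: read the saved table back by name, then join
df2_bucketed = spark.table("bucketed_table")
joined_bucketed = df2_bucketed.join(broadcast(df1), on="SaleId")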
I can't figure out which of these is the right technique to use. Thank you.