I am trying to optimize a join between two Spark DataFrames, let's call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M rows), so I broadcast it to the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId".

In Spark, what is the difference between partitioning the data by column and bucketing the data by column?

For example:

partition:

df2 = df2.repartition(10, "SaleId")

bucket:

df2.write.format('parquet').bucketBy(10, 'SaleId').mode("overwrite").saveAsTable('bucketed_table')

After each of these techniques, I simply joined df2 with df1.
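
The join itself is essentially the following (a minimal sketch; I use the broadcast hint from pyspark.sql.functions, since df1 is the small side):

from pyspark.sql.functions import broadcast

# df1 is small (~5M rows), so hint Spark to ship a copy to every executor;
# df2 is the large 200M-row side, prepared by one of the techniques above.
result = df2.join(broadcast(df1), on="SaleId")
result.explain()  # shows which physical join strategy Spark actually picked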

I can't figure out which of these is the right technique to use. Thank you.

– nofar mishraki

1 Answer

repartition is for use within the same Spark job: it shuffles the data by the given column so that later stages of that job, such as an Action like your join, can take advantage of the partitioning.
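
For instance, with the question's DataFrames (a minimal sketch; the partition count of 200 is purely illustrative):

# repartition shuffles df2 by "SaleId" now, within this job...
df2 = df2.repartition(200, "SaleId")
# ...and later stages of the same job, such as this join, can reuse
# that partitioning. Nothing about it is persisted for a future application.
result = df2.join(df1, "SaleId")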

bucketBy is for output, i.e. for writing. Its purpose is to avoid shuffling in the next Spark application, typically as part of an ETL pipeline. Think of JOINs. See https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4861715144695760/2994977456373837/5701837197372837/latest.html which is an excellent, concise read. Note that bucketed tables can currently only be read back by Spark.
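
A minimal sketch of the pattern, reusing the question's table name (this assumes a metastore is available, since bucketBy only works together with saveAsTable):

# ETL job 1: write df2 once, bucketed by the join key.
(df2.write
    .format('parquet')
    .bucketBy(10, 'SaleId')
    .mode('overwrite')
    .saveAsTable('bucketed_table'))

# A later Spark application: read the table back. The bucketing is recorded
# in the metastore, so Spark can avoid shuffling the bucketed side of the join.
df2_bucketed = spark.table('bucketed_table')
result = df2_bucketed.join(df1, 'SaleId')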

– thebluephantom
  • But we can also write a DataFrame to disk/S3 after the repartition, and when we read it in the next Spark job it will still be partitioned. Am I wrong? – nofar mishraki Jul 03 '19 at 06:23
  • There is no guarantee it is still partitioned on reading it back in; you need to use the correct settings. Read the link. Why would bucketing be catered for otherwise? – thebluephantom Jul 03 '19 at 07:29
  • See https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs – thebluephantom Jul 03 '19 at 08:05