I have a requirement where I have a huge dataset of over 2 trillion records, produced by a join. After the join, I need to aggregate on a column (the 'id' column) and get the list of distinct names for each id (collect_set('name')).
Now, while saving the join result in step 1, will I get any benefit from repartitioning it on the 'id' field, i.e. joined_df.repartition('id').write.parquet(path)?
If I then read that repartitioned DataFrame back, will Spark understand that it is already partitioned on the 'id' field, so that the group by on 'id' sees a big performance improvement?