I'm looking for an efficient partitioning strategy when writing my DataFrame to a Delta table.
My DataFrame currently has about 15.000.000 rows, and it takes roughly 3.5 hours to write it from the DataFrame to the Delta table.
Looking for a more efficient way to do this write, I decided to try different columns of my table as the partitioning column. I checked the cardinality of each candidate column and selected the following ones (a sketch of how the counts could be computed follows the list).
column1 = 3 distinct values
column2 = 7 distinct values
column3 = 26 distinct values
column4 = 73 distinct values
column5 = 143 distinct values
column6 = 246 distinct values
column7 = 543 distinct values
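For reference, here is a minimal sketch of how those cardinalities could be obtained in the same notebook (assuming df is the source DataFrame; the column names are just the placeholders used above):

from pyspark.sql import functions as F

candidate_cols = ["column1", "column2", "column3", "column4",
                  "column5", "column6", "column7"]

# Approximate distinct counts for all candidate columns in a single pass
counts = df.select(
    [F.approx_count_distinct(c).alias(c) for c in candidate_cols]
).first().asDict()

for col_name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{col_name}: ~{n} distinct values")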
Cluster: 64 GB memory, 8 cores.
I'm using the following code in my notebook:
df.write.partitionBy("column_1").format("delta").mode("overwrite").save(partition_1)
..
df.write.partitionBy("column_7").format("delta").mode("overwrite").save(partition7)
My goal was to see which partitioning strategy gives the best results: a column with high cardinality, one with low cardinality, or one in between. To my surprise, the choice of partition column has had practically no effect: all of the writes took about the same time, differing only by a few minutes, and all of them exceeded 3 hours.
Why has this not worked? Is there no advantage to partitioning?