
I am looking for efficient partitioning strategies for my dataframe when storing it in a delta table.

My current dataframe has 1.5000.000 rows, and it takes 3.5 h to move the data from the dataframe to the delta table.

Looking for a more efficient way to do this write, I decided to try different columns of my table as partitioning columns. I checked the cardinality of my columns and selected the following ones.

column1 = 3 distinct values
column2 = 7 distinct values
column3 = 26 distinct values
column4 = 73 distinct values
column5 = 143 distinct values
column6 = 246 distinct values
column7 = 543 distinct values
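
For reference, a minimal PySpark sketch of how such distinct counts can be computed (`df` is the dataframe; the column names above are placeholders):

from pyspark.sql import functions as F

candidate_cols = ["column1", "column2", "column3", "column4", "column5", "column6", "column7"]
# approximate distinct count for each candidate partitioning column
df.select([F.approx_count_distinct(c).alias(c) for c in candidate_cols]).show()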

cluster: 64GB, 8 cores

I am using the following code in my notebook:

df.write.partitionBy("column1").format("delta").mode("overwrite").save(path_column1)
...
df.write.partitionBy("column7").format("delta").mode("overwrite").save(path_column7)

Thus, I wanted to see which partitioning strategy would bring better results: a column with high cardinality, one with low cardinality, or one in between. To my surprise, this has had no effect: all of the runs took practically the same time, with differences of only a few minutes, but all over 3 h.

Why has this failed? Is there no advantage to partitioning?


1 Answer


When you use Delta (either Databricks or OSS Delta 1.2.x, or better, 2.0), you often may not need to use partitioning at all, for the following reasons (which don't apply to Parquet or other file formats):

  • Delta supports data skipping, which allows reading only the necessary files; it is especially effective when you use it in combination with OPTIMIZE ZORDER BY, which puts related data closer to each other (see the sketch after this list).
  • Bloom filters allow skipping files even more granularly.
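
For illustration, a minimal sketch of running OPTIMIZE with Z-ordering from a PySpark notebook (the table path and the ZORDER columns are placeholders; OPTIMIZE ... ZORDER BY needs Databricks or OSS Delta 2.0+):

table_path = "/mnt/delta/my_table"  # placeholder path

# compact small files and co-locate related values of the frequently filtered
# columns, so that data skipping can prune more files at query time
spark.sql(f"OPTIMIZE delta.`{table_path}` ZORDER BY (column3, column5)")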

The rules of thumb for using partitioning with Delta Lake tables are the following:

  • use it when it will benefit queries, especially when you perform MERGE into the table, because it allows you to avoid conflicts between parallel transactions
  • when it helps to delete old data (for example, partitioning by date; see the sketch below)
  • when it really benefits your queries. For example, you have data per country, and most of the queries will use the country as part of the condition. Or, for example, when you partition by date and query the data based on time...

In all cases, don't use partitioning for high-cardinality columns (hundreds of values) or use too many partition columns, because in most cases it leads to the creation of small files that are less efficient to read (each file is accessed separately), plus it leads to increased load on the driver, as it needs to keep metadata for each of the files.
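
As an illustration of the date-based rules above, a minimal sketch using the delta-spark Python API (the path, the `date` and `id` columns, and the `events_df`/`updates_df` dataframes are placeholders):

from delta.tables import DeltaTable

table_path = "/mnt/delta/events"  # placeholder path

# partition by a low-cardinality date column that queries and maintenance rely on
events_df.write.partitionBy("date").format("delta").mode("overwrite").save(table_path)

tbl = DeltaTable.forPath(spark, table_path)

# deleting old data only touches the matching date partitions
tbl.delete("date < '2022-01-01'")

# keeping the partition column in the MERGE condition lets Delta prune partitions
# and reduces conflicts between parallel transactions
(tbl.alias("t")
    .merge(updates_df.alias("u"), "t.date = u.date AND t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())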

Alex Ott
  • @alex-ott I disagree with omitting partitioning for a delta table, for the very reasons you gave (_especially effective when you use it in combination with OPTIMIZE ZORDER BY_): `OPTIMIZE` can be applied selectively to partitions, and, more importantly, partitioning leverages Spark SQL's support for partition pruning with no need for Delta's statistics (and hence data skipping). – Jacek Laskowski Jul 27 '22 at 11:36
  • That's why I wrote about *may not need*, not saying that it shouldn't be used – Alex Ott Jul 27 '22 at 12:10
  • Hi @AlexOtt, what would you recommend when it does not fall under any "rule of thumb", i.e. no MERGE, no deleting old data, and the column will not be part of a WHERE clause? Is it still worth doing partitions, and how many, 1 per every 10 million rows? – Enrique Benito Casado Aug 16 '22 at 13:48
  • Yes, you can omit partitioning, although sometimes it's still useful for `OPTIMIZE`, where you can optimize only "newer" partitions – Alex Ott Aug 16 '22 at 15:38
  • Appreciate so much your help, wish you the best – Enrique Benito Casado Aug 16 '22 at 21:45
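
For completeness, a sketch of the selective OPTIMIZE mentioned in the comments (the path and the `date` partition column are placeholders; OPTIMIZE with a WHERE clause on partition columns needs Databricks or OSS Delta 2.0+):

# compact only the newest partitions instead of the whole table
spark.sql("OPTIMIZE delta.`/mnt/delta/events` WHERE date >= '2022-08-01'")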