
I wonder whether partitioning by multiple columns when writing a Spark DataFrame makes future reads slower. I know that partitioning by the critical columns used for future filtering improves read performance, but what is the effect of having many partition columns, even ones that will never be used for filtering?

A sample would be:

(ordersDF
  .write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("CustomerId", "OrderDate", .....) # <----------- add many columns
  .save("/storage/Orders_parquet"))
Sasan Ahmadi
  • You are mixing different issues in the same question. Data should be partitioned according to future queries. In any case, you usually don't want to partition by a high-cardinality column such as `customerId`: you will end up with as many directories as there are users in your dataset. – shay__ May 19 '20 at 09:22
  • I just found this: there would be performance implications from adding unnecessary columns to partitionBy. Columns with bounded values (Spark reference: "In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.") that are going to appear in read predicates are a great choice for partitioning, but adding columns that are never used for filtering would hurt performance (see the sketch below). – Sasan Ahmadi May 21 '20 at 20:02
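
As a minimal sketch of the suggestion in the comments (the derived OrderMonth column is hypothetical, assuming ordersDF has an OrderDate column): partition by a bounded-cardinality column instead of a high-cardinality key such as CustomerId.

from pyspark.sql import functions as F

# Hypothetical: derive a bounded-cardinality column (one value per calendar month)
ordersByMonth = ordersDF.withColumn(
    "OrderMonth", F.date_format(F.col("OrderDate"), "yyyy-MM"))

(ordersByMonth
  .write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("OrderMonth")  # one directory per month rather than one per customer
  .save("/storage/Orders_parquet"))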

1 Answer


Yes, because Spark has to shuffle and sort the data in order to create that many partitions.

There will be one output partition for every combination of the partition key values.

For example:

 suppose CustomerId has 10 unique values
 suppose OrderDate has 10 unique values
 suppose Order has 10 unique values

 The number of partitions will be 10 * 10 * 10.

Even in this small scenario, 1,000 partition directories need to be created.

So that is a lot of shuffling and sorting, which means more time.
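
One way to sanity-check this before writing (an illustrative sketch, not part of the original answer, reusing the ordersDF from the question) is to count the distinct combinations of the candidate partition columns, since that is the number of directories partitionBy will create:

# Each distinct (CustomerId, OrderDate) combination becomes one output directory
n_dirs = (ordersDF
    .select("CustomerId", "OrderDate")
    .distinct()
    .count())
print(f"partitionBy would create roughly {n_dirs} directories")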

sandeep rawat