
I wonder whether partitioning by multiple columns when writing a Spark DataFrame makes future reads slower. I know that partitioning by the critical columns used for future filtering improves read performance, but what is the effect of having many partition columns, even ones that will never be used for filtering?

A sample would be:

(ordersDF
  .write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("CustomerId", "OrderDate", .....) # <----------- add many columns
  .save("/storage/Orders_parquet"))
Sasan Ahmadi
  • You are mixing different issues in the same question. Data should be partitioned according to future queries. In any case, you usually don't want to partition by a high-cardinality column such as `customerId`: you will end up with as many directories as there are users in your dataset. – shay__ May 19 '20 at 09:22
  • I just found this: there would be performance implications from adding unnecessary columns to partitionBy. Columns with bounded values (Spark reference: "In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.") that are going to appear in read predicates are a great choice for partitioning, but adding columns that are never used for filtering would hurt performance (see the sketch below). – Sasan Ahmadi May 21 '20 at 20:02
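
As a minimal sketch of the suggestion in the comments (the derived OrderMonth column is hypothetical, assuming ordersDF has an OrderDate column): partition by a bounded-cardinality column instead of a high-cardinality key such as CustomerId.

from pyspark.sql import functions as F

# Hypothetical: derive a bounded-cardinality column (one value per calendar month)
ordersByMonth = ordersDF.withColumn(
    "OrderMonth", F.date_format(F.col("OrderDate"), "yyyy-MM"))

(ordersByMonth
  .write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("OrderMonth")  # one directory per month rather than one per customer
  .save("/storage/Orders_parquet"))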

1 Answer


Yes, because Spark has to shuffle and sort the data in order to create that many partitions.

There will be one output partition for every combination of the partition key values.

For example:

 suppose CustomerId has 10 unique values
 suppose OrderDate has 10 unique values
 suppose Order has 10 unique values

 The number of partitions will be 10 * 10 * 10.

Even in this small scenario, 1,000 partition directories need to be created.

So that is a lot of shuffling and sorting, which means more time.
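
One way to sanity-check this before writing (an illustrative sketch, not part of the original answer, reusing the ordersDF from the question) is to count the distinct combinations of the candidate partition columns, since that is the number of directories partitionBy will create:

# Each distinct (CustomerId, OrderDate) combination becomes one output directory
n_dirs = (ordersDF
    .select("CustomerId", "OrderDate")
    .distinct()
    .count())
print(f"partitionBy would create roughly {n_dirs} directories")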

sandeep rawat