
I have data that is statically partitioned by date and dynamically partitioned by country, so for each date I could have as many as 180 country partitions. It looks something like this:

/20180101/cntry=us/ => 100kb
         /cntry=ca/ => 500kb
         /cntry=uk/ => 1.5mb

For each date, the data is small (around 20-100 MB) and it is divided among the country partitions. For a situation like this, which method would be better: repartition or coalesce? Since the data is small, would coalesce be better? I am very confused about when coalesce or repartition is the better choice depending on the size of the data.
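For context, a minimal sketch (not from the original post) of how such a write might look, assuming a DataFrame `df` with a `cntry` column and hypothetical input/output paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()
val df = spark.read.parquet("/path/to/source")   // hypothetical source

df.write
  .mode("overwrite")
  .partitionBy("cntry")        // dynamic country partitions
  .parquet("/data/20180101")   // static date partition encoded in the path
```

The question is whether to call `repartition` or `coalesce` on `df` before this write.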

user2896120
  • Useful links [link-1](https://stackoverflow.com/questions/51628958/spark-savewrite-parquet-only-one-file/51631645#comment90229298_51631645), [link-2](https://stackoverflow.com/questions/42034314/does-coalescenumpartitions-in-spark-undergo-shuffling-or-not/42036286#comment87220941_42036286) – y2k-shubham Jan 17 '19 at 06:39
  • @y2k-shubham If I use repartition, how would I know the number of partitions? – user2896120 Jan 17 '19 at 06:51
  • As already pointed out by [**@Lior Chaga**](https://stackoverflow.com/users/2204206/lior-chaga), this is a painful [**trial-and-error** process](https://stackoverflow.com/questions/51814680/how-to-auto-calculate-numrepartition-while-using-spark-dataframe-write#comment90588402_51814680) – y2k-shubham Jan 17 '19 at 07:11

1 Answer


I have had a really bad experience with coalesce due to the uneven distribution of the data. The biggest difference between coalesce and repartition is that repartition triggers a full shuffle, creating balanced new partitions, while coalesce reuses the partitions that already exist and can therefore produce unbalanced partitions, which can be pretty bad for downstream consumers of the data.
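An illustrative sketch of that difference, assuming the same DataFrame `df` as in the question: `repartition(n)` shuffles into n roughly equal partitions, while `coalesce(n)` only merges existing partitions, so the result can stay skewed.

```scala
val repartitioned = df.repartition(10)   // full shuffle, balanced partitions
val coalesced     = df.coalesce(10)      // no full shuffle, can only reduce the count

println(repartitioned.rdd.getNumPartitions)  // 10
println(coalesced.rdd.getNumPartitions)      // at most 10, sizes may remain uneven
```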

In your case, coalesce will not have a big impact because your data is already partitioned by country, and the data is small enough to be fine either way. From a development perspective, though, I personally use repartition.
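A minimal sketch of the repartition approach for this case (the column and path names are assumptions): repartitioning by the dynamic partition column before `partitionBy` keeps each country's rows in a single task, typically producing one file per country partition.

```scala
import org.apache.spark.sql.functions.col

df.repartition(col("cntry"))     // or df.repartition(1), since each date is only 20-100 MB
  .write
  .mode("overwrite")
  .partitionBy("cntry")
  .parquet("/data/20180101")
```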

You can see more details in this blog post.

Thiago Baldim
  • Agree with **@Thiago Baldim**, `coalesce` has [bitten me too](https://stackoverflow.com/q/49891929/3679900) – y2k-shubham Jan 17 '19 at 06:41
  • Hmm, if I use repartition, how would I know the number of repartitions to create though? – user2896120 Jan 17 '19 at 06:47
  • How does it differ from coalesce? You also need to decide on the number of partitions – Lior Chaga Jan 17 '19 at 06:50
  • @LiorChaga But how do you decide on the number of partitions? – user2896120 Jan 17 '19 at 06:52
  • You can start with gut feeling, and proceed with trial and error. Eventually you should consider not only the job performance itself, but also its impact on storage or downstream jobs. For instance, having no repartition at all when writing to HDFS might result in many tiny files, which is bad practice for HDFS. – Lior Chaga Jan 17 '19 at 06:58
  • @LiorChaga Hmm, what about repartitioning by column? Would that be better than setting a fixed number to repartition by? – user2896120 Jan 17 '19 at 07:10
  • When you repartition by column, you also specify the number of partitions; you just instruct Spark which columns it should partition by. So let's say you partition by column A and set 100 partitions. If the cardinality of A is very low, you will just have data skewness and are likely to end up with some partitions containing no data. – Lior Chaga Jan 17 '19 at 07:38
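To make that last comment concrete, a hedged sketch (the column name `A` and the count of 100 come straight from the comment; the DataFrame `df` is the hypothetical one from earlier): hash-partitioning on a low-cardinality column still creates the requested number of partitions, but many of them may hold no rows.

```scala
import org.apache.spark.sql.functions.col

val byColumn = df.repartition(100, col("A"))
println(byColumn.rdd.getNumPartitions)   // 100, even if column A has only a few distinct values
```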