
If I write

dataFrame.write.format("parquet").mode("append").save("temp.parquet")

in the temp.parquet folder I get the same number of files as the number of rows.

I think I don't fully understand Parquet yet, but is this natural?

Ramesh Maharjan
Easyhyum

3 Answers


Use `coalesce` before the write operation:

dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")


EDIT-1

Upon closer inspection, the docs do warn about `coalesce`:

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)

Therefore, as suggested by @Amar, it's better to use `repartition`.
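
A quick way to see the difference is to compare the physical plans: `repartition` inserts a shuffle (an `Exchange` node) while `coalesce` does not. A minimal sketch, assuming `dataFrame` is any DataFrame (the exact plan text varies by Spark version):

dataFrame.coalesce(1).explain()    // plan shows Coalesce 1 -- no shuffle, upstream work collapses into one task
dataFrame.repartition(1).explain() // plan shows Exchange SinglePartition -- a shuffle precedes the single-partition write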

y2k-shubham

  • I have read elsewhere that coalesce is more performant. Who should we believe? – thebluephantom Aug 01 '18 at 11:37
  • While `coalesce` [minimizes data-movement](https://stackoverflow.com/a/31612810/3679900), the resulting *partitions* are not necessarily (in fact, unlikely to be) of the same size. So it's really a trade-off between less shuffle *overhead* and (*almost*) equal-sized partitions. **[1]** Therefore, *in general*, it's best to use `coalesce` and fall back to `repartition` only when degradation is observed. **[2]** However, in this particular case of `numPartitions=1`, the docs stress that `repartition` would be a better choice – y2k-shubham Aug 01 '18 at 11:52
  • I meant the shuffle, and always had the impression this is the most important aspect, but I take your point, which was my point. Interesting. – thebluephantom Aug 01 '18 at 11:55
  • Thank you y2k-shubham, thebluephantom, I've got what I want!! – Easyhyum Aug 02 '18 at 00:28

You can set the number of partitions to 1 to save the data as a single file:

dataFrame.repartition(1).write.format("parquet").mode("append").save("temp.parquet")
Klim
Amar
  • Note that `repartition(1)` should come before `write` since it is a method of [`Dataset`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@repartition(numPartitions:Int):org.apache.spark.sql.Dataset[T]) and not [`DataFrameWriter`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter) – y2k-shubham Aug 01 '18 at 11:03

Although the previous answers are correct, you have to understand the repercussions that come with repartitioning or coalescing to a single partition: all your data will have to be transferred to a single worker just to be immediately written to a single file.

As is repeatedly mentioned around the internet, you should use repartition in this scenario, despite the shuffle step that gets added to the execution plan. This step helps you use your cluster's power instead of sequentially merging files.

There is at least one alternative worth mentioning. You can write a simple script that merges all the part files into a single one. That way you avoid generating massive network traffic to a single node of your cluster; see the sketch below.
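
As one possible shape for such a script, here is a rough sketch built on parquet-mr's low-level ParquetFileWriter, which is the same mechanism behind parquet-tools' merge command. The paths are illustrative and the exact API differs across parquet-mr versions, so treat it as an outline rather than a drop-in utility:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import org.apache.parquet.hadoop.util.{HadoopInputFile, HadoopOutputFile}
import scala.collection.JavaConverters._

val conf = new Configuration()
val fs   = FileSystem.get(conf)

// Collect the part files Spark produced (directory name is illustrative)
val parts = fs.listStatus(new Path("temp.parquet"))
  .map(_.getPath)
  .filter(_.getName.startsWith("part-"))

// All part files share a schema, so read it from the first one
val firstReader = ParquetFileReader.open(HadoopInputFile.fromPath(parts.head, conf))
val schema = firstReader.getFooter.getFileMetaData.getSchema
firstReader.close()

// Stitch every part's row groups into one output file; row groups are
// copied byte-for-byte, so the data is never decoded and re-encoded
val writer = new ParquetFileWriter(
  HadoopOutputFile.fromPath(new Path("merged.parquet"), conf),
  schema, ParquetFileWriter.Mode.CREATE,
  128L * 1024 * 1024, // row-group size hint
  0)                  // max padding

writer.start()
parts.foreach(p => writer.appendFile(HadoopInputFile.fromPath(p, conf)))
writer.end(Map.empty[String, String].asJava)

Note that this concatenates the existing row groups rather than rewriting them, so the merged file keeps the same (possibly small) row groups as before; it reduces the file count without touching the data layout.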

bottaio