
I have a requirement: huge data is partitioned and being inserted into Hive. To bundle this data, I am using DF.coalesce(10). Now I want to put this partitioned data into a single directory. If I use DF.coalesce(1), will the performance decrease, or is there another way to do this?

  • If you have "huge" data, putting everything into one partition is not recommended. You might end up overwhelming your master node, and thus it will fail... – eliasah Jan 23 '18 at 16:23
  • @eliasah: can you please suggest how I am supposed to handle this scenario? –  Jan 23 '18 at 16:26
  • I strongly suggest reading this question and its answers: https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv – T. Gawęda Jan 23 '18 at 16:54

1 Answer


From what I understand, you are trying to ensure that there are fewer files per partition. With coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL"), where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive: `df.repartition($"COL")`
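A minimal sketch of that suggestion for the spark-shell, assuming a hypothetical Hive-backed source table `events` and a hypothetical partition column `dt` (both placeholders for your own names):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PartitionedHiveWrite")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// "events" and "dt" are placeholders; substitute your own table
// and Hive partition column.
val df = spark.table("events")

// repartition($"dt") shuffles rows so that all rows sharing a dt
// value land in the same Spark partition; combined with
// partitionBy("dt"), each Hive partition directory then receives
// one file instead of up to 10 with coalesce(10).
df.repartition($"dt")
  .write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("events_by_dt")
```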

BalaramRaju
  • I guess `df.repartition($"COL")` won't help in my case... because I want the data in a single directory... –  Jan 24 '18 at 04:03
  • If you are writing to the same directory, you just need to use `df.repartition(100)` to get 100 equal-size files. Because the data is not partitioned, there is no need to use a column to distribute it (see the sketch below). – BalaramRaju Jan 25 '18 at 19:36
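A minimal sketch of what that last comment describes, with placeholder input and output paths; `repartition(100)` trades a full shuffle for evenly sized output files, whereas `coalesce(1)` would avoid the shuffle but funnel everything through a single task:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SingleDirectoryWrite")
  .getOrCreate()

// The input path is a placeholder; substitute your own source.
val df = spark.read.parquet("/data/input")

// repartition(100) does a full shuffle into 100 roughly equal
// partitions, so the single output directory receives 100
// similar-sized files. coalesce(1) would push all the data through
// one task, which is the performance concern raised in the question.
df.repartition(100)
  .write
  .mode("overwrite")
  .parquet("/data/output")
```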