
I have a requirement: huge data is partitioned and being inserted into Hive. To bundle this data, I am using DF.coalesce(10). Now I want to put this partitioned data into a single directory. If I use DF.coalesce(1), will the performance decrease, or is there another way to do this?

  • If you have "huge" data, putting everything into one partition is not recommended. You might end up overwhelming your master node, and thus it will fail... – eliasah Jan 23 '18 at 16:23
  • @eliasah: can you please suggest how I am supposed to handle this scenario? –  Jan 23 '18 at 16:26
  • I strongly suggest reading this question and its answers: https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv – T. Gawęda Jan 23 '18 at 16:54

1 Answer


From what I understand, you are trying to ensure that there are fewer files per partition. With coalesce(10), you will get at most 10 files per partition. I would suggest using repartition($"COL"), where COL is the column used to partition the data. This will ensure that your "huge" data is split based on the partition column used in Hive: `df.repartition($"COL")`
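A minimal sketch of that suggestion for the spark-shell, assuming a hypothetical Hive-backed source table `events` and a hypothetical partition column `dt` (both placeholders for your own names):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PartitionedHiveWrite")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// "events" and "dt" are placeholders; substitute your own table
// and Hive partition column.
val df = spark.table("events")

// repartition($"dt") shuffles rows so that all rows sharing a dt
// value land in the same Spark partition; combined with
// partitionBy("dt"), each Hive partition directory then receives
// one file instead of up to 10 with coalesce(10).
df.repartition($"dt")
  .write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("events_by_dt")
```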

BalaramRaju
  • I guess `df.repartition($"COL")` won't help in my case... because I want the data in a single directory... –  Jan 24 '18 at 04:03
  • If you are writing to the same directory, you just need to use `df.repartition(100)` to get 100 equal-size files. Because the data is not partitioned, there is no need to use a column to distribute it (see the sketch below). – BalaramRaju Jan 25 '18 at 19:36
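A minimal sketch of what that last comment describes, with placeholder input and output paths; `repartition(100)` trades a full shuffle for evenly sized output files, whereas `coalesce(1)` would avoid the shuffle but funnel everything through a single task:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SingleDirectoryWrite")
  .getOrCreate()

// The input path is a placeholder; substitute your own source.
val df = spark.read.parquet("/data/input")

// repartition(100) does a full shuffle into 100 roughly equal
// partitions, so the single output directory receives 100
// similar-sized files. coalesce(1) would push all the data through
// one task, which is the performance concern raised in the question.
df.repartition(100)
  .write
  .mode("overwrite")
  .parquet("/data/output")
```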