I have a requirement: huge data is partitioned and inserted into Hive. To limit the number of output files, I am using `df.coalesce(10)`. Now I want to write this partitioned data into a single directory. If I use `df.coalesce(1)`, will the performance decrease? Or is there another way to do this?
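For reference, a minimal sketch of the two writes being compared. The session setup and the table names (`source_table`, `target_table`) are placeholders, not from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("coalesce-example")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder source; in the question the data comes from an upstream job.
val df = spark.table("source_table")

// Option A: cap the output at 10 partitions, so at most 10 files per write.
df.coalesce(10)
  .write
  .mode("overwrite")
  .saveAsTable("target_table")

// Option B: collapse to a single partition. All rows flow through one task,
// so for "huge" data this is slow and can overload a single executor.
df.coalesce(1)
  .write
  .mode("overwrite")
  .saveAsTable("target_table")
```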

- If you have "huge" data, putting everything into one partition is not recommended. You might end up overwhelming your master node, and the job will fail. – eliasah Jan 23 '18 at 16:23
- @eliasah: Can you please suggest how I am supposed to handle this scenario? – Jan 23 '18 at 16:26
- I strongly suggest reading this question and its answers: https://stackoverflow.com/questions/31674530/write-single-csv-file-using-spark-csv – T. Gawęda Jan 23 '18 at 16:54
1 Answer
From what I understand, you are trying to ensure there are fewer files per partition. By using `coalesce(10)`, you will get at most 10 files per partition. I would suggest using `repartition($"COL")` instead, where `COL` is the column used to partition the data in Hive. This will ensure that your "huge" data is split based on the Hive partition column: `df.repartition($"COL")`.
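A minimal sketch of what that write could look like, assuming a Hive-enabled `spark` session is in scope; `COL` and `target_table` are placeholders:

```scala
// Repartitioning by the same column used in partitionBy sends each
// partition value's rows to the same task, so each Hive partition
// directory ends up with a small number of files.
import spark.implicits._

df.repartition($"COL")
  .write
  .mode("overwrite")
  .partitionBy("COL")
  .saveAsTable("target_table")
```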


BalaramRaju
- I guess `df.repartition($"COL")` won't help in my case, because I want the data in a single directory. – Jan 24 '18 at 04:03
- If you are writing to a single directory, you just need `df.repartition(100)` to get 100 roughly equal-sized files. Because the output is not partitioned, there is no need to use a column to distribute the data. – BalaramRaju Jan 25 '18 at 19:36
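A sketch of that suggestion; the output path and file format are placeholders:

```scala
// Shuffle into 100 roughly equal partitions, then write them all into
// one (non-partitioned) directory: one file per partition.
df.repartition(100)
  .write
  .mode("overwrite")
  .parquet("/path/to/single/output/dir")
```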