
I want to save JSON data as a single file in HDFS. My current approach is to save the data to HDFS with Spark, then merge it down to local disk (local_tmp_file), and then move it back into HDFS (dest):

getmerge_command = 'hdfs dfs -getmerge ' + dest + ' ' + local_tmp_file
move_command = 'hdfs dfs -moveFromLocal ' + local_tmp_file + ' ' + dest

The problem happens when many of these processes run at the same time and all use the temporary local storage, which fills up the disk. Does anyone have a solution for this?
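For reference, a minimal runnable sketch of the merge flow above; the path values here are hypothetical placeholders (the real ones come from the job), and it assumes the hdfs CLI is on the PATH:

import subprocess

spark_output_dir = '/data/output/json_parts'  # hypothetical: directory written by Spark
local_tmp_file = '/tmp/merged.json'           # hypothetical local temp file
dest = '/data/output/merged.json'             # hypothetical final HDFS path

# Concatenate the part files onto local disk, then move the merged file
# back into HDFS; the local copy is what fills up the disk.
subprocess.check_call(['hdfs', 'dfs', '-getmerge', spark_output_dir, local_tmp_file])
subprocess.check_call(['hdfs', 'dfs', '-moveFromLocal', local_tmp_file, dest])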

Aldy syahdeini

2 Answers


When you are saving the data, use repartition(1):

df.repartition(1).write.mode("overwrite").format("json").save("test_file")
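Note that save("test_file") still produces a directory containing a single part-*.json file. If one named file is needed, the part file can be renamed inside HDFS so that no local disk is involved at all; a sketch, assuming the hdfs CLI is on the PATH and test_file.json is a hypothetical target path:

import subprocess

# After repartition(1) the output directory holds exactly one part file.
out = subprocess.check_output(['hdfs', 'dfs', '-ls', '-C', 'test_file'])
part_file = [p for p in out.decode().splitlines() if 'part-' in p][0]

# Rename it to the final single-file path; no data leaves HDFS.
subprocess.check_call(['hdfs', 'dfs', '-mv', part_file, 'test_file.json'])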

Ankit Kumar Namdeo

It's better to use coalesce() when decreasing the number of partitions, as it is a more optimized version of repartition(): it avoids a full shuffle of the data.

df.coalesce(1).write.mode("overwrite").format("json").save("test_file")

For more details on repartition and coalesce, check this: Spark - repartition() vs coalesce()
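A quick way to see the difference is to compare the physical plans: repartition(1) inserts a shuffle exchange, while coalesce(1) narrows the existing partitions into a single task. A sketch against any DataFrame df:

# repartition(1) shows an Exchange (full shuffle) in the plan;
# coalesce(1) shows a Coalesce node and no shuffle.
df.repartition(1).explain()
df.coalesce(1).explain()

# Either way the result ends up with a single partition.
print(df.coalesce(1).rdd.getNumPartitions())     # 1
print(df.repartition(1).rdd.getNumPartitions())  # 1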

Suresh
  • coalesce(1) and repartition(1) are the same in functionality: since the whole dataset has to be transferred into one partition, there has to be a shuffle, and it will be the same for both operations. Correct me if I am wrong. – Ankit Kumar Namdeo Nov 07 '17 at 09:51
  • The problem with this approach is that the Spark driver will go over quota, since all the data will be pulled into it. – Aldy syahdeini Nov 08 '17 at 07:23