
I want to save JSON data as a single file in HDFS. My current approach is to save the data to HDFS with Spark, then merge it down to local disk (local_tmp_file), and then move it back into HDFS (dest):

getmerge_command = 'hdfs dfs -getmerge ' + dest + ' ' + local_tmp_file
move_command = 'hdfs dfs -moveFromLocal ' + local_tmp_file + ' ' + dest

The problem happens when many of these processes run at the same time and all use the temporary local storage, which fills up the disk. Does anyone have a solution for this?
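For reference, a minimal runnable sketch of the merge flow above; the path values here are hypothetical placeholders (the real ones come from the job), and it assumes the hdfs CLI is on the PATH:

import subprocess

spark_output_dir = '/data/output/json_parts'  # hypothetical: directory written by Spark
local_tmp_file = '/tmp/merged.json'           # hypothetical local temp file
dest = '/data/output/merged.json'             # hypothetical final HDFS path

# Concatenate the part files onto local disk, then move the merged file
# back into HDFS; the local copy is what fills up the disk.
subprocess.check_call(['hdfs', 'dfs', '-getmerge', spark_output_dir, local_tmp_file])
subprocess.check_call(['hdfs', 'dfs', '-moveFromLocal', local_tmp_file, dest])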

Aldy syahdeini

2 Answers


When you are saving the data, use repartition(1):

df.repartition(1).write.mode("overwrite").format("json").save("test_file")
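Note that save("test_file") still produces a directory containing a single part-*.json file. If one named file is needed, the part file can be renamed inside HDFS so that no local disk is involved at all; a sketch, assuming the hdfs CLI is on the PATH and test_file.json is a hypothetical target path:

import subprocess

# After repartition(1) the output directory holds exactly one part file.
out = subprocess.check_output(['hdfs', 'dfs', '-ls', '-C', 'test_file'])
part_file = [p for p in out.decode().splitlines() if 'part-' in p][0]

# Rename it to the final single-file path; no data leaves HDFS.
subprocess.check_call(['hdfs', 'dfs', '-mv', part_file, 'test_file.json'])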

Ankit Kumar Namdeo

It's better to use coalesce() when decreasing the number of partitions, as it is a more optimized version of repartition(): it avoids a full shuffle of the data.

df.coalesce(1).write.mode("overwrite").format("json").save("test_file")

For more details on repartition and coalesce, check this: Spark - repartition() vs coalesce()
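A quick way to see the difference is to compare the physical plans: repartition(1) inserts a shuffle exchange, while coalesce(1) narrows the existing partitions into a single task. A sketch against any DataFrame df:

# repartition(1) shows an Exchange (full shuffle) in the plan;
# coalesce(1) shows a Coalesce node and no shuffle.
df.repartition(1).explain()
df.coalesce(1).explain()

# Either way the result ends up with a single partition.
print(df.coalesce(1).rdd.getNumPartitions())     # 1
print(df.repartition(1).rdd.getNumPartitions())  # 1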

Suresh
  • coalesce(1) and repartition(1) are the same in functionality: since the whole dataset has to be transferred into one partition, there has to be a shuffle, and it will be the same for both operations. Correct me if I am wrong. – Ankit Kumar Namdeo Nov 07 '17 at 09:51
  • The problem with this approach is that the Spark driver will go over quota, since all the data will be pulled into it. – Aldy syahdeini Nov 08 '17 at 07:23