
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. Because this operation runs frequently (every hour), I need to append those multiple files to the source file.

I found that FileUtil provides a 'copyMerge' function, but it doesn't allow appending to an existing file.
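For context, a minimal sketch of the copyMerge call being referred to, assuming Hadoop 2.x (the method was removed in Hadoop 3) and placeholder paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // copyMerge concatenates every file under srcDir into dstFile,
    // but it always creates a fresh dstFile -- it cannot append to one.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    FileUtil.copyMerge(fs, new Path("path/sourceDir"), fs, new Path("path/merged"), false, conf, null)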

Thank you for your help.

manie

1 Answer


You can do this in two ways. The first reads everything into one RDD; note that sc.textFile takes a single path string, so multiple paths are passed as a comma-separated list:

    sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")

Or, as @Pushkr has proposed:

    new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"), ...)).coalesce(1).saveAsTextFile("path/newSource")
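Equivalently, you can let SparkContext build the union instead of constructing a UnionRDD by hand (which also needs an import from org.apache.spark.rdd). A minimal sketch with the same placeholder paths:

    // Same result via SparkContext.union; no explicit UnionRDD import needed.
    sc.union(Seq(sc.textFile("path/source"), sc.textFile("path/file1")))
      .coalesce(1)
      .saveAsTextFile("path/newSource")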

If you don't want to create a new source but instead overwrite the same source every hour, you can use the DataFrame API with save mode overwrite (see How to overwrite the output directory in spark).
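A minimal sketch of that approach, assuming a Spark 2.x SparkSession named spark and the placeholder paths from above. The path/staging path is hypothetical: Spark will refuse to overwrite a path that the same job is reading from, so writing to an intermediate location first is safer:

    // Read the current source plus the hourly files, merge to one partition,
    // and write with overwrite mode to a staging path.
    val merged = spark.read.text("path/source", "path/file1", "path/file2")
    merged.coalesce(1)
      .write
      .mode("overwrite")
      .text("path/staging")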

Mehrez
  • Thank you for your answer. Do these methods allow merging files under the following path: folder1/**/*, given that I don't know the path ** ? – manie Mar 30 '17 at 08:46
  • I failed in using *. I have this path: folder1/unknownFolder2/unknownFolder3/knownFolder4/files, and I want to merge the files by giving this path: folder1/**/**/knownFolder4/f*.csv. Can you show me the correct way to use *? – manie Mar 31 '17 at 04:28
  • With this: sc.textFile("folder1/*/*/knownFolder4/f*.csv") – Mehrez Mar 31 '17 at 09:31
  • Thank you for your reply. It worked with */*. However, I still have a problem: I need to save my data as a CSV file, not as a folder of part-0000.gz files. In the overwrite discussion, they delete the destination content to be able to resave afterwards. In my case I can't do that, as every hour the content is new. In brief, my only blocking point is appending content. – manie Apr 03 '17 at 09:06
  • Does this still work if the files have broken lines between them? I.e. file1 ends with half a line with no linefeed, file2 starts with the rest of file1's last line? – runrig Sep 17 '20 at 19:41
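For reference, a minimal sketch of the glob approach from the comments above, assuming the layout manie describes. In Hadoop glob patterns a single * matches exactly one path component, so two wildcards cover the two unknown directory levels:

    // Hypothetical layout: folder1/<unknown>/<unknown>/knownFolder4/f*.csv
    sc.textFile("folder1/*/*/knownFolder4/f*.csv")
      .coalesce(1)
      .saveAsTextFile("path/newSource")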