
I have multiple files stored in HDFS, and I need to merge them into one file using Spark. Because this operation runs frequently (every hour), I need to append those multiple files to the source file.

I found that FileUtil provides a 'copyMerge' function, but it doesn't allow appending to an existing file.
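For context, a minimal sketch of the copyMerge call being referred to, assuming Hadoop 2.x (the method was removed in Hadoop 3) and placeholder paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // copyMerge concatenates every file under srcDir into dstFile,
    // but it always creates a fresh dstFile -- it cannot append to one.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    FileUtil.copyMerge(fs, new Path("path/sourceDir"), fs, new Path("path/merged"), false, conf, null)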

Thank you for your help.

manie

1 Answer


You can do this in two ways. The first reads everything into one RDD; note that sc.textFile takes a single path string, so multiple paths are passed as a comma-separated list:

    sc.textFile("path/source,path/file1,path/file2").coalesce(1).saveAsTextFile("path/newSource")

Or, as @Pushkr has proposed:

    new UnionRDD(sc, Seq(sc.textFile("path/source"), sc.textFile("path/file1"), ...)).coalesce(1).saveAsTextFile("path/newSource")
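Equivalently, you can let SparkContext build the union instead of constructing a UnionRDD by hand (which also needs an import from org.apache.spark.rdd). A minimal sketch with the same placeholder paths:

    // Same result via SparkContext.union; no explicit UnionRDD import needed.
    sc.union(Seq(sc.textFile("path/source"), sc.textFile("path/file1")))
      .coalesce(1)
      .saveAsTextFile("path/newSource")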

If you don't want to create a new source but instead overwrite the same source every hour, you can use the DataFrame API with save mode overwrite (see How to overwrite the output directory in spark).
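A minimal sketch of that approach, assuming a Spark 2.x SparkSession named spark and the placeholder paths from above. The path/staging path is hypothetical: Spark will refuse to overwrite a path that the same job is reading from, so writing to an intermediate location first is safer:

    // Read the current source plus the hourly files, merge to one partition,
    // and write with overwrite mode to a staging path.
    val merged = spark.read.text("path/source", "path/file1", "path/file2")
    merged.coalesce(1)
      .write
      .mode("overwrite")
      .text("path/staging")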

Mehrez
  • Thank you for your answer. Do these methods allow merging files under the following path: folder1/**/*, given that I don't know the path ** ? – manie Mar 30 '17 at 08:46
  • I failed in using *. I have this path: folder1/unknownFolder2/unknownFolder3/knownFolder4/files, and I want to merge the files by giving this path: folder1/**/**/knownFolder4/f*.csv. Can you show me the correct way to use *? – manie Mar 31 '17 at 04:28
  • With this: sc.textFile("folder1/*/*/knownFolder4/f*.csv") – Mehrez Mar 31 '17 at 09:31
  • Thank you for your reply. It worked with */*. However, I still have a problem: I need to save my data as a CSV file, not as a folder of part-0000.gz files. In the overwrite discussion, they delete the destination content to be able to resave afterwards. In my case I can't do that, as every hour the content is new. In brief, my only blocking point is appending content. – manie Apr 03 '17 at 09:06
  • Does this still work if the files have broken lines between them? I.e. file1 ends with half a line with no linefeed, file2 starts with the rest of file1's last line? – runrig Sep 17 '20 at 19:41
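For reference, a minimal sketch of the glob approach from the comments above, assuming the layout manie describes. In Hadoop glob patterns a single * matches exactly one path component, so two wildcards cover the two unknown directory levels:

    // Hypothetical layout: folder1/<unknown>/<unknown>/knownFolder4/f*.csv
    sc.textFile("folder1/*/*/knownFolder4/f*.csv")
      .coalesce(1)
      .saveAsTextFile("path/newSource")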