
I understand that my question is similar to Merge Output files after reduce phase; however, I think it may be different because I am using Spark on a single local machine, not an actual distributed file system.

I have Spark installed on a single VM (for testing). The output is written to several files (part-000000, part-000001, etc.) in a folder called 'STjoin' under Home/Spark_Hadoop/spark-1.1.1-bin-cdh4/.

The command hadoop fs -getmerge /Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin /desired/local/output/file.txt does not seem to work ("No such file or directory").

Is this because this command only applies to files stored in HDFS and not to local files, or am I misunderstanding Linux paths in general? (I am new to both Linux and HDFS.)


1 Answer


Simply run cat /path/to/source/dir/* > /path/to/output/file.txt. getmerge is the Hadoop equivalent, and it only applies to files stored in HDFS.
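
For example, using the directory from the question (the name of the merged output file is just illustrative):

    # concatenate all local part files into a single output file
    cat ~/Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin/part-* > ~/STjoin_merged.txt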

  • What about headers? If all the files have a header, will it merge the headers as well? – Sudarshan kumar Oct 23 '17 at 09:33
  • Yes... that's the case when the output of a Spark job is a set of CSV part files. In that case you will have to be more creative, for instance by removing the first line of each file before cat'ing them and, once merged, adding a single header line at the beginning of the resulting file (a sketch of this is shown below). – frb Oct 24 '17 at 09:02
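
A minimal sketch of the approach from the last comment, assuming every part file is a CSV with the same single header line (the directory and output filename are illustrative):

    #!/bin/bash
    DIR=~/Spark_Hadoop/spark-1.1.1-bin-cdh4/STjoin
    OUT=~/STjoin_merged.csv

    # take the header once, from the first part file
    first=$(ls "$DIR"/part-* | head -n 1)
    head -n 1 "$first" > "$OUT"

    # append every part file, skipping its first (header) line
    for f in "$DIR"/part-*; do
        tail -n +2 "$f" >> "$OUT"
    done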