16

I know that FileUtil in Hadoop 2.7 has a copyMerge function that merges multiple files into a new one.

However, according to the API docs, copyMerge is no longer available in version 3.0.

How can I merge all files within a directory into a single new file in Hadoop 3.0?

Xavier Guihot
Jeremy

3 Answers

13

Since FileUtil.copyMerge() has been deprecated and removed from the API starting in version 3, we can always re-implement it ourselves.

Here is the original Java implementation from previous versions.

Here is a Scala translation:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import java.io.IOException

def copyMerge(
    srcFS: FileSystem, srcDir: Path,
    dstFS: FileSystem, dstFile: Path,
    deleteSource: Boolean, conf: Configuration
): Boolean = {

  if (dstFS.exists(dstFile)) {
    throw new IOException(s"Target $dstFile already exists")
  }

  // Source path is expected to be a directory:
  if (srcFS.getFileStatus(srcDir).isDirectory) {

    val outputFile = dstFS.create(dstFile)
    try {
      srcFS
        .listStatus(srcDir)
        .sortBy(_.getPath.getName)
        .filter(_.isFile)
        .foreach { status =>
          // Append each regular file to the output, in file-name order:
          val inputFile = srcFS.open(status.getPath)
          try { IOUtils.copyBytes(inputFile, outputFile, conf, false) }
          finally { inputFile.close() }
        }
    } finally { outputFile.close() }

    if (deleteSource) srcFS.delete(srcDir, true) else true
  }
  else false
}
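
As a rough usage sketch, the function above could be exercised as follows. It assumes copyMerge as defined above is in scope and that hadoop-common is on the classpath. The temporary directory, the part-file names and the "-merged.txt" suffix are hypothetical and chosen only for illustration. The local file system is used so the sketch can run without a cluster.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes the copyMerge function defined above is in scope.
val conf = new Configuration()

// Local file system, used here only so the sketch runs without a cluster;
// FileSystem.get(conf) would return the configured default file system instead.
val fs = FileSystem.getLocal(conf)

// Hypothetical input: a temporary directory holding two small part files.
val srcDir = Files.createTempDirectory("copy-merge-demo")
Files.write(srcDir.resolve("part-00000"), "a\n".getBytes(StandardCharsets.UTF_8))
Files.write(srcDir.resolve("part-00001"), "b\n".getBytes(StandardCharsets.UTF_8))

// The destination must not exist yet, otherwise copyMerge throws an IOException.
val dstFile = new Path(srcDir.getParent.toString, srcDir.getFileName.toString + "-merged.txt")

// Keep the source directory (deleteSource = false) so the inputs can still be inspected.
val merged = copyMerge(fs, new Path(srcDir.toString), fs, dstFile, deleteSource = false, conf)
```

For files on HDFS, the same call should work with both FileSystem arguments obtained from FileSystem.get(conf), so the merged file is written on the cluster rather than on the local machine.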
Xavier Guihot
11

The FileUtil#copyMerge method has been removed. See these issues for details of the change:

https://issues.apache.org/jira/browse/HADOOP-12967

https://issues.apache.org/jira/browse/HADOOP-11392

You can use getmerge

Usage: hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.

Examples:

hadoop fs -getmerge -nl /src /opt/output.txt
hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt

Exit Code: Returns 0 on success and non-zero on error.

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge

ravi
  • This works, but not efficiently. It merges the files into my local directory, which first adds a delay and second requires me to put the file back on the HDFS server by copying it from my local machine again. Is there no way to do the merge and generate the new file on the HDFS server? – Jeremy Feb 05 '17 at 17:15
  • It seems there is no direct way to merge multiple files into one without copying the new file from the local file system to HDFS. See this StackOverflow question: http://stackoverflow.com/questions/10607716/how-can-i-concatenate-two-files-in-hadoop-into-one-using-hadoop-fs-shell – ravi Feb 06 '17 at 04:49
  • That was my fear. I wonder why copyMerge was removed in the latest version. – Jeremy Feb 07 '17 at 14:08
  • I am wondering the same. I think it was a really useful method. Meanwhile, you can write Java code to achieve the same. – ravi Feb 07 '17 at 14:33
7

I had the same question and had to re-implement copyMerge (in PySpark, but using the same API calls as the original copyMerge).

I have no idea why there is no equivalent functionality in Hadoop 3. We very often have to merge files from an HDFS directory into a single HDFS file.

Here is the PySpark implementation I referenced above: https://github.com/Tagar/stuff/blob/master/copyMerge.py

Tagar
  • I did some digging and here's why it was removed: https://issues.apache.org/jira/browse/HADOOP-11661. "FileUtil#copyMerge is currently unused in the Hadoop source tree. In branch-1, it had been part of the implementation of the hadoop fs -getmerge shell command. In branch-2, the code for that shell command was rewritten in a way that no longer requires this method." – Powers Jun 17 '20 at 21:22