I have a Hadoop job that outputs many part files to a folder in HDFS.

For example:

/output/s3/2014-09-10/part...

What is the best way, using the S3 Java API, to upload those parts to a single file in S3?

For example

s3://jobBucket/output-file-2014-09-10.csv

One possible solution is to merge the parts and write the result to a single HDFS file first, but that doubles the I/O. Using a single reducer is not an option either.

Thanks,

Julias

3 Answers


Try the FileUtil#copyMerge method; it allows you to copy data between two file systems. I also found the S3DistCp tool, which can copy data from HDFS to Amazon S3. You can specify the --groupBy,(.*) option to merge the files.
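For illustration, a minimal sketch of the copyMerge approach with HDFS as the source file system and S3 as the destination (Hadoop 2.x, where FileUtil.copyMerge is still available; the s3a URI, bucket name and credential setup are assumptions, not part of the original answer):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class HdfsToS3Merge {

        // Merges all part files under srcDir (on HDFS) into a single dstFile on S3.
        public static boolean mergeToS3(Configuration conf) throws IOException {
            // Hypothetical paths; replace with your own job output and target object.
            Path srcDir = new Path("hdfs:///output/s3/2014-09-10");
            Path dstFile = new Path("s3a://jobBucket/output-file-2014-09-10.csv");

            FileSystem srcFs = srcDir.getFileSystem(conf);   // HDFS
            FileSystem dstFs = dstFile.getFileSystem(conf);  // S3 (s3a), credentials via Hadoop config

            // deleteSource = false keeps the original parts; addString = null
            // means nothing is inserted between the concatenated files.
            return FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, null);
        }
    }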

Aleksei Shestakov

Snippet for a Spark process:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

void sparkProcess() throws IOException, URISyntaxException {
    SparkConf sparkConf = new SparkConf().setAppName("name");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String folderPath = "s3://bucket/output/folder";
    String mergedFilePath = "s3://bucket/output/result.txt";
    // copyMerge is the static helper defined below (placed in BatchFileUtil here)
    BatchFileUtil.copyMerge(hadoopConf, folderPath, mergedFilePath);
}

// Merges every file under srcPath into the single file at dstPath.
public static boolean copyMerge(Configuration hadoopConfig, String srcPath, String dstPath) throws IOException, URISyntaxException {
    // FileSystem.get resolves the file system from the URI scheme (s3:// here)
    FileSystem hdfs = FileSystem.get(new URI(srcPath), hadoopConfig);
    return FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null);
}
meeza

Use the Java HDFS API to read the files, use standard Java stream handling to combine them into a single InputStream, and then use

http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html

See also

https://stackoverflow.com/a/11116119/1586965
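A minimal sketch of that approach, assuming the AWS SDK for Java v1 and hypothetical bucket/key/path names: the part files are concatenated with a SequenceInputStream and uploaded as one S3 object.

    import java.io.InputStream;
    import java.io.SequenceInputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class PartsToS3 {

        public static void uploadMerged() throws Exception {
            Configuration conf = new Configuration();
            Path partsDir = new Path("hdfs:///output/s3/2014-09-10");   // hypothetical path
            FileSystem fs = partsDir.getFileSystem(conf);

            // Open every part file and track the total size for the S3 metadata.
            List<InputStream> streams = new ArrayList<>();
            long totalLength = 0;
            for (FileStatus status : fs.listStatus(partsDir)) {
                if (status.isFile() && status.getPath().getName().startsWith("part")) {
                    streams.add(fs.open(status.getPath()));
                    totalLength += status.getLen();
                }
            }

            // One logical stream over all the parts, in listing order.
            InputStream merged = new SequenceInputStream(Collections.enumeration(streams));

            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(totalLength);

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            s3.putObject(new PutObjectRequest(
                    "jobBucket", "output-file-2014-09-10.csv", merged, metadata));
            merged.close();
        }
    }

Note that streaming a single PutObjectRequest keeps the whole transfer in one request, so for very large merged outputs a multipart upload would be the safer choice.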

samthebest