I have a Hadoop job that outputs many part files to a folder in HDFS.

For example:

/output/s3/2014-09-10/part...

What is the best way, using the S3 Java API, to upload those parts to a single file in S3?

For example

s3://jobBucket/output-file-2014-09-10.csv

One possible solution is to merge the parts and write the result to a single HDFS file first, but that doubles the I/O. Using a single reducer is not an option either.

Thanks,

Julias

3 Answers


Try the FileUtil#copyMerge method; it allows you to copy data between two file systems. I also found the S3DistCp tool, which can copy data from HDFS to Amazon S3. You can specify the --groupBy,(.*) option to merge the files.
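For illustration, a minimal sketch of the copyMerge approach with HDFS as the source file system and S3 as the destination (Hadoop 2.x, where FileUtil.copyMerge is still available; the s3a URI, bucket name and credential setup are assumptions, not part of the original answer):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class HdfsToS3Merge {

        // Merges all part files under srcDir (on HDFS) into a single dstFile on S3.
        public static boolean mergeToS3(Configuration conf) throws IOException {
            // Hypothetical paths; replace with your own job output and target object.
            Path srcDir = new Path("hdfs:///output/s3/2014-09-10");
            Path dstFile = new Path("s3a://jobBucket/output-file-2014-09-10.csv");

            FileSystem srcFs = srcDir.getFileSystem(conf);   // HDFS
            FileSystem dstFs = dstFile.getFileSystem(conf);  // S3 (s3a), credentials via Hadoop config

            // deleteSource = false keeps the original parts; addString = null
            // means nothing is inserted between the concatenated files.
            return FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, null);
        }
    }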

Aleksei Shestakov

Snippet for a Spark process:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

void sparkProcess() throws IOException, URISyntaxException {
    SparkConf sparkConf = new SparkConf().setAppName("name");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String folderPath = "s3://bucket/output/folder";
    String mergedFilePath = "s3://bucket/output/result.txt";
    // copyMerge is the static helper defined below (placed in BatchFileUtil here)
    BatchFileUtil.copyMerge(hadoopConf, folderPath, mergedFilePath);
}

// Merges every file under srcPath into the single file at dstPath.
public static boolean copyMerge(Configuration hadoopConfig, String srcPath, String dstPath) throws IOException, URISyntaxException {
    // FileSystem.get resolves the file system from the URI scheme (s3:// here)
    FileSystem hdfs = FileSystem.get(new URI(srcPath), hadoopConfig);
    return FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null);
}
meeza

Use the Java HDFS API to read the files, use standard Java stream handling to combine them into a single InputStream, and then use

http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html

See also

https://stackoverflow.com/a/11116119/1586965
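A minimal sketch of that approach, assuming the AWS SDK for Java v1 and hypothetical bucket/key/path names: the part files are concatenated with a SequenceInputStream and uploaded as one S3 object.

    import java.io.InputStream;
    import java.io.SequenceInputStream;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectMetadata;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class PartsToS3 {

        public static void uploadMerged() throws Exception {
            Configuration conf = new Configuration();
            Path partsDir = new Path("hdfs:///output/s3/2014-09-10");   // hypothetical path
            FileSystem fs = partsDir.getFileSystem(conf);

            // Open every part file and track the total size for the S3 metadata.
            List<InputStream> streams = new ArrayList<>();
            long totalLength = 0;
            for (FileStatus status : fs.listStatus(partsDir)) {
                if (status.isFile() && status.getPath().getName().startsWith("part")) {
                    streams.add(fs.open(status.getPath()));
                    totalLength += status.getLen();
                }
            }

            // One logical stream over all the parts, in listing order.
            InputStream merged = new SequenceInputStream(Collections.enumeration(streams));

            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(totalLength);

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            s3.putObject(new PutObjectRequest(
                    "jobBucket", "output-file-2014-09-10.csv", merged, metadata));
            merged.close();
        }
    }

Note that streaming a single PutObjectRequest keeps the whole transfer in one request, so for very large merged outputs a multipart upload would be the safer choice.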

samthebest