
I have a jar file that is being provided to spark-submit. Within a method in the jar, I'm trying to run:

    import sys.process._
    "s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!

I also installed s3-dist-cp on all slaves as well as the master. The application starts and succeeds without error, but it does not move the data to S3.
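For reference, a fuller sketch of the invocation, with the exit code captured so a failure would actually surface (the Seq-based form is an illustration; src and dest are the same as above):

    import sys.process._

    // Build the command as a Seq so each argument is passed as-is, without shell splitting
    val cmd = Seq("s3-dist-cp", "--src", "hdfs:///tasks/", "--dest", "s3://<destination-bucket>")

    // .! runs the process and returns its exit code; check it rather than assuming success
    val exitCode = cmd.!
    if (exitCode != 0)
      sys.error(s"s3-dist-cp failed with exit code $exitCode")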

Ram

2 Answers


This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead, and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though, once you account for the time taken by the additional write to HDFS that hadoop distcp requires. I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
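A rough sketch of that two-step approach, assuming a DataFrame named df and a made-up HDFS staging path:

    import sys.process._

    // 1. Stage the output on HDFS first -- this extra write is the overhead mentioned above
    df.write.parquet("hdfs:///tmp/distcp-staging/")

    // 2. Copy the staged files to S3 with hadoop distcp (s3a:// is the Hadoop S3 connector scheme)
    val exitCode = Seq("hadoop", "distcp",
      "hdfs:///tmp/distcp-staging/",
      "s3a://<destination-bucket>/output/").!
    if (exitCode != 0)
      sys.error(s"hadoop distcp failed with exit code $exitCode")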

habit
  • Please add this as a comment instead of an answer. You already say it is not a direct answer! – gosuto Jan 02 '19 at 22:01
  • My account doesn't have enough rep to comment right now; if you or anyone else thinks it'd be more helpful to just remove this answer, let me know. – habit Jan 03 '19 at 17:30

s3-dist-cp now comes installed by default on the master node of the EMR cluster.

I was able to run s3-dist-cp from within spark-submit successfully when the Spark application is submitted in "client" mode.
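For example, a submission along these lines (the class name and jar are placeholders):

    spark-submit \
      --deploy-mode client \
      --class com.example.MySparkJob \
      my-application.jar

In client mode the driver runs on the master node itself, so the s3-dist-cp binary installed there is on the driver's PATH; in cluster mode the driver may land on a node where it isn't available.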

Ram