
I have a jar file that is being provided to spark-submit. Within a method in the jar, I'm trying to run:

    import sys.process._
    "s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!

I also installed s3-dist-cp on all slaves as well as the master. The application starts and succeeds without error, but it does not move the data to S3.
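For reference, a fuller sketch of the invocation, with the exit code captured so a failure would actually surface (the Seq-based form is an illustration; src and dest are the same as above):

    import sys.process._

    // Build the command as a Seq so each argument is passed as-is, without shell splitting
    val cmd = Seq("s3-dist-cp", "--src", "hdfs:///tasks/", "--dest", "s3://<destination-bucket>")

    // .! runs the process and returns its exit code; check it rather than assuming success
    val exitCode = cmd.!
    if (exitCode != 0)
      sys.error(s"s3-dist-cp failed with exit code $exitCode")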

Ram

2 Answers


This isn't a proper direct answer to your question, but I've used hadoop distcp (https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html) instead, and it successfully moved the data. In my tests it's quite slow compared to spark.write.parquet(path), though, once you account for the time taken by the additional write to HDFS that hadoop distcp requires. I'm also very interested in the answer to your question; I think s3-dist-cp might be faster given the additional optimizations done by Amazon.
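A rough sketch of that two-step approach, assuming a DataFrame named df and a made-up HDFS staging path:

    import sys.process._

    // 1. Stage the output on HDFS first -- this extra write is the overhead mentioned above
    df.write.parquet("hdfs:///tmp/distcp-staging/")

    // 2. Copy the staged files to S3 with hadoop distcp (s3a:// is the Hadoop S3 connector scheme)
    val exitCode = Seq("hadoop", "distcp",
      "hdfs:///tmp/distcp-staging/",
      "s3a://<destination-bucket>/output/").!
    if (exitCode != 0)
      sys.error(s"hadoop distcp failed with exit code $exitCode")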

habit
  • Please add this as a comment instead of an answer. You already say it is not a direct answer! – gosuto Jan 02 '19 at 22:01
  • My account doesn't have enough rep to comment right now; if you or anyone else thinks it'd be more helpful to just remove this answer, let me know. – habit Jan 03 '19 at 17:30

s3-dist-cp now comes installed by default on the master node of the EMR cluster.

I was able to run s3-dist-cp from within spark-submit successfully when the Spark application is submitted in "client" mode.
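For example, a submission along these lines (the class name and jar are placeholders):

    spark-submit \
      --deploy-mode client \
      --class com.example.MySparkJob \
      my-application.jar

In client mode the driver runs on the master node itself, so the s3-dist-cp binary installed there is on the driver's PATH; in cluster mode the driver may land on a node where it isn't available.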

Ram