I wish to know how to move data from an EMR cluster's HDFS file system to an S3 bucket. I recognize that I can write directly to S3 from Spark, but in principle it should also be straightforward to move the data afterwards, and so far I have not found that to be true in practice.
AWS documentation recommends s3-dist-cp for moving data between HDFS and S3. The s3-dist-cp documentation states that the HDFS source should be specified in URL format, i.e., hdfs://path/to/file. So far I have moved data between HDFS and my local file system using hadoop fs -get, which takes a plain path/to/file rather than hdfs://path/to/file, and it is unclear how to map between the two forms.
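To make the two syntaxes concrete, here is roughly how I have been invoking them; path/to/file is a placeholder for my actual paths, and the namenode host/port in the URL form is my guess at what the documentation wants:

# This works: copy from HDFS to the local file system with a plain path
hadoop fs -get path/to/file .

# What the s3-dist-cp documentation appears to expect instead: a full URL,
# presumably something like
#   hdfs://<namenode-host>:8020/path/to/file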
I am working over SSH on the master node. I tried the following, each with both two and three slashes:
hdfs:///[public IP]/path/to/file
hdfs:///[public IP]:8020/path/to/file
hdfs:///localhost/path/to/file
hdfs:///path/to/file
/path/to/file
(and many variants)
In each case, my command is formatted as per the documentation:
s3-dist-cp --src hdfs://... --dest s3://my-bucket/destination
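For concreteness, the invocations looked roughly like the following (the path and bucket name are placeholders, not my real values):

# Three slashes, no host:
s3-dist-cp --src hdfs:///path/to/file --dest s3://my-bucket/destination

# Two slashes, with the master node's public IP and port 8020:
s3-dist-cp --src hdfs://[public IP]:8020/path/to/file --dest s3://my-bucket/destination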
I have tried this with both individual files and whole directories. In each case I get an error saying that the source file does not exist. What am I doing wrong?