Questions tagged [s3distcp]
60 questions
9
votes
1 answer
How to EMR S3DistCp groupBy properly?
I am using the AWS .NET SDK to run an s3distcp job on EMR to concatenate all files in a folder with the --groupBy argument. But whatever --groupBy argument I have tried, it either failed every time or just copied the files without concatenating them, as if no --groupBy were specified…

Barbaros Alp
- 6,405
- 8
- 47
- 61
7
votes
2 answers
Use S3DistCp to copy file from S3 to EMR
I am struggling to find a way to use S3DistCp in my AWS EMR Cluster.
Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is no longer used.
Other sources suggest using the s3-dist-cp command, which…

V. Samma
- 2,558
- 8
- 30
- 34
6
votes
3 answers
Hadoop distcp No AWS Credentials provided
I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy'. However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use…

KDC
- 1,441
- 5
- 19
- 36
5
votes
0 answers
Overwrite an existing file in S3 using S3DistCp
I am trying to use the S3-Dist-Cp command to write a log file to my S3 bucket from an EMR cluster after a daily run. If a file with the same name already exists in the S3 bucket, S3-Dist-Cp writes a new file and appends a number to the filename.
Is…

amith tumu
- 63
- 4
5
votes
3 answers
S3-Dist-Cp Failing on EMR5
I am facing issues with the s3-dist-cp command in emr-5.0.0. In my application, I need to push some files from HDFS to S3, and I am using the s3-dist-cp command to achieve this. It was working fine in emr-4.2.0, but it's not working in emr-5.0.0. If I…

bipulendra
- 51
- 1
- 2
5
votes
3 answers
How to avoid "Not a file" exceptions when reading from HDFS with spark
I copy a tree of files from S3 to HDFS with S3DistCP in an initial EMR step. hdfs dfs -ls -R hdfs:///data_dir shows the expected files, which look something…

Rob Cowie
- 22,259
- 6
- 62
- 56
4
votes
2 answers
Use s3-dist-cp to merge Parquet files
I am wondering whether it's possible to use the s3-dist-cp tool to merge Parquet files (Snappy-compressed). I tried the "--groupBy" and "--targetSize" options, and they did merge the small files into bigger files. But I then can't read them within Spark or AWS…

seiya
- 1,477
- 3
- 17
- 26
3
votes
0 answers
s3distcp failing to copy from HDFS to S3
I'm trying to copy a csv file from HDFS to S3 but the job fails with these errors:
Error: java.lang.RuntimeException: Reducer task failed to copy 1 files:…

almostkorean
- 41
- 2
3
votes
2 answers
Hadoop distcp - possible to keep each file identical (retain file size)?
When I run a simple distcp command:
hadoop distcp s3://src-bucket/src-dir s3://dest-bucket/dest-dir
I get a slight discrepancy in the size (in bytes) of src-dir and dest-dir
>aws s3 --summarize s3://dest-bucket/dest-dir/
...
Total Objects: 12290
…

pl0u
- 365
- 7
- 16
3
votes
0 answers
s3distcp - takes a long time to copy a large number of small files from one bucket to another
I need to copy a large number of small files from one S3 bucket to another. I'm using the S3-Dist-Cp command provided by AWS.
s3-dist-cp --src=s3://some-bucket/ --dest=s3://another-bucket/ --groupBy= --targetSize=…

hlagvankar
- 219
- 1
- 3
- 12
3
votes
0 answers
Reduce File Size with s3-dist-cp (--targetSize not working)
I am running a job on AWS EMR (AMI 5.2). I have large files in S3 that I would like to copy and split into another S3 location using s3-dist-cp. Here is the command I am using:
s3-dist-cp --src=s3://my-bucket/dir1/ --dest=s3://my-bucket/dir2/…

DJElbow
- 3,345
- 11
- 41
- 52
2
votes
2 answers
How can I execute an s3-dist-cp command within a spark-submit application
I have a jar file that is provided to spark-submit. Within a method in the jar, I'm trying to do a
import sys.process._
s3-dist-cp --src hdfs:///tasks/ --dest s3://
I also installed s3-dist-cp on all slaves along with…

Ram
- 159
- 1
- 10
2
votes
1 answer
move data from hdfs to s3 using session based token auth
Can someone please help me with authentication while moving data from HDFS to S3?
To connect to S3, I am generating session-based credentials using aws_key_gen (access_key, secret_key, and session-based token).
I tested, and distcp works fine with…

Manu Batham
- 331
- 1
- 14
2
votes
1 answer
Copying S3 files across AWS accounts using s3-dist-cp
I have a requirement where I need to copy files from one S3 bucket to another S3 bucket. These buckets are in different AWS accounts.
I tried using the s3 sync command. But for this, the destination IAM user must be given read access on the source…

Kishor Baindoor
- 111
- 3
- 9
2
votes
2 answers
Slow S3DistCp when copying from S3 to HDFS
I am using s3distcp to copy 31,16,886 files (300 GB) from S3 to HDFS, and it took 4 days to copy just 10,48,576 files. I killed the job and need to understand how I can reduce this time or what I am doing wrong.
s3-dist-cp --src s3://xml-prod/ --dest…

Priyanka O
- 21
- 2