Questions tagged [distcp]

Hadoop tool used for large inter- and intra-cluster copying.

The distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
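As a rough illustration of how a source list can be split into per-map partitions, here is a minimal Python sketch. It is not DistCp's actual code: the function name `partition_by_size`, the greedy balancing rule, and the sample paths and sizes are all assumptions made for the example.

```python
from typing import List, Tuple


def partition_by_size(files: List[Tuple[str, int]], num_maps: int) -> List[List[str]]:
    """Greedily assign (path, size) pairs to num_maps buckets, balancing total bytes.

    Illustrative only: a simplified stand-in for how a copy listing might be
    divided among map tasks, not the logic DistCp itself uses.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    buckets: List[List[str]] = [[] for _ in range(num_maps)]
    totals = [0] * num_maps
    # Place the largest files first, each into the currently lightest bucket.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        lightest = totals.index(min(totals))
        buckets[lightest].append(path)
        totals[lightest] += size
    return buckets


# Hypothetical source listing: (path, size in bytes).
source_list = [
    ("/data/a.gz", 400),
    ("/data/b.gz", 300),
    ("/data/c.gz", 200),
    ("/data/d.gz", 100),
]
partitions = partition_by_size(source_list, 2)
```

Each inner list in `partitions` stands for the files one map task would copy. In this sketch both tasks end up with 500 bytes of work.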

181 questions
9
votes
1 answer

How to EMR S3DistCp groupBy properly?

I am using the AWS .NET SDK to submit an s3distcp job to EMR to concatenate all files in a folder with the --groupBy arg. But whatever --groupBy arg I have tried, it either fails every time or just copies the files without concatenating, as if no --groupBy were specified…
Barbaros Alp
  • 6,405
  • 8
  • 47
  • 61
9
votes
3 answers

Hadoop: specify yarn queue for distcp

On our cluster we have set up dynamic resource pools. The rules are set so that YARN first looks at the specified queue, then at the username, then at the primary group ... However, with distcp I can't seem to specify a queue; it just…
Havnar
  • 2,558
  • 7
  • 33
  • 62
6
votes
3 answers

Hadoop distcp No AWS Credentials provided

I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy'. However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use…
KDC
  • 1,441
  • 5
  • 19
  • 36
6
votes
1 answer

How do I run encrypted distcp from hdfs to s3?

I would like to copy data from our on-premise Hadoop cluster to S3. I can do it unencrypted. I can also run s3cmd put with client-side encryption. How do I do distcp with client-side encryption?
questionersam
  • 1,115
  • 1
  • 11
  • 24
5
votes
0 answers

issue when copying hive partitioned table from one cluster to another

I run a distcp command to copy the hdfs location of a table to another cluster. The copy is scheduled to run every 8 hours. I run the 'msck repair table' command but not always after the copy. I have noticed that sometimes some partitions of the…
RRy
  • 433
  • 1
  • 5
  • 15
5
votes
0 answers

java.io.IOException: Error writing request body to server while submitting DistCp job

When I submit a distcp job to copy files from an insecure Hadoop cluster to a secured (kerberized) cluster, I encounter the error below: 2017-08-28 17:42:09,526 FATAL [IPC Server handler 0 on 42131] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:…
5
votes
2 answers

hadoop fs -rm -skipTrash doesn't work

I copied some files from one directory to another using hadoop distcp -Dmapreduce.job.queuename=adhoc /user/comverse/data/$CURRENT_DATE_NO_DASH_*/*rcr.gz /apps/hive/warehouse/arstel.db/fair_usage/fct_evkuzmin04/file_rcr/ I stopped the script before…
Evgenii
  • 389
  • 3
  • 7
  • 21
5
votes
1 answer

distcp failing with error "No space left on device"

I am copying an HDFS snapshot to an S3 bucket and getting the error below. The command I am executing is: hadoop distcp /.snapshot/$SNAPSHOTNAME s3a://$ACCESSKEY:$SECRETKEY@$BUCKET/$SNAPSHOTNAME 15/08/20 06:50:07 INFO mapreduce.Job: map 38% reduce…
user3640472
  • 115
  • 9
5
votes
1 answer

Hadoop DistCp handle same file name by renaming

Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example. Let's say I'm copying from hdfs:///foo to hdfs:///bar, and foo contains these…
Joe K
  • 18,204
  • 2
  • 36
  • 58
4
votes
1 answer

Hadoop Distcp - increasing distcp.dynamic.max.chunks.tolerable config and tuning distcp

I am trying to move data between two hadoop clusters using distcp. There is a lot of data to move with a large number of small files. In order to make it faster, I tried using -strategy dynamic, which according to the documentation, 'allows faster…
Hemanth
  • 705
  • 2
  • 16
  • 32
4
votes
0 answers

Getting DuplicateFileException (Records would cause duplicates) when copy data in between hadoop cluster using distcp

I am copying all files from one Hadoop cluster to another Hadoop cluster using distcp. The 1st attempt copied all the data, but on the 2nd batch of data I get the exception DuplicateFileException (Records would cause duplicates). For more detail, check the log below…
xyz_scala
  • 463
  • 1
  • 4
  • 21
4
votes
5 answers

Hdfs to s3 Distcp - Access Keys

To copy a file from HDFS to an S3 bucket I used the command hadoop distcp -Dfs.s3a.access.key=ACCESS_KEY_HERE\ -Dfs.s3a.secret.key=SECRET_KEY_HERE /path/in/hdfs s3a:/BUCKET NAME But the access key and secret key are visible here, which are not…
Vishal
  • 1,442
  • 3
  • 29
  • 48
4
votes
0 answers

How to tell distcp to ignore "file not found ..." and fall through to the next files?

We have a full HDFS backup using distcp that takes a long time to run. Some of the data on HDFS is "moving", that is, it is created and deleted. This results in mappers failing with java.io.FileNotFoundException: No such file or directory. Such…
samthebest
  • 30,803
  • 25
  • 102
  • 142
4
votes
1 answer

Hadoop distcp with partition

I am trying to do a distcp from one system to another with the same configuration (say A to B). But the partitions that I created in A are not showing up in B after the distcp from A to B. I have to manually create the partitions in B. I have gone through set…
timma
  • 223
  • 1
  • 5
  • 15
4
votes
1 answer

Data ingestion in Hadoop using Distcp

I understand that distcp is used for inter/intra-cluster transfer of data. Is it possible to use distcp to ingest data from the local file system to HDFS? I understand that you can use file:///.... to point to a local file outside of HDFS but how…
bytebiscuit
  • 3,446
  • 10
  • 33
  • 53