17

I'm running an EMR Activity inside a Data Pipeline that analyzes log files, and I get the following error when the Pipeline fails:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:879)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1316)
    at com.valtira.datapipeline.stream.CloudFrontStreamLogProcessors.main(CloudFrontStreamLogProcessors.java:216)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

How can I delete that folder from Hadoop?

Suvarna Pattayil
  • 5,136
  • 5
  • 32
  • 59
cevallos.valtira
  • 191
  • 1
  • 1
  • 8

7 Answers

53

When you say delete from Hadoop, you really mean delete from HDFS.

To delete something from HDFS, do one of the following:

From the command line:

  • deprecated way:

hadoop dfs -rmr hdfs://path/to/file

  • new way (with Hadoop 2.4.1):

hdfs dfs -rm -r hdfs://path/to/file

Or from Java:

FileSystem fs = FileSystem.get(getConf()); // getConf() assumes a Configured/Tool context; otherwise pass a new Configuration()
fs.delete(new Path("path/to/file"), true); // delete file, true for recursive
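For completeness, here is a self-contained sketch of the same delete, pointed at the NameNode address and path from the error message above (the class name is made up for illustration; adjust the address to your own cluster):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteTempOutput {
    public static void main(String[] args) throws Exception {
        // NameNode address and output directory taken from the error in the question
        FileSystem fs = FileSystem.get(
                new URI("hdfs://10.208.42.127:9000"), new Configuration());
        Path dir = new Path("/home/hadoop/temp-output-s3copy");

        // true = recursive, so the directory and everything under it is removed
        if (fs.exists(dir)) {
            fs.delete(dir, true);
        }
        fs.close();
    }
}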
greedybuddha
  • 7,488
  • 3
  • 36
  • 50
  • path/to/file is "10.208.42.127:9000/home/hadoop/temp-output-s3copy"? Thanks! – cevallos.valtira May 28 '13 at 19:18
  • I haven't tested it yet. My question is should I use "10.208.42.127:9000/home/hadoop/temp-output-s3copy" as path/to/file? – cevallos.valtira May 28 '13 at 19:29
  • 1
    usually you just specify hdfs://home/hadoop/temp-output-s3copy, since files on hdfs are often replicated to several nodes. Are you doing this on a single node? – greedybuddha May 28 '13 at 19:38
  • Well if this folder is on HDFS then it should work. Though the path you gave makes me think its not on HDFS at all and instead is just a local folder. Are you doing this through command line or java? – greedybuddha May 28 '13 at 19:46
  • Im creating the pipeline through the command line, but my loganalyzer is done in Java – cevallos.valtira May 28 '13 at 19:49
  • So use the command line version `hadoop dfs -rmr hdfs://home/hadoop/temp-output-s3copy`. If that doesn't work, it's because it's not on the HDFS file system. If that's the case, you can use `hadoop dfs -rmr file://home/hadoop/temp-output-s3copy`, or just the unix `rm -r` – greedybuddha May 28 '13 at 19:58
  • From Java, did you mean FileSystem fs = FileSystem.get(fs.getConf());? I added the fs.getConf() – cevallos.valtira May 28 '13 at 20:02
  • It really depends on the Hadoop API version. Just use whatever you need to get the current configuration; if that's `fs.getConf()` then use that. – greedybuddha May 28 '13 at 20:03
  • It's not a local folder, so I'm pretty sure it is in Hadoop. I'll try this and see what happens. Thanks! – cevallos.valtira May 28 '13 at 20:18
  • So it worked the first time I run the EMRActivity. I run again using the same java class, same Pipeline configuration, but different dates and it doesn't work. I get the exact same error. The only difference that I see is in the numbers at hdfs://10.208.42.127:9000/home/hadoop/temp-output-s3copy already exists. Every new time I run the Pipeline, I get a different number. I do not know what that means. I was suggested to delete the output from S3, but it still failed. – cevallos.valtira May 29 '13 at 14:45
  • I contacted AWS support and it seemed that the problem was that the log files I was analyzing were very big and that created an issue with memory. I added to my pipeline definition "masterInstanceType" : "m1.xlarge" in the EMRCluster section and it worked. Thanks – cevallos.valtira May 30 '13 at 14:19
  • How can we achieve the same with Python? – MapReddy Usthili Jun 05 '15 at 07:17
  • org.apache.hadoop.fs.FileSystem – David Portabella Sep 26 '16 at 16:32
15

To delete a file from HDFS you can use the command below:

hadoop fs -rm -r -skipTrash /path_to_file/file_name

To delete a folder from HDFS you can use the command below:

hadoop fs -rm -r -skipTrash /folder_name

You need to use the -skipTrash option, otherwise an error may be prompted.

Udit Solanki
  • 531
  • 5
  • 12
7

With Scala:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
val fs: FileSystem = FileSystem.get(new URI(filePath), sc.hadoopConfiguration)
fs.delete(new Path(filePath), true) // true for recursive

sc is the SparkContext

Josiah Yoder
  • 3,321
  • 4
  • 40
  • 58
2

To delete a file or folder from HDFS use the command: hadoop fs -rm -r /FolderName

Kishore Bhosale
  • 549
  • 5
  • 7
1

I contacted AWS support and it seemed that the problem was that the log files I was analyzing were very big, which created a memory issue. I added "masterInstanceType" : "m1.xlarge" to the EMRCluster section of my pipeline definition and it worked.
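For reference, a minimal sketch of where that field sits in the pipeline definition JSON (the id and name values here are made up for illustration; only "masterInstanceType" comes from the fix described above):

{
  "id" : "EmrClusterForLogs",
  "name" : "EmrClusterForLogs",
  "type" : "EmrCluster",
  "masterInstanceType" : "m1.xlarge"
}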

cevallos.valtira
  • 191
  • 1
  • 1
  • 8
  • 4
    This is the answer to your question but not the answer to the question's title. – jds Jul 20 '15 at 16:32
1

From the command line:

 hadoop fs -rm -r /folder
grokster
  • 5,919
  • 1
  • 36
  • 22
0

I use Hadoop 2.6.0; the command line 'hadoop fs -rm -r fileName.hib' works fine for deleting any .hib file on my HDFS file system.

Ahmed Dib
  • 25
  • 5