
I want to delete files in our HDFS based on their age (number of days).
The directory structure has a fixed path followed by id/year/month/date/hour/min as subdirectories.

I am still a beginner here, but the obvious choice looks like iterating through every folder and deleting.

But we are talking about millions of documents on an hourly basis.
I would like to know the best approach for this.

Phantômaxx
Sagar Saxena

1 Answer


based on their creation date in Java

It's unclear whether the "creation date" means the time the file was written to HDFS or the date in the filepath. I'll assume it's the filepath.

here we are talking millions of documents on hourly basis

Doesn't really matter. You can delete entire folder paths, like a regular filesystem. Just use bash and the hdfs cli. If you need something special, all the CLI filesystem commands are mapped to Java classes.

Delete hdfs folder from java: https://stackoverflow.com/questions/28767607/delete-hdfs-folder-from-java

If using bash, calculate the date with the date command by subtracting the number of days, and assign the result to a variable, let's say d. Make sure it's formatted to match the directory structure.

Ideally, don't just calculate the day. You want the year and month to be computed by the same date subtraction, so that the result stays correct across month and year boundaries.
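Since the asker needs this in Java, the same date arithmetic can be sketched with `java.time`, which handles month and year rollover for you. This is only a minimal sketch: the class and method names are hypothetical, and it assumes the year/month/date directories are zero-padded (for example `2018/02/06`), which the question does not confirm.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class CutoffPath {
    // Assumption: directories look like 2018/02/06 (zero-padded month and day).
    private static final DateTimeFormatter DIR_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Returns the year/month/date portion of the path for the day
    // that lies `days` days before `today`.
    static String datePath(LocalDate today, int days) {
        return today.minusDays(days).format(DIR_FORMAT);
    }

    public static void main(String[] args) {
        // Prints the directory suffix for the day 30 days before today.
        System.out.println(datePath(LocalDate.now(), 30));
    }
}
```

If the directories are not zero-padded, the pattern would need to be `yyyy/M/d` instead.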

Then simply remove everything in the path

 hadoop fs -rm -R "${FIXED_PATH}/id/${d}"

You can delete many dates in a loop - Bash: Looping through dates
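A Java equivalent of the date loop might look like the sketch below. It only builds the list of day-level directory paths for a date range; the names `DateRangePaths` and `pathsFor` are hypothetical, and the zero-padded `yyyy/MM/dd` layout is an assumption about the directory structure.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

class DateRangePaths {
    // Assumption: year/month/date directories are zero-padded.
    private static final DateTimeFormatter DIR_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // Builds one path per day in [from, toExclusive) under the given id directory.
    static List<String> pathsFor(String idRoot, LocalDate from, LocalDate toExclusive) {
        List<String> paths = new ArrayList<>();
        for (LocalDate d = from; d.isBefore(toExclusive); d = d.plusDays(1)) {
            paths.add(idRoot + "/" + d.format(DIR_FORMAT));
        }
        return paths;
    }
}
```

Each returned path covers a whole day, so one recursive delete per path removes all of that day's hour and minute subdirectories without visiting individual files.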

The only reason you would need to iterate over anything else is if you have dynamic IDs you're trying to remove.
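For the dynamic-ID case, a rough Java sketch is to list the ID directories under the fixed path and issue one recursive delete per ID for the target date. To keep the example free of Hadoop dependencies, the filesystem is hidden behind a hypothetical `HdfsOps` interface; in a real application its two methods would presumably be backed by `FileSystem.listStatus(Path)` and `FileSystem.delete(Path, boolean)` with `recursive` set to `true`. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction over the two filesystem calls this sketch needs.
interface HdfsOps {
    // Full paths of the immediate children of dir,
    // e.g. backed by FileSystem.listStatus(new Path(dir)).
    List<String> listChildren(String dir);

    // Recursive delete, e.g. backed by FileSystem.delete(new Path(path), true).
    boolean deleteRecursive(String path);
}

class PerIdCleaner {
    // Deletes <idDir>/<datePath> for every ID directory under fixedPath
    // and returns the paths for which the delete call reported success.
    static List<String> deleteDateForAllIds(HdfsOps fs, String fixedPath, String datePath) {
        List<String> deleted = new ArrayList<>();
        for (String idDir : fs.listChildren(fixedPath)) {
            String target = idDir + "/" + datePath;
            if (fs.deleteRecursive(target)) {
                deleted.add(target);
            }
        }
        return deleted;
    }
}
```

The loop runs once per ID, not once per file, so the cost depends on how many IDs exist and not on how many documents each day holds.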


Another way would be to create a (partitioned) ACID-enabled Hive table over that data.

Simply execute a delete query similar to below (correctly accounting for the date formats)

DELETE FROM t 
WHERE CONCAT(year, '-', month, '-', day) < date_sub(current_date(), ${d})

Schedule it in a cron (or Oozie) task to have it repeatedly clean out old data.

OneCricketeer
  • I need this done with Java. Nothing else. Bash was my goto as well. Pretty comfortable there. But need this to be done in a java application. – Sagar Saxena Feb 06 '18 at 04:29
  • Okay, then get comfortable with `SimpleDateFormat` and the FileSystem class API https://hadoop.apache.org/docs/r2.7.3/api/index.html?org/apache/hadoop/fs/FileSystem.html – OneCricketeer Feb 06 '18 at 04:43
  • https://stackoverflow.com/questions/28767607/delete-hdfs-folder-from-java – OneCricketeer Feb 06 '18 at 04:44