> based on their creation date in Java

It's unclear whether "creation date" means the time the file was written to HDFS or the date encoded in the file path. I'll assume it's the path.
> here we are talking millions of documents on an hourly basis

That doesn't really matter. You can delete entire folder paths, just like on a regular filesystem, using bash and the hdfs CLI. If you need something more specialized, all of the CLI filesystem commands are mapped to Java classes; see
Delete hdfs folder from java
If using bash, calculate the cutoff date with the date command, subtracting the number of days to retain, and assign the result to a variable, say d. Make sure it's formatted to match the directory structure. Ideally, don't compute just the day component: let the date subtraction roll years and months over for you.
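As a sketch, assuming GNU date and a `YYYY/MM/dd` directory layout (the `DAYS` value and path format are placeholders to adapt to your structure):

```shell
#!/usr/bin/env bash
# Compute the cutoff date, DAYS days ago, formatted like the directory tree.
# Assumes GNU date (-d); on BSD/macOS you'd use `date -v-30d` instead.
DAYS=30
d=$(date -d "-${DAYS} days" +%Y/%m/%d)
echo "$d"   # e.g. 2015/12/01 -- year and month roll over automatically
```

Because the subtraction is done on the full date, a cutoff that crosses a month or year boundary comes out correct without any extra logic.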
Then simply remove everything under that path:

hadoop fs -rm -R "${FIXED_PATH}/id/${d}"
You can delete many dates in a loop - Bash: Looping through dates
The only reason you would need to iterate over anything else is if you have dynamic IDs you're trying to remove.
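A minimal sketch of that loop, sweeping a window of dates past the cutoff (FIXED_PATH, RETAIN_DAYS, and PURGE_WINDOW are placeholder names; the hadoop call is commented out so the loop dry-runs safely):

```shell
#!/usr/bin/env bash
# Delete several old date-partitioned folders in one pass. Assumes GNU date.
FIXED_PATH=/data/events   # placeholder for your base path
RETAIN_DAYS=30            # keep the most recent 30 days
PURGE_WINDOW=7            # how many days past the cutoff to sweep

for ((i = 0; i < PURGE_WINDOW; i++)); do
  d=$(date -d "-$((RETAIN_DAYS + i)) days" +%Y/%m/%d)
  echo "would delete ${FIXED_PATH}/id/${d}"
  # hadoop fs -rm -R "${FIXED_PATH}/id/${d}"
done
```

Uncomment the hadoop line once the printed paths look right; add -skipTrash if you don't want the deletions staged in the HDFS trash.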
Another way would be to create a (partitioned) ACID-enabled Hive table over that data, then execute a delete query similar to the one below, correctly accounting for the date formats (here ${d} is the retention window in days, e.g. passed in as a hivevar):

DELETE FROM t
WHERE CONCAT(year, '-', month, '-', day) < DATE_SUB(CURRENT_DATE(), ${d})
Schedule it as a cron (or Oozie) job to repeatedly clean out old data.
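For the cron route, a sketch of a nightly crontab entry (cleanup_hdfs.sh is a hypothetical wrapper around the hadoop fs -rm or Hive DELETE command above; path and schedule are up to you):

```shell
# Run the hypothetical cleanup script every night at 02:00,
# appending its output to a log for troubleshooting.
0 2 * * * /opt/scripts/cleanup_hdfs.sh >> /var/log/cleanup_hdfs.log 2>&1
```

With Oozie you'd instead wrap the same script (or the Hive query) in a coordinator with a daily frequency.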