How about this:
DAYS=7; # a week
hdfs dfs -ls -R /user/hduser/input | grep -v "^d" | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v DAYS="$DAYS" 'BEGIN{ BEFORE=24*60*60*DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF > BEFORE){ print $3 }}'
Here is what each part does.
Get the list of all files under the specified location:
hdfs dfs -ls -R /user/hduser/input
Remove the directories from the output list (as we just want to delete files):
grep -v "^d"
Squeeze runs of spaces so that the output is separated by single spaces:
tr -s " "
Get the required columns (date, time and path):
cut -d' ' -f6-8
Remove any rows that do not start with a date:
grep "^[0-9]"
Process the remaining rows with awk:
Pass the age threshold (in days) to the awk script:
awk -v DAYS="$DAYS"
Convert the provided DAYS to seconds:
BEFORE=24*60*60*DAYS;
Get the current timestamp in seconds:
"date +%s" | getline NOW
Build a command that returns the epoch value of the file's HDFS timestamp:
cmd="date -d'\''"$1" "$2"'\'' +%s";
Execute the command to get the epoch value for the HDFS file:
cmd | getline WHEN;
Get the time difference:
DIFF=NOW-WHEN;
Print the file path if the difference exceeds the threshold:
if(DIFF > BEFORE){ print $3 }
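If you want to check the age test without touching HDFS, here is a minimal sketch of just the awk step fed with made-up lines. The function name older_than and the sample paths are hypothetical, and it assumes GNU date (for date -d) just like the one-liner does. It also calls close(cmd) after each lookup so that awk does not keep a pipe open for every file.

```shell
# Hypothetical helper: reads "DATE TIME PATH" lines on stdin and prints the
# paths whose timestamp is more than $1 days old. Assumes GNU date (date -d).
older_than() {
  awk -v DAYS="$1" '
    BEGIN { BEFORE = 24*60*60*DAYS; "date +%s" | getline NOW }
    {
      cmd = "date -d'\''" $1 " " $2 "'\'' +%s"
      cmd | getline WHEN
      close(cmd)   # release the pipe before the next line
      if (NOW - WHEN > BEFORE) print $3
    }'
}

# Made-up sample listing: one very old entry and one stamped a minute ago.
{
  echo "2001-01-01 00:00 /user/hduser/input/old.txt"
  echo "$(date -d '1 minute ago' '+%Y-%m-%d %H:%M') /user/hduser/input/new.txt"
} | older_than 7
# prints /user/hduser/input/old.txt
```

Only the old entry should come out, since the other one is well inside the 7 day window.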
The commands above only list the files that are older than the specified number of DAYS. Try them first; once you are sure the output is what you expect, here are the actual commands you need to DELETE those files from HDFS:
DAYS=7; # a week
hdfs dfs -ls -R /user/hduser/input | grep -v "^d" | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v DAYS="$DAYS" 'BEGIN{ BEFORE=24*60*60*DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd); DIFF=NOW-WHEN; if(DIFF > BEFORE){ system("hdfs dfs -rm "$3) }}'
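If you run this regularly, it may be handy to wrap the pipeline in a function with a dry-run switch. This is only a sketch: purge_older_than and the DRY_RUN variable are names made up here (DRY_RUN is not an HDFS option), and it still assumes GNU date and paths without spaces, same as the one-liner above.

```shell
# Hypothetical wrapper: purge_older_than DAYS DIR
# Set DRY_RUN=1 to print the delete commands instead of running them.
# DRY_RUN is an assumption of this sketch, not something HDFS knows about.
purge_older_than() {
  days=$1
  dir=$2
  hdfs dfs -ls -R "$dir" \
    | grep -v "^d" | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" \
    | awk -v DAYS="$days" '
        BEGIN { BEFORE = 24*60*60*DAYS; "date +%s" | getline NOW }
        {
          cmd = "date -d'\''" $1 " " $2 "'\'' +%s"
          cmd | getline WHEN
          close(cmd)   # release the pipe before the next line
          if (NOW - WHEN > BEFORE) print $3
        }' \
    | while IFS= read -r path; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
          echo "hdfs dfs -rm $path"
        else
          # Redirect stdin so the delete cannot consume the list of paths.
          hdfs dfs -rm "$path" < /dev/null
        fi
      done
}
```

For example, DRY_RUN=1 purge_older_than 7 /user/hduser/input should only print the hdfs dfs -rm commands it would run; drop DRY_RUN=1 to actually delete.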
Hope this helps!