
I am trying to write a bash script that runs from cron every day, looks through an HDFS location, and removes the files in that path that have been there for more than a week. I have done quite a bit of research on different bash commands, but to be honest I have no idea where to start. Can anybody help me with this, or at least steer me in the right direction?

To be clear here. I have never written a bash script, which is why I have no idea where to start with this.

sickClick

2 Answers


How about this:

DAYS=7; # a week
hdfs dfs -ls -R /user/hduser/input | grep -v "^d" | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v DAYS="$DAYS" 'BEGIN{ BEFORE=24*60*60*DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > BEFORE){ print $3 }}'

Where:

Get list of all the files at specified location:

hdfs dfs -ls -R /user/hduser/input
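For orientation, each line of that recursive listing has eight space-separated fields: permissions, replication, owner, group, size, date, time, and path. The values below are illustrative, not real output:

```
drwxr-xr-x   - hduser supergroup          0 2016-09-01 06:52 /user/hduser/input/subdir
-rw-r--r--   3 hduser supergroup       1234 2016-09-05 10:30 /user/hduser/input/file.txt
```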

Remove the directories from the output list (as we just want to delete files):

grep -v "^d"

Squeeze runs of spaces so the columns are separated by single spaces:

tr -s " "

Get the required columns:

cut -d' ' -f6-8

Drop rows that don't start with a timestamp:

grep "^[0-9]"

Process with awk. Pass the number of days into the awk script as the deletion threshold:

awk -v DAYS="$DAYS"

Calculate the value in seconds for provided DAYS:

BEFORE=24*60*60*DAYS;
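The threshold works out like this (plain shell arithmetic, no HDFS needed):

```shell
DAYS=7
BEFORE=$((24 * 60 * 60 * DAYS))  # seconds in DAYS days
echo "$BEFORE"                   # prints 604800 for a week
```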

Get the current timestamp in seconds:

"date +%s" | getline NOW

Create a command to get the epoch value for the timestamp of the file on HDFS:

cmd="date -d'\''"$1" "$2"'\'' +%s";
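This relies on GNU date's `-d` option, which parses a timestamp string. It can be tried standalone; pinning `TZ=UTC` makes the result deterministic:

```shell
# GNU date: convert a "YYYY-MM-DD HH:MM" timestamp to epoch seconds.
TZ=UTC date -d '2016-09-13 06:52' +%s   # prints 1473749520
```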

Execute the command to get the epoch value for the HDFS file:

cmd | getline WHEN;

Get the time difference:

DIFF=NOW-WHEN;

Print the file location if the difference exceeds the threshold:

if(DIFF > BEFORE){ print $3 }

The commands above will just list the files that are older than the specified number of DAYS. Try them first; once you are sure they work fine, here are the actual commands you need to DELETE the files from HDFS:
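Before pointing this at real HDFS output, the awk age check can be exercised on a fake listing line (assumes GNU date; the path is just an example):

```shell
# Feed one "date time path" line through the same age test used in the pipeline.
printf '2016-09-01 06:52 /user/hduser/input/old.txt\n' |
awk -v DAYS=7 'BEGIN{ BEFORE=24*60*60*DAYS; "date +%s" | getline NOW }
{ cmd="date -d \"" $1 " " $2 "\" +%s"; cmd | getline WHEN;
  if (NOW-WHEN > BEFORE) print $3 }'
# prints /user/hduser/input/old.txt, since that timestamp is more than 7 days old
```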

DAYS=7; # a week
hdfs dfs -ls -R /user/hduser/input | grep -v "^d" | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" | awk -v DAYS="$DAYS" 'BEGIN{ BEFORE=24*60*60*DAYS; "date +%s" | getline NOW } { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-WHEN; if(DIFF > BEFORE){ system("hdfs dfs -rm "$3) }}'
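To run this from cron (which is what the question asks for), one approach is to save the DELETE pipeline above in a script, make it executable, and add a crontab entry. The paths and schedule below are assumptions, adjust them to your system:

```
# Save the pipeline as /home/hduser/cleanup_hdfs.sh and chmod +x it,
# then add this line via `crontab -e` to run it daily at 01:00:
0 1 * * * /home/hduser/cleanup_hdfs.sh >> /tmp/hdfs_cleanup.log 2>&1
```

Note that cron runs with a minimal environment, so the script may need to set PATH (or use the full path to the hdfs binary).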

Hope this helps!

PradeepKumbhar
  • If you end up piping to Awk anyway, it makes sense to refactor most of the rest of that hideous pipeline into the Awk script as well. – tripleee Sep 13 '16 at 06:52
  • You don't need `export` here. – tripleee Sep 13 '16 at 06:53
  • @tripleee, thanks for the feedback and yes I agree with your points. But I used pipes because I am just not aware of the AWK ways to do the same. Maybe I'll edit my answer once I get it. – PradeepKumbhar Sep 13 '16 at 08:21
  • Hello @daemon12 i want to delete some partitions from my hive table; for example `mytimestamp < 20180528104344` – Zied Hermi May 28 '18 at 10:28

If .hdfs is a file extension, you can add the following line to cron:

find /path/where/you/wanna -type f -name '*.hdfs' -print0 | xargs -0 rm -f

That should remove all files with the .hdfs extension.
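A quick local demo (no HDFS involved) shows the pattern in action; note that `*.hdfs` must be quoted so find, not the shell, expands it:

```shell
tmp=$(mktemp -d)
touch "$tmp/a.hdfs" "$tmp/b.txt"
# Quoted pattern, NUL-separated names: safe even if filenames contain spaces.
find "$tmp" -type f -name '*.hdfs' -print0 | xargs -0 rm -f
ls "$tmp"        # only b.txt remains
rm -rf "$tmp"
```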

OlegM