The goal is to run hadoop fs -ls on a directory for a date range (like 20170517 to 20170521), where the partition directories have the format /a/b/c/d/e/f/g/h/test/exp_dt=YYYY-MM-DD. Is there a way to check whether the directories for a given range of dates were modified within a timestamp range that I supply? That would help to tell whether the partitions for the same dates come from an old run or a new run.
eg:
startdate=20180517
enddate=20180521
timestamp1=2018-05-18 13:00
timestamp2=2018-05-22 13:00
inputPath=/a/b/c/d/e/f/g/h/test/
hlsCmd=`hadoop fs -ls $inputPath | awk '{timestamp = $6 ; hourMin = $7 ; path = $8 ; print timestamp; print hourMin; print path; print ","}'`
echo $hlsCmd
ingestFlag=1
startdate=20180517
enddate=20180521
date="$enddate"
dates=()
for (( date="$enddate" , cnt=1, missCnt=0, foundCnt=0 ; $date >= $startdate ; date="$(date --date="$date - 1 days" +'%Y%m%d')" , cnt++));
do
dates+=( "$date" )
if [ $ingestFlag == 1 ]; then
curDate="$(date --date="$date" +'%Y-%m-%d')"
else
curDate="$(date --date="$date" +'%Y/%m/%d')"
fi;
curDateYYYYMMDD="$(date --date="$date" +'%Y%m%d')"
fmeYYYYMM="$(date --date="$date + 1 month" +'%Y%m')"
if echo "$hlsCmd" | grep -q "$curDate" ; then
((foundCnt++))
echo "$inputPath : $curDate found"
# echo "$inputPath : $curDate found" >> $foundFileName;
else
((missCnt++))
echo "$inputPath : $curDate missing $curDateYYYYMMDD"
echo "$inputPath : $curDate missing $curDateYYYYMMDD" >> $missingFileName;
fi;
done
output:
/a/b/c/d/e/f/g/h/test/ : 2018-05-21 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-20 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-19 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-18 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-17 found
Sample output of echo $hlsCmd:
, 2018-06-06 10:33 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-03 , 2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04 , 2018-06-08 10:48 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-05 , 2018-06-08 14:38 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-06 , 2018-06-09 10:23 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-07 , 2018-06-10 11:13 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-08 , 2018-06-11 10:43 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-09 , 2018-06-12 11:16 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-10
Blocker: The problem is that the pattern match on the awk output in the above code can also match the modification timestamp of a directory (YYYY-MM-DD), not just the exp_dt value in its path, and so report false positives. The goal is to check whether the directories for a certain date range were modified within a certain timestamp range. Please let me know what can be done.
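
To make the intent concrete, here is a rough sketch of the kind of check I am after, not a tested solution. It assumes the usual 8-column hadoop fs -ls listing (permissions, replication, owner, group, size, date, time, path) and paths without spaces. The function name classify_partitions and the "new"/"old" labels are made up for illustration. The partition date is read only from the exp_dt= part of the path, so the modification date can no longer cause a false match.

```shell
# Hypothetical helper (name and labels are illustrative only).
# Reads `hadoop fs -ls` lines on stdin.
# Args: $1 start date YYYYMMDD, $2 end date YYYYMMDD,
#       $3 lower timestamp "YYYY-MM-DD HH:MM", $4 upper timestamp "YYYY-MM-DD HH:MM".
# Prints: <partition date> <modification timestamp> <new|old>
classify_partitions() {
  awk -v start="$1" -v end="$2" -v ts1="$3" -v ts2="$4" '
    # Look for the partition date in the path column only ($8),
    # never in the timestamp columns ($6 and $7).
    match($8, /exp_dt=[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) {
      part = substr($8, RSTART + 7, 10)      # text after "exp_dt="
      key = part
      gsub(/-/, "", key)                     # 2018-05-17 -> 20180517
      # Skip partitions outside the requested date range (numeric compare).
      if (key + 0 < start + 0 || key + 0 > end + 0) next
      # "YYYY-MM-DD HH:MM" sorts correctly as a plain string.
      ts = $6 " " $7
      status = (ts >= ts1 && ts <= ts2) ? "new" : "old"
      print part, ts, status
    }'
}
```

With the variables from the example above, it would be called as: hadoop fs -ls "$inputPath" | classify_partitions "$startdate" "$enddate" "$timestamp1" "$timestamp2". A partition inside the date range that produces no output line at all is missing, which would still need the loop over dates to detect.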