
The goal is to run hadoop fs -ls on a directory for a range of dates (for example 20180517 to 20180521), where the partitions have the format /a/b/c/d/e/f/g/h/test/exp_dt=YYYY-MM-DD. Is there a way to tell whether the directories for a given range of dates fall within a given timestamp range? That would help distinguish whether the partitions for the same dates come from an old run or a new run.

e.g.:
startdate=20180517
enddate=20180521
timestamp1="2018-05-18 13:00"
timestamp2="2018-05-22 13:00"



inputPath=/a/b/c/d/e/f/g/h/test/
hlsCmd=`hadoop fs -ls $inputPath | awk '{timestamp = $6 ; hourMin = $7 ; path = $8 ; print timestamp; print hourMin; print path; print ","}'`

echo $hlsCmd


ingestFlag=1
startdate=20180517
enddate=20180521
date="$enddate"

dates=()

for (( date="$enddate" , cnt=1, missCnt=0, foundCnt=0 ; $date >= $startdate ; date="$(date --date="$date - 1 days" +'%Y%m%d')" , cnt++));
do
    dates+=( "$date" )
    if [ $ingestFlag == 1 ]; then
        curDate="$(date --date="$date" +'%Y-%m-%d')"
    else
        curDate="$(date --date="$date" +'%Y/%m/%d')"
    fi;
    curDateYYYYMMDD="$(date --date="$date" +'%Y%m%d')"
    fmeYYYYMM="$(date --date="$date + 1 month" +'%Y%m')"

    if echo "$hlsCmd" | grep -q "$curDate" ; then
        ((foundCnt++))
        echo "$inputPath : $curDate found"
     #   echo "$inputPath : $curDate found" >> $foundFileName;
    else
        ((missCnt++))
        echo "$inputPath : $curDate missing $curDateYYYYMMDD"
        # missingFileName is assumed to be set earlier in the full script.
        echo "$inputPath : $curDate missing $curDateYYYYMMDD" >> "$missingFileName"
    fi
done



output:

/a/b/c/d/e/f/g/h/test/ : 2018-05-21 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-20 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-19 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-18 found
/a/b/c/d/e/f/g/h/test/ : 2018-05-17 found


Sample output of echo $hlsCmd:

, 2018-06-06 10:33 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-03 , 2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04 , 2018-06-08 10:48 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-05 , 2018-06-08 14:38 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-06 , 2018-06-09 10:23 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-07 , 2018-06-10 11:13 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-08 , 2018-06-11 10:43 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-09 , 2018-06-12 11:16 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-10
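In this listing the same YYYY-MM-DD text can appear both as a modification date and as an exp_dt= value, so a bare date match cannot tell the two apart. The following is only a sketch of one way to anchor the match on the path: partition_exists is a hypothetical helper name, and the two listing lines are copied from the sample rather than read from a live cluster.

```shell
# Two listing lines in the "date time path" shape of the sample output.
listing='2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04
2018-06-08 10:48 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-05'

# partition_exists DATE (hypothetical helper): reads a listing on stdin and
# succeeds only when some line ends in exp_dt=DATE.
partition_exists() {
    grep -q "exp_dt=$1\$"
}

# 2018-06-07 occurs here only as a modification date, never as a partition.
if printf '%s\n' "$listing" | grep -q "2018-06-07"; then
    echo "bare grep: 2018-06-07 matched"
fi
if ! printf '%s\n' "$listing" | partition_exists "2018-06-07"; then
    echo "anchored grep: 2018-06-07 not matched"
fi
```

Anchoring on exp_dt= and the end of the line assumes the path is the last field and has no trailing whitespace.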

Blocker: the grep on the awk output above can also match a directory's modification timestamp (YYYY-MM-DD), not only the exp_dt= date in its path, and so report false positives. The goal is to check whether the directories for a certain date range were modified within a certain timestamp range. Please let me know what can be done.
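One possible direction, offered as a sketch rather than a tested answer: because both the exp_dt= value and the modification time are in ISO order, plain string comparison in awk can check both ranges at once. filter_partitions is a hypothetical name; it assumes the date, time and path are the last three fields of each listing line (as $6, $7 and $8 are in the code above), and it takes its date bounds as YYYY-MM-DD rather than YYYYMMDD. The owner and group columns in the sample are placeholders.

```shell
# filter_partitions START END TS_FROM TS_TO (hypothetical helper)
# Reads "hadoop fs -ls"-style lines on stdin and prints the paths whose
# exp_dt= date is within [START, END] and whose modification time
# ("YYYY-MM-DD HH:MM") is within [TS_FROM, TS_TO].
filter_partitions() {
    awk -v s="$1" -v e="$2" -v t1="$3" -v t2="$4" '
        NF < 3 { next }                      # skip blank or short lines
        {
            path = $NF
            n = index(path, "exp_dt=")
            if (n == 0) next                 # skip "Found N items" and other paths
            part  = substr(path, n + 7, 10)  # the YYYY-MM-DD after exp_dt=
            mtime = $(NF-2) " " $(NF-1)      # modification date and time
            if (part >= s && part <= e && mtime >= t1 && mtime <= t2)
                print path
        }'
}

# Sample lines; the date, time and path values come from the output above.
sample='Found 3 items
drwxr-xr-x   - user group          0 2018-06-06 10:33 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-03
drwxr-xr-x   - user group          0 2018-06-07 12:30 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-04
drwxr-xr-x   - user group          0 2018-06-08 10:48 /a/b/c/d/e/f/g/h/test/exp_dt=2018-06-05'

printf '%s\n' "$sample" |
    filter_partitions 2018-06-03 2018-06-05 "2018-06-07 00:00" "2018-06-08 23:59"
```

Against a cluster the input would instead come from hadoop fs -ls "$inputPath", and the YYYYMMDD bounds could be converted with date --date="$startdate" +'%Y-%m-%d' as the loop above already does.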

  • Possible duplicate of [Find files in terminal between a date range](https://stackoverflow.com/questions/18339307/find-files-in-terminal-between-a-date-range) – jeremysprofile Jun 18 '18 at 19:01
  • @jeremysprofile : Trying to get results for a path in Hadoop cluster (hdfs) not local file system. So it is not a duplicate. Thanks – Nick Jun 18 '18 at 19:46
