
I have a requirement where I have to use multiple files from the same directory, selected by a specific date, as input to a MapReduce job.

I'm not sure how I can do it.

hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/*.snappy /user/hdfs/eventlog_output/op1

Example: from the eventlog directory I need only the current date's files for processing.

The eventlog directory gets log data from a Flume logger agent, so it has thousands of new files coming in daily. Of those, I need only the current date's files for my process.

Thanks.

Regards, Mohan.


1 Answer


You can use the bash date command as $(date +%Y-%m-%d):

For example, running the command below will look for the /user/hdfs/eventlog/2017-01-04.snappy log file, and the output will be stored in the /user/hdfs/eventlog_output/2017-01-04 HDFS dir:

hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d).snappy /user/hdfs/eventlog_output/$(date +%Y-%m-%d)

To get a specific date format, see this answer, or type man date to learn more about the date command...
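For instance, with GNU date (the printed values are illustrative):

$ date +%Y-%m-%d                    # 2017-01-05
$ date +%d-%m-%Y                    # 05-01-2017
$ date -d "yesterday" +%Y-%m-%d     # 2017-01-04 (GNU date only)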


Update after more details were provided:

1. Explanation:

$ file=$(hadoop fs -ls /user/cloudera/*.snappy | grep $(date +%Y-%m-%d) | awk '{print $NF}')
$ echo $file
/user/cloudera/xyz.snappy
$ file_out=$(echo $file | awk -F '/' '{print $NF}' | awk -F '.' '{print $1}')
$ echo $file_out
xyz
$ hadoop jar EventLogsSW.jar EventSuspiciousWatch $file /user/hdfs/eventlog_output/$file_out
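This works because hadoop fs -ls prints each file's modification date in %Y-%m-%d format, so grep matches today's files even when the file name itself contains no date, and awk '{print $NF}' keeps only the last field, which is the full HDFS path. Sample ls output (size and owner are made up for illustration):

$ hadoop fs -ls /user/cloudera/*.snappy
-rw-r--r--   3 cloudera cloudera   52347 2017-01-05 09:12 /user/cloudera/xyz.snappy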

2. Make a shell script to reuse these commands daily, in a more logical way:

This script can process more than one file in HDFS for the present system date, all in a single hadoop job:

#!/bin/sh
#get today's snappy files (the last ls field is the full HDFS path)
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep $(date +%Y-%m-%d) | awk '{print $NF}')

counter=0
#only process if today's file(s) are available...
#(the pipeline above ends in awk, so $? would be 0 even when grep finds
# nothing; testing that $files is non-empty is the reliable check)
if [ -n "$files" ]
then
    #file(s) found, now create today's dir
    hadoop fs -mkdir /user/hdfs/eventlog/$(date +%Y-%m-%d)
    #move each file to today's dir
    for file in $files
    do
        hadoop fs -mv $file /user/hdfs/eventlog/$(date +%Y-%m-%d)/
        counter=$(($counter + 1))
    done
    #run hadoop job on the whole dir
    hadoop jar EventLogsSW.jar EventSuspiciousWatch /user/hdfs/eventlog/$(date +%Y-%m-%d) /user/hdfs/eventlog_output/$(date +%Y-%m-%d)
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."

This script can process more than one file, one file at a time, in HDFS for the present system date:

#!/bin/sh
#get today's snappy files (the last ls field is the full HDFS path)
files=$(hadoop fs -ls /user/hdfs/eventlog/*.snappy | grep $(date +%Y-%m-%d) | awk '{print $NF}')

counter=0
#only process if today's file(s) are available...
if [ -n "$files" ]
then
    for file in $files
    do
        echo "Processing file: $file ..."
        #derive the output dir name from the file name (strip dirs and extension)
        file_out=$(echo $file | awk -F '/' '{print $NF}' | awk -F '.' '{print $1}')

        #run hadoop job ($file is already a full HDFS path, so don't prefix it again)
        hadoop jar EventLogsSW.jar EventSuspiciousWatch $file /user/hdfs/eventlog_output/$file_out

        counter=$(($counter + 1))
    done
fi

echo "Total processed file(s): $counter"
echo "Done processing today's file(s)..."
Ronak Patel
  • Thanks for the response. The file name doesn't have any date in it, e.g. --199346735859.snappy – Mohan M Jan 05 '17 at 10:51
  • But this will process the files one at a time... running them all in one single hadoop job may be possible by moving all the files to be processed into a new dir and then running the hadoop job on that dir. – Ronak Patel Jan 05 '17 at 13:37
  • Thanks, it helped me a lot. – Mohan M Jan 05 '17 at 14:15