0

How can we achieve using Apache Pig :

File :

A  2014/10/01
A  2014/09/01
A  2014/08/01
A  2014/02/01

Result should A count 3, since i want to count the number of records using rolling window of 30 days between records group by A.

vikrant rana
  • 4,509
  • 6
  • 32
  • 72
jk7
  • 89
  • 1
  • 7
  • what is the expected output of this input? is it 4 or 5 "A 2014-12-01 A 2014-11-01 A 2014-10-01 A 2014-08-01 A 2014-06-01 A 2014-04-01 A 2014-03-01" – Sivasakthi Jayaraman Nov 07 '14 at 11:50
  • The Expected result is 5 , because A 2014-12-01 A 2014-11-01 A 2014-10-01 comes under 30 days rolling window and A 2014-04-01 A 2014-03-01. – jk7 Nov 08 '14 at 03:50
  • @Sivasakthi, can you pls help me to resolve this above. – jk7 Nov 09 '14 at 13:34
  • your requirement is not straight forward, so far i have completed the diff between each days but the total counting will be little bit challenging. for this i may need to write one UDF. – Sivasakthi Jayaraman Nov 09 '14 at 13:39
  • Hey Saivasakthi...any luck on my question. I am not able to get the answer,pls help – jk7 Nov 12 '14 at 21:55
  • Can you tell me the following scenarios. All the dates are start at 1st of every month. Is it possible that the date will come in the middle of month something like this (2014/10/15,2014/09/14,2014/18/10). In this case how to handle?. What happens if Feb month comes in between, in this case only 28 or 29 days? – Sivasakthi Jayaraman Nov 16 '14 at 18:39
  • Hi Siva..Thanks for your reply. Yes dates will come in any range for eg the date what you mentioned in above coment. The requirement is date for eg : 2014/10/01 , 2014/10/02 , 2014/10/03 , 2014/09/01 if we consider the 30 days rolling, it should consider only 3 dates 2014/10/01 , 2014/10/02 , 2014/10/03. – jk7 Nov 17 '14 at 16:43

1 Answers1

0

Please find the solution, i hope you can do further enhancement if it required. Try to execute with your input and let me know how it works.

input.txt

A 2014/12/01
A 2014/11/01
A 2014/10/01
A 2014/07/01
A 2014/05/01
A 2014/04/01
B 2014/09/01
B 2014/07/01
B 2014/06/01
B 2014/02/01
C 2014/09/01
C 2014/07/01
C 2014/05/01

Expected output

A 5
B 2
C 0

PigScript:

REGISTER rollingCount.jar;
A = LOAD 'input.txt' Using PigStorage(' ') AS (f1:chararray,f2:chararray);
B = GROUP A BY f1;
C = FOREACH B GENERATE mypackage.ROLLINGCOUNT(BagToString($1)) AS rollingCnt;
DUMP C;

OutPut from the Script:

(A,5)
(B,2)
(C,0)

Java Code:
1. Compile the below java code and create jar file name rollingCount.jar
2. I just wrote the code temporarily, you can optimize if required.

ROLLINGCOUNT.java

package mypackage;

import java.io.*;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import java.text.SimpleDateFormat;
import java.util.concurrent.TimeUnit;
import java.util.*;

public class ROLLINGCOUNT extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {

        //Get the input String from request
        String inputString = (String)input.get(0);
        Date[] arrayOfDates = getArrayOfDate(inputString);
        long diffDays[] = getDaysBetweenList(arrayOfDates);
        int rollingCount = getRollingCount(diffDays);

        return rollingCount;
    }

    //Function to convert strings to array of dates
    static protected Date[] getArrayOfDate(String inputString)
    {
        //Get the 1st column, this will be the Id
        String ID = inputString.split("_")[0];

        //Replace all the Ids with Null, bcoz its a duplicate columns
        String modifiedString = inputString.replace(ID+"_","");

        //Split the string into multiple columns using '_' as delimiter
        String list[] = modifiedString.split("_");

        //Convert the string to list of  array dates
        Date[] dateList = new Date[list.length];
        int index=0;
        for (String dateString: list)
        {
            try
            {
                //Convert the date string to date object in the give format
                SimpleDateFormat dFormat = new SimpleDateFormat("yyyy/MM/dd");
                dateList[index++] = dFormat.parse(dateString);
            }
            catch(Exception e)
            {
                // error handling goes here
            }
        }
        return dateList;
    }

    //Function to get difference between two dates
    static protected long[] getDaysBetweenList(Date[] arrayOfDate)
    {
        long diffDays[] = new long[arrayOfDate.length-1];
        int cnt=0;      
        for (int index=0; index<arrayOfDate.length-1;index++)
        {
            long diff = Math.abs(arrayOfDate[index+1].getTime() - arrayOfDate[index].getTime());    
            long days = TimeUnit.DAYS.convert(diff, TimeUnit.MILLISECONDS);
            diffDays[cnt++] = days;
        }
        return diffDays;
    }

    //Function to get the total rolling count   
    static protected int getRollingCount(long diffDays[])
    {
        int result =0;
        for(int index=0;index<diffDays.length;index++)
        {
            int cnt =0;
            //hardcoded the values of 30 and 31 days, may need to handle Feb month 28 or 29 days
            while((index<diffDays.length)&&((diffDays[index]==30)||(diffDays[index]==31)))
            {
                cnt++;
                index++;
            }
            if(cnt>0)
            {
                result = result + cnt+1;
            }       
        }
        return result;
    }
}
Sivasakthi Jayaraman
  • 4,724
  • 3
  • 17
  • 27
  • Hi Siva..Thanks for your reply. Yes dates will come in any range for eg the date what you mentioned in above coment. The requirement is date for eg : 2014/10/01 , 2014/10/02 , 2014/10/03 , 2014/09/01 if we consider the 30 days rolling, it should consider only 3 dates 2014/10/01 , 2014/10/02 , 2014/10/03 – jk7 Nov 17 '14 at 16:52
  • Please check this link to see how to compile and link the jar to script. http://stackoverflow.com/questions/26932995/left-padding-a-string-in-pig – Sivasakthi Jayaraman Nov 18 '14 at 00:33
  • Hi Siva....i compiled and it worked fine, Thanks a lot for your help.You Rocks ! I am trying various combination now. I will shoot you the question, if i have any question. – jk7 Nov 18 '14 at 01:06
  • Hi Shiva...one question..for the above example we are passing the tupple to the UDF and getting back the count. Is it a way,where we can get a list of count back.for eg : right we send the tuple to UDF as A_2014/01/01_A_2014/01/10_A_2014/02/01 and we are getting 3 back from UDF, i am doing further more research, lets say if we send : A_2014/01/01_A_2014/01/10_A_2014/02/01_A_2014/04/01_A_2014_04_10 is there a way can the UDF handle to return the list like 3,2 – jk7 Nov 18 '14 at 04:19