I'm trying to detect a sequence in a column of my Hive table. The table has 3 columns (id, label, index). Each id has a sequence of labels, and index gives the ordering of the labels, like:
id label index
a x 1
a y 2
a x 3
a y 4
b x 1
b y 2
b y 3
b y 4
b x 5
b y 6
I want to identify whether the label sequence x, y, x, y occurs within an id.
I was thinking of using the lead function to accomplish this, like:
select id, index, label,
lead( label, 1) over (partition by id order by index) as l1_fac,
lead( label, 2) over (partition by id order by index) as l2_fac,
lead( label, 3) over (partition by id order by index) as l3_fac
from mytable
yields:
id index label l1_fac l2_fac l3_fac
a 1 x y x y
a 2 y x y NULL
a 3 x y NULL NULL
a 4 y NULL NULL NULL
b 1 x y y y
b 2 y y y x
b 3 y y x y
b 4 y x y NULL
b 5 x y NULL NULL
where l1_fac, l2_fac, and l3_fac are the next three label values. Then I could check for a pattern with:
where label = l2_fac and l1_fac = l3_fac
This will work for id = a, but not for id = b, where the label sequence is x, y, y, y, x, y. I don't care that there were 3 y's in a row; I am only interested in the fact that it went from x to y to x to y.
I'm not sure if this is possible. I tried a combination of group by and partition, but without success.
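To make the intent concrete, here is a rough, untested sketch of the direction I have in mind: first drop every row whose label repeats the previous label for the same id (using lag), then apply the same lead check to what remains. The aliases flagged, deduped, prev_label, and first_index are names I made up for illustration, and I am assuming the window functions are evaluated after the where clause of the same query level.

```sql
-- Sketch only: collapse runs of repeated labels, then look for x, y, x, y.
select id, index as first_index
from (
  -- lead() here sees only the rows that survived the where clause below,
  -- so a run such as y, y, y counts as a single y.
  select id, index, label,
         lead(label, 1) over (partition by id order by index) as l1_fac,
         lead(label, 2) over (partition by id order by index) as l2_fac,
         lead(label, 3) over (partition by id order by index) as l3_fac
  from (
    -- Attach the previous label so repeated labels can be filtered out.
    select id, index, label,
           lag(label, 1) over (partition by id order by index) as prev_label
    from mytable
  ) flagged
  where prev_label is null or prev_label <> label
) deduped
where label = 'x' and l1_fac = 'y' and l2_fac = 'x' and l3_fac = 'y'
```

If I have reasoned about it correctly, on the sample data above id = b collapses to x (index 1), y (index 2), x (index 5), y (index 6), so both a and b should be reported with first_index = 1.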