detect sequence in hive column with lead function

Question

I'm trying to detect a sequence in a column of my hive table. I have 3 columns (id, label, index). Each id has a sequence of labels and index is the ordering of the labels, like

id  label   index
a   x   1
a   y   2
a   x   3
a   y   4
b   x   1
b   y   2
b   y   3
b   y   4
b   x   5
b   y   6

I want to identify if the label sequence of x,y,x,y occurs.
I was thinking of trying a lead function to accomplish this like:

select id, index, label,
lead( label, 1) over (partition by id order by index) as l1_fac,
lead( label, 2) over (partition by id order by index) as l2_fac,
lead( label, 3) over (partition by id order by index) as l3_fac
from mytable

yields:

id  index  label  l1_fac  l2_fac  l3_fac
a  1  x  y  x  y
a  2  y  x  y  NULL
a  3  x  y  NULL  NULL
a  4  y  NULL  NULL  NULL
b  1  x  y  y  y
b  2  y  y  y  x
b  3  y  y  x  y
b  4  y  x  y  NULL
b  5  x  y  NULL  NULL

where l1(2,3) are the next label values. Then I could check for a pattern with

where label = l2_fac and l1_fac = l3_fac

This will work for id = a, but not id = b where the label sequence is: x, y, y, y, y, x. I don't care that it was 3 y's in a row I am just interested that it went from x to y to x to y.

I'm not sure if this is possible, I was trying a combination of group by and partition, but not successful.

Do you care *when* the sequence `xyxy` occurs? i.e. at what `index` it occurs? or do you just want to know that it occured *somewhere* for a given `id`? — o-90, Jul 22 '15 at 03:28

score 0 · Answer 1 · edited May 23 '17 at 10:26

I answered this question where the OP wanted to collect items to a list and remove any repeating items. I think this is essentially what you want to do. This would extract actual xyxy sequences and also would account for your second example where xyxy occurs, but is clouded by 2 extra ys. You need to collect the label column to an array using this UDAF -- this will preserve the order -- then use the UDF I referenced, then you can use concat_ws to make the contents of this array a string, and lastly, check that string for the occurrence of your desired sequence. The function instr will spit out the location of the first occurrence and zero if it never finds the string.

Query:

add jar /path/to/jars/brickhouse-0.7.1.jar;
add jar /path/to/other/jar/duplicates.jar;

create temporary function remove_seq_dups as 'com.something.RemoveSequentialDuplicates';
create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';

select id, label_string, instr('xyxy', label_string) str_flg
from (
  select id, concat_ws('', no_dups) label_string
  from (
    select id, remove_seq_dups(label_array) no_dups
    from (
        select id, collect(label) label_array
        from db.table
        group by id ) x
        ) y
     ) z

Output:

id   label_string    str_flg
============================
a    xyxy            1
b    xyxy            1

A better alternative might be to simply collect label with the UDF, make it a string, and then regex out the sequence xyxy but I'm pretty terrible at regex so possibly someone else can comment intelligently on this.

I was hoping for a non-udf answer, but I will try it. Thanks. — boxl, Jul 31 '15 at 00:26

detect sequence in hive column with lead function

1 Answers1