Hive collect_set() but to Remove Consecutive Duplicates

Question

I want to remove consecutive duplicates in an array when using hive.

collect_list() keeps all duplicates, while collect_set() only keeps distinct entries. I kind of need something in the middle ground.

For example, from the below table:

id  |  number
==============
fk        4
fk        4
fk        2
4f        1
4f        8
4f        8
h9        7
h9        4
h9        7

I would like to get something like this:

id | aggregate
===========================
fk   Array<int>(4,2)
4f   Array<int>(1,8)
h9   Array<int>(7,4,7)

The problem is that there is no column in your dataset which determines the order of rows. What makes h9,7 and h9,4 consecutive? Table is not ordered by definition. Only order by guarantees the required order of rows. Read also this: https://stackoverflow.com/a/47416027/2700344 — leftjoin, May 04 '19 at 01:04
great point! forgot to mention that this is already ordered by timestamp. — penguinmom, May 04 '19 at 01:57
You have one more timestamp column, right? without this column the problem is unsalvageable — leftjoin, May 05 '19 at 11:40

leftjoin · Answer 1 · 2019-10-22T12:26:13.200

Use lag() analytic function to get previous number and compare with current number to check for consecutive numbers.

Demo:

with your_table as (--replace this subquery with your table
select stack(11, --the number of tuples
'fk',4,'2019-01-01 10:10:10.123',
'fk',4,'2019-01-01 10:10:10.124',
'fk',2,'2019-01-01 10:10:10.125',
'4f',1,'2019-01-01 10:10:10.126',
'4f',8,'2019-01-01 10:10:10.127',
'4f',8,'2019-01-01 10:10:10.128',
'h9',7,'2019-01-01 10:10:10.129',
'h9',4,'2019-01-01 10:10:10.130',
'h9',7,'2019-01-01 10:10:10.131',
'h9',7,'2019-01-01 10:10:10.132',
'h9',7,'2019-01-01 10:10:10.133'
) as (id, number, order_ts)
) --replace this subquery with your table

select id, collect_list(case when number = lag_number then null else number end) as aggregate
  from 
      (select id, number, order_ts,
              lag(number) over (partition by id order by order_ts) lag_number
         from your_table 
       distribute by id sort by order_ts
      )s         
  group by id;

Result:

id  aggregate   
4f  [1,8]   
fk  [4,2]   
h9  [7,4,7]

Hive collect_set() but to Remove Consecutive Duplicates

1 Answers1