0

Hi, I want to know how to remove consecutive duplicates from a query result based on a single column. In this case, null means the buyer didn't want to buy anything and clicked the back button. I want the history of item changes across buyers, keeping the earliest buying info for each change.

select item, buyer, buy_date from item order by buy_date

item           buyer           buy_date
null           Sam             04/24/2016
Lipstick       Anna            05/31/2016
Charger        Tim             06/01/2016
Charger        James           06/03/2016
null           Tim             06/03/2016
null           James           06/04/2016
Nail Polish    Sarah           06/04/2016

Above is a sample of the test results.

My new results should be:

item           buyer           buy_date
null           Sam             04/24/2016
Lipstick       Anna            05/31/2016
Charger        Tim             06/01/2016
null           Tim             06/03/2016
Nail Polish    Sarah           06/04/2016

Keep only the first of the consecutive duplicates. I am not deleting any records. I am just filtering the results so that consecutive duplicates are removed.

  1. How would I do this in generic ANSI SQL?
  2. If it is not possible, is there a way to do it in the three major SQL vendors?
Chun ping Wang
  • Tag your question with the database you are actually using. Or give a really good reason why you need code in multiple databases. – Gordon Linoff Jun 05 '16 at 02:37

3 Answers

3

You can do this with ANSI standard window functions. One method uses lag() to get the previous item. It then sums up the number of times when the item changes and uses this information to find the first row in each group:

select r.*
from (select r.*, row_number() over (partition by grp order by buy_date) as seqnum
      from (select r.*,
                   sum(case when prev_item = item then 0 else 1 end) over (order by buy_date) as grp
            from (select r.*, lag(item) over (order by buy_date) as prev_item
                  from results r
                 ) r
           ) r
     ) r
where seqnum = 1;
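
To make the `grp` step concrete, here is a minimal sketch using Python's standard `sqlite3` module (it assumes the bundled SQLite is 3.25 or newer, which is needed for window functions). The sample rows come from the question; the ISO date strings and the `id` column are assumptions added for illustration, with `id` serving only as a tie-breaker because two pairs of rows share a `buy_date`.

```python
import sqlite3

# Sample rows from the question; `id` is a hypothetical tie-breaker column.
rows = [
    (1, None, "Sam", "2016-04-24"),
    (2, "Lipstick", "Anna", "2016-05-31"),
    (3, "Charger", "Tim", "2016-06-01"),
    (4, "Charger", "James", "2016-06-03"),
    (5, None, "Tim", "2016-06-03"),
    (6, None, "James", "2016-06-04"),
    (7, "Nail Polish", "Sarah", "2016-06-04"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table results (id integer, item text, buyer text, buy_date text)"
)
conn.executemany("insert into results values (?, ?, ?, ?)", rows)

# Running count of "item changed" flags, as in the query above.
grp_query = """
select item,
       sum(case when prev_item = item then 0 else 1 end)
           over (order by buy_date, id) as grp
from (select r.*, lag(item) over (order by buy_date, id) as prev_item
      from results r) r
order by buy_date, id
"""
groups = conn.execute(grp_query).fetchall()
for item, grp in groups:
    print(item, grp)
```

The two Charger rows share a group, so `seqnum = 1` keeps only the first of them. The two consecutive NULL rows land in separate groups, because `NULL = NULL` is not true in SQL.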

Hmmm. I overthought that one. You only need lag():

select r.*
from (select r.*, lag(item) over (order by buy_date) as prev_item
      from results r
     ) r
where prev_item is null or prev_item <> item;
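
If NULL items should be treated like any other value (so a run of NULLs collapses to its first row, as in the question's expected output), the comparison has to be NULL-safe. The following is only a sketch of one way to do that, run through Python's standard `sqlite3` module (assumes SQLite 3.25+). It uses SQLite's `IS NOT` operator, for which the ANSI spelling is `IS DISTINCT FROM`, and a `row_number()` check so the very first row is always kept. The `id` tie-breaker column and the ISO date strings are assumptions, not part of the original schema.

```python
import sqlite3

# Sample rows from the question; `id` is a hypothetical tie-breaker column.
rows = [
    (1, None, "Sam", "2016-04-24"),
    (2, "Lipstick", "Anna", "2016-05-31"),
    (3, "Charger", "Tim", "2016-06-01"),
    (4, "Charger", "James", "2016-06-03"),
    (5, None, "Tim", "2016-06-03"),
    (6, None, "James", "2016-06-04"),
    (7, "Nail Polish", "Sarah", "2016-06-04"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table results (id integer, item text, buyer text, buy_date text)"
)
conn.executemany("insert into results values (?, ?, ?, ?)", rows)

# Keep the first row, plus every row whose item differs from the previous
# row's item. `IS NOT` treats NULL as an ordinary, comparable value.
query = """
select item, buyer, buy_date
from (select r.*,
             lag(item) over (order by buy_date, id) as prev_item,
             row_number() over (order by buy_date, id) as rn
      from results r) r
where rn = 1 or item is not prev_item
order by buy_date, id
"""
deduped = conn.execute(query).fetchall()
for row in deduped:
    print(row)
```

On the sample data this returns the five rows listed in the question: Sam, Anna, Tim (Charger), Tim (null), and Sarah.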
Gordon Linoff
  • Hi, actually there is a problem. For nulls, this won't work. So if the first three items are null, it won't take the first one. (null means not bought or ignored) – Chun ping Wang Jun 07 '16 at 14:04
  • @ChunpingWang . . . Actually, I think it should take all three `NULL` items. What is happening exactly? It might be easier to explain in another question. – Gordon Linoff Jun 08 '16 at 01:25
  • Actually, I added the following: `item is null and prev_item is not null or item is not null and prev_item is null or (item is null and prev_item is null and buy_date = (select min(buy_date) from item where item is null and prev_item is null))` – Chun ping Wang Jun 08 '16 at 04:47
0

Though I may not be understanding the question, I would say the simplest way to pull only one unique item based on its name is to just use DISTINCT on that field. So the new query would look like this:

select DISTINCT(item), buyer, buy_date from item order by buy_date

EDIT: Never mind. I see that the question is specifically about consecutive results. See Gordon's response.

Note that MySQL before version 8.0 doesn't have a lag() function, but I did manage to find this post on simulating one: Simulate lag function in MySQL
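
For engines without `lag()`, one way to look up the previous row is a correlated subquery. Below is a minimal sketch of that idea, run through Python's standard `sqlite3` module so it can be tried without a server. Several things here are assumptions: the `results` table name, the `id` tie-breaker column, ISO date strings, SQLite's NULL-safe `IS NOT` operator (MySQL's counterpart would be `NOT (a <=> b)`), and `LIMIT`, which is not ANSI SQL.

```python
import sqlite3

# Sample rows from the question; `id` is a hypothetical tie-breaker column.
rows = [
    (1, None, "Sam", "2016-04-24"),
    (2, "Lipstick", "Anna", "2016-05-31"),
    (3, "Charger", "Tim", "2016-06-01"),
    (4, "Charger", "James", "2016-06-03"),
    (5, None, "Tim", "2016-06-03"),
    (6, None, "James", "2016-06-04"),
    (7, "Nail Polish", "Sarah", "2016-06-04"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table results (id integer, item text, buyer text, buy_date text)"
)
conn.executemany("insert into results values (?, ?, ?, ?)", rows)

# The first subquery fetches the item of the immediately preceding row;
# the NOT EXISTS branch keeps the very first row, which has no predecessor.
query = """
select r.item, r.buyer, r.buy_date
from results r
where r.item is not (
        select p.item
        from results p
        where p.buy_date < r.buy_date
           or (p.buy_date = r.buy_date and p.id < r.id)
        order by p.buy_date desc, p.id desc
        limit 1)
   or not exists (
        select 1
        from results p
        where p.buy_date < r.buy_date
           or (p.buy_date = r.buy_date and p.id < r.id))
order by r.buy_date, r.id
"""
no_lag = conn.execute(query).fetchall()
for row in no_lag:
    print(row)
```

This trades the window function for one extra lookup per row, so it is likely to be slower on large tables unless the ordering columns are indexed.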

Good luck!

Tom Mulkins
0

This is a generic query which can be used on any database that doesn't support lag or partition by (like MySQL before 8.0).

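select item, buyer, buy_date
from (
    -- cnt = number of rows with the same item and an equal or later buy_date
    select t1.item,
           max(t1.buyer) as buyer,
           max(t1.buy_date) as buy_date,
           count(*) as cnt
    from myTable t1
    inner join myTable t2
        on t1.item = t2.item
       and t1.buy_date <= t2.buy_date
    group by t1.item, t1.buy_date
) t  -- derived-table alias, required by MySQL
where cnt = 1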

The inner query counts, for each row, the rows with the same item and an equal or later buy_date, so a count of 1 marks the latest record for each item. You can change the join condition to t1.buy_date >= t2.buy_date if you want the earliest record instead.

The inner query will give you output like below.

+-------------+-------+----------------------+-----+
|    ITEM     | BUYER |       BUY_DATE       | CNT |
+-------------+-------+----------------------+-----+
| Lipstick    | Anna  | 31-MAY-2016 00:00:00 |   1 |
| Charger     | James | 03-JUN-2016 00:00:00 |   1 |
| Charger     | Tim   | 01-JUN-2016 00:00:00 |   2 |
| Nail Polish | Sarah | 04-JUN-2016 00:00:00 |   1 |
+-------------+-------+----------------------+-----+

Now, in the outer query, filter on cnt = 1 to get only distinct records.
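
To check the behaviour on the question's sample rows, here is a minimal sketch that runs the same self-join through Python's standard `sqlite3` module. The ISO date strings and the final `order by` are assumptions added for a deterministic result; `myTable` is the table name used in the query above.

```python
import sqlite3

# Sample rows from the question (item, buyer, buy_date).
rows = [
    (None, "Sam", "2016-04-24"),
    ("Lipstick", "Anna", "2016-05-31"),
    ("Charger", "Tim", "2016-06-01"),
    ("Charger", "James", "2016-06-03"),
    (None, "Tim", "2016-06-03"),
    (None, "James", "2016-06-04"),
    ("Nail Polish", "Sarah", "2016-06-04"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table myTable (item text, buyer text, buy_date text)")
conn.executemany("insert into myTable values (?, ?, ?)", rows)

# Same self-join as above: cnt counts rows with the same item and an
# equal or later buy_date, so cnt = 1 selects the latest row per item.
query = """
select item, buyer, buy_date
from (
    select t1.item,
           max(t1.buyer) as buyer,
           max(t1.buy_date) as buy_date,
           count(*) as cnt
    from myTable t1
    inner join myTable t2
        on t1.item = t2.item
       and t1.buy_date <= t2.buy_date
    group by t1.item, t1.buy_date
) t
where cnt = 1
order by buy_date
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

On this data the query returns one row per non-NULL item (Lipstick/Anna, Charger/James, Nail Polish/Sarah). Rows with a NULL item do not appear, since `t1.item = t2.item` is never true when either side is NULL.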

Utsav
  • Wouldn't you get into trouble using `max(t1.buyer)` if the lexicographically max buyer name does not coincide with the max buy date? – Tim Biegeleisen Jun 05 '16 at 05:27
  • This is just to bypass the rule `In a group by, if a column is used in select, you have to use an aggregate function on it, or use it in group by`. It will not change the output in any way. – Utsav Jun 05 '16 at 06:29