SQL: order by, then select first row with distinct value for multiple columns

Question

As a simplified example, I need to select each instance where a customer had a shipping address that was different from their previous shipping address. So I have a large table with columns such as:

purchase_id | cust_id | date | address  | description
-----------------------------------------------------------
 1          | 5       | jan  | address1 | desc1
 2          | 6       | jan  | address2 | desc2
 3          | 5       | feb  | address1 | desc3
 4          | 6       | feb  | address2 | desc4
 5          | 5       | mar  | address3 | desc5
 6          | 5       | mar  | address3 | desc6
 7          | 5       | apr  | address1 | desc7
 8          | 6       | may  | address4 | desc8

Note that customers can "move back" to a previous address as customer 5 did in row 7.

What I want to select (as efficiently as possible, since this is quite a large table) is the first row of every 'block' in which a customer had consecutive orders shipped to the same address. In this example that would be rows 1, 2, 5, 7, and 8. In all the other rows, the customer has the same address as on their previous order.

So effectively I want to first ORDER BY (cust_id, date), then SELECT purchase_id, cust_id, min(date), address, description.

However, I'm having trouble because SQL usually requires GROUP BY to be done before ORDER BY. I therefore can't figure out how to adapt, e.g., either of the top answers to this question (which I otherwise quite like). It is necessary (conceptually, at least) to order by date before grouping or using aggregate functions like min(), otherwise I would miss instances like row 7 in my example table, where a customer 'moved back' to a previous address.

Note also that two customers can share an address, so I need to effectively group by both cust_id and address after ordering by date.

I'm using Snowflake, which I believe has most of the same commands available as recent versions of PostgreSQL and SQL Server (although I'm fairly new to Snowflake, so I'm not completely sure).

Do you only want to return purchases for customers with more than one address? — Anthony E, Apr 18 '16 at 03:54
Anthony E: No, I want to return (at least) 1 row for all customers who have ever had an address, and more rows for customers who have changed addresses one or more times. Giorgi Nakeuri: Thanks, should be 1,2,5,7, and 8. (Rows 5 and 7 both have a different address than the last one that customer used.) edited. — DNB, Apr 19 '16 at 01:18

Marcin Zukowski · Answer 1 · 2016-05-01T06:15:44.720

Sorry for a late reply. I meant to react to this post a few days ago.

The "most proper" way I can think of is to use the LAG function.

Take this:

select purchase_id, cust_id, address, 
lag(address, 1) over (partition by cust_id order by purchase_id) prev_address 
from x order by cust_id, purchase_id;
-------------+---------+----------+--------------+
 PURCHASE_ID | CUST_ID | ADDRESS  | PREV_ADDRESS |
-------------+---------+----------+--------------+
 1           | 5       | address1 | [NULL]       |
 3           | 5       | address1 | address1     |
 5           | 5       | address3 | address1     |
 6           | 5       | address3 | address3     |
 7           | 5       | address1 | address3     |
 2           | 6       | address2 | [NULL]       |
 4           | 6       | address2 | address2     |
 8           | 6       | address4 | address2     |
-------------+---------+----------+--------------+

You can then easily detect the rows where the events you described occur:

select purchase_id, cust_id, address, prev_address from (
  select purchase_id, cust_id, address, 
  lag(address, 1) over (partition by cust_id order by purchase_id) prev_address 
  from x 
) sub 
where not equal_null(address, prev_address)
order by cust_id, purchase_id;
-------------+---------+----------+--------------+
 PURCHASE_ID | CUST_ID | ADDRESS  | PREV_ADDRESS |
-------------+---------+----------+--------------+
 1           | 5       | address1 | [NULL]       |
 5           | 5       | address3 | address1     |
 7           | 5       | address1 | address3     |
 2           | 6       | address2 | [NULL]       |
 8           | 6       | address4 | address2     |
-------------+---------+----------+--------------+

Note that I'm using the EQUAL_NULL function to get NULL=NULL semantics.

Note that the LAG function can be computationally intensive, though it is comparable to the ROW_NUMBER approach proposed earlier.

LAG definitely does the job. Though I prefer the approach with adding subgroups, previously I had to use two steps cte: LAG and conditional windowed SUM to achieve it. Now with Snowflake's [CONDITIONAL_CHANGE_EVENT](https://stackoverflow.com/a/68603272/5070879) it is so easy :) — Lukasz Szozda, Jul 31 '21 at 15:09
Well I would use my "overused" friend QUALIFY here to avoid the sub-query, but today I learned EQUAL_NULL — Simeon Pilgrim, Mar 19 '22 at 22:12
I find LAG/LEAD to be among the cheapest window functions, so I disagree that it is computationally expensive, especially compared with ROW_NUMBER or self joins. — Simeon Pilgrim, Mar 19 '22 at 22:34

Giorgi Nakeuri · Answer 2 · 2016-04-18T05:56:12.400

You can use the row_number window function to do the trick:

;with cte as(select *, row_number() over(partition by cust_id, address
                                         order by purchase_id) as rn from table)
select * from cte 
where rn = 1

edited Apr 18 '16 at 05:56

answered Apr 18 '16 at 04:16

Thanks Giorgi Nakeuri, that works. I knew about row_number() but did not realize that I could partition over multiple fields and get the desired result. – DNB Apr 19 '16 at 01:20
I'm not sure this is a correct answer. It doesn't find row 7, because it has the same cust_id and address. – Marcin Zukowski May 01 '16 at 06:14
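
Marcin's point can be checked against the question's sample rows. The sketch below is not Snowflake. It assumes Python's built-in sqlite3 module (SQLite 3.25 or newer, for window functions) and a hypothetical table named purchases, and runs the same ROW_NUMBER partitioning on them.

```python
import sqlite3

# Sample rows from the question: (purchase_id, cust_id, address).
ROWS = [
    (1, 5, "address1"), (2, 6, "address2"), (3, 5, "address1"),
    (4, 6, "address2"), (5, 5, "address3"), (6, 5, "address3"),
    (7, 5, "address1"), (8, 6, "address4"),
]

# "purchases" is a hypothetical table name used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (purchase_id INT, cust_id INT, address TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", ROWS)

# First row per (cust_id, address) pair, as in the answer above.
first_per_address = [row[0] for row in conn.execute("""
    SELECT purchase_id
    FROM (
        SELECT purchase_id,
               ROW_NUMBER() OVER (PARTITION BY cust_id, address
                                  ORDER BY purchase_id) AS rn
        FROM purchases
    ) AS numbered
    WHERE rn = 1
    ORDER BY purchase_id
""")]

print(first_per_address)
```

This prints [1, 2, 5, 8]. Row 7 lands in the same (5, address1) partition as rows 1 and 3 and gets rn = 3, so the move back to a previous address is lost.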

Lukasz Szozda · Answer 3 · 2021-08-01T08:47:51.620

Snowflake has introduced CONDITIONAL_CHANGE_EVENT, which is ideally suited to the case described:

Returns a window event number for each row within a window partition when the value of the argument expr1 in the current row is different from the value of expr1 in the previous row. The window event number starts from 0 and is incremented by 1 to indicate the number of changes so far within that window
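
On engines without this function, the numbering described in the quote can be approximated with a two-step LAG plus running SUM. This is a minimal sketch, not Snowflake's implementation. It assumes Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in engine, a hypothetical table t holding the question's rows with ISO dates, and purchase_id added as a tie-breaker so that rows sharing a date are ordered deterministically. How Snowflake itself treats ties and NULLs is not modelled here.

```python
import sqlite3

# Question's rows with months written as ISO dates:
# (purchase_id, cust_id, date, address).
ROWS = [
    (1, 5, "2021-01-01", "address1"), (2, 6, "2021-01-01", "address2"),
    (3, 5, "2021-02-01", "address1"), (4, 6, "2021-02-01", "address2"),
    (5, 5, "2021-03-01", "address3"), (6, 5, "2021-03-01", "address3"),
    (7, 5, "2021-04-01", "address1"), (8, 6, "2021-05-01", "address4"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (purchase_id INT, cust_id INT, date TEXT, address TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", ROWS)

query = """
WITH lagged AS (
    -- Step 1: fetch the previous address and the row's position per customer.
    SELECT purchase_id, cust_id, date, address,
           LAG(address) OVER (PARTITION BY cust_id
                              ORDER BY date, purchase_id) AS prev_address,
           ROW_NUMBER() OVER (PARTITION BY cust_id
                              ORDER BY date, purchase_id) AS rn
    FROM t
),
flagged AS (
    -- Flag a change on every row except the customer's first one.
    -- IS NOT is SQLite's NULL-safe inequality.
    SELECT *,
           CASE WHEN rn > 1 AND (address IS NOT prev_address)
                THEN 1 ELSE 0 END AS changed
    FROM lagged
)
-- Step 2: a running total of the flags numbers the blocks from 0.
SELECT purchase_id,
       SUM(changed) OVER (PARTITION BY cust_id
                          ORDER BY date, purchase_id
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cce
FROM flagged
ORDER BY purchase_id
"""

cce_by_purchase = dict(conn.execute(query).fetchall())
print(cce_by_purchase)
```

For customer 5 the event number runs 0, 0, 1, 1, 2 across purchases 1, 3, 5, 6, 7, so the return to address1 in row 7 opens a new block instead of rejoining the first one.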

Data preparation:

CREATE OR REPLACE TABLE t(purchase_id INT, cust_id INT,
                          date DATE, address TEXT, description TEXT);

INSERT INTO t(purchase_id, cust_id, date, address, description)
VALUES 
 ( 1, 5, '2021-01-01'::DATE ,'address1','desc1')
,( 2, 6, '2021-01-01'::DATE ,'address2','desc2')
,( 3, 5, '2021-02-01'::DATE ,'address1','desc3')
,( 4, 6, '2021-02-01'::DATE ,'address2','desc4')
,( 5, 5, '2021-03-01'::DATE ,'address3','desc5')
,( 6, 5, '2021-03-01'::DATE ,'address3','desc6')
,( 7, 5, '2021-04-01'::DATE ,'address1','desc7')
,( 8, 6, '2021-05-01'::DATE ,'address4','desc8');

Query:

SELECT *, 
 CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
ORDER BY purchase_id;

Once the subgroup column CCE is identified, QUALIFY can be used to find the first row for each (CUST_ID, CCE) pair.

Full query:

WITH cte AS (
 SELECT *,
  CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
 FROM t
)
SELECT *
FROM  cte
QUALIFY ROW_NUMBER() OVER(PARTITION BY CUST_ID, CCE ORDER BY DATE) = 1
ORDER BY purchase_id;

Anthony E · Answer 4 · 2016-04-18T04:07:14.757

This would probably be best solved by a subquery to get the first purchase for each user, then using IN to filter rows based on that result.

To clarify, purchase_id is an autoincrement column, correct? If so, a purchase with a higher purchase_id must have been created at a later date, and the following should suffice:

SELECT *
FROM purchases
WHERE purchase_id IN (
  SELECT MIN(purchase_id) AS first_purchase_id
  FROM purchases
  GROUP BY cust_id
)

If you only want the first purchase for customers with more than one address, add a HAVING clause to your subquery:

SELECT *
FROM purchases
WHERE purchase_id IN (
  SELECT MIN(purchase_id) AS first_purchase_id
  FROM purchases
  GROUP BY cust_id
  HAVING COUNT(DISTINCT address) > 1
)

Fiddle: http://sqlfiddle.com/#!9/12d75/6

However, if purchase_id is NOT an autoincrement column, then SELECT on both cust_id and min(date) on your subquery and use an INNER JOIN on cust_id and min(date):

SELECT *
FROM purchases
INNER JOIN (
  SELECT cust_id, MIN(date) AS min_date
  FROM purchases
  GROUP BY cust_id
  HAVING COUNT(DISTINCT address) > 1
) cust_purchase_date
ON purchases.cust_id = cust_purchase_date.cust_id AND purchases.date = cust_purchase_date.min_date

The first query example will probably be faster, however, so use that if your purchase_id is an autoincrement column.

Thanks Anthony, however this does not return each subsequent 'new' address for each customer. I don't just want each user's first purchase; I want *each* first purchase which has a shipping address different from the previous shipping address. — DNB, Apr 19 '16 at 01:24

score 0 · Answer 5 · answered Mar 19 '22 at 22:31

Yet more late options/opinions:

Given that this is edge detection, LAG/LEAD (depending on which edge you are looking for) is the simplest tool.

Marcin's LAG option can be moved from a sub-select to the top level with QUALIFY.

NOT EQUAL_NULL adds value wherever a NULL is involved. For a customer's first row LAG returns NULL, and an ordinary comparison against NULL yields NULL rather than TRUE or FALSE, so the row would be filtered out. EQUAL_NULL is a NULL-safe compare that always returns TRUE or FALSE, so negating it behaves as expected, even when the address itself is NULL.

SELECT * 
FROM data_table 
QUALIFY not equal_null(address, lag(address) over(partition by cust_id order by purchase_id))
ORDER BY 1

giving:

PURCHASE_ID	CUST_ID	DATE	ADDRESS	DESCRIPTION
1	5	2021-01-01	address1	desc1
2	6	2021-01-01	address2	desc2
5	5	2021-03-01	address3	desc5
7	5	2021-04-01	address1	desc7
8	6	2021-05-01	address4	desc8
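
To see what the NULL-safe comparison buys, here is a small sketch that contrasts an ordinary <> with a NULL-safe inequality on the same rows. It is an illustration under assumptions, not Snowflake code: it uses Python's built-in sqlite3 module (SQLite 3.25 or newer), SQLite's IS NOT operator as a stand-in for NOT EQUAL_NULL, a sub-select because SQLite has no QUALIFY, and the table name data_table from the query above.

```python
import sqlite3

# Sample rows from the question: (purchase_id, cust_id, address).
ROWS = [
    (1, 5, "address1"), (2, 6, "address2"), (3, 5, "address1"),
    (4, 6, "address2"), (5, 5, "address3"), (6, 5, "address3"),
    (7, 5, "address1"), (8, 6, "address4"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_table (purchase_id INT, cust_id INT, address TEXT)")
conn.executemany("INSERT INTO data_table VALUES (?, ?, ?)", ROWS)

# Shared sub-select: each row together with the customer's previous address.
LAGGED = """
    SELECT purchase_id, address,
           LAG(address) OVER (PARTITION BY cust_id
                              ORDER BY purchase_id) AS prev_address
    FROM data_table
"""

# Ordinary inequality: comparing with a NULL prev_address yields NULL,
# so each customer's first row is silently filtered out.
plain = [row[0] for row in conn.execute(
    f"SELECT purchase_id FROM ({LAGGED}) AS sub "
    "WHERE address <> prev_address ORDER BY purchase_id")]

# NULL-safe inequality: always TRUE or FALSE, so first rows are kept.
null_safe = [row[0] for row in conn.execute(
    f"SELECT purchase_id FROM ({LAGGED}) AS sub "
    "WHERE address IS NOT prev_address ORDER BY purchase_id")]

print(plain)
print(null_safe)
```

The ordinary comparison returns [5, 7, 8], while the NULL-safe one returns [1, 2, 5, 7, 8], matching the rows the question asks for.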

Lukasz's CONDITIONAL_CHANGE_EVENT is a very nice solution. CONDITIONAL_CHANGE_EVENT does not just find a change edge, it enumerates them, so if you were looking for the 5th change, or similar, it saves you from chaining LAG/LEAD with ROW_NUMBER(). However, you cannot collapse that solution into a single block like:

 ROW_NUMBER() OVER(PARTITION BY CUST_ID, CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) ORDER BY DATE) = 1

because the implicit row_number inside CONDITIONAL_CHANGE_EVENT generates the error:

Window function x may not be nested inside another window function.
