As a simplified example, I need to select each instance where a customer had a shipping address that was different from their previous shipping address. So I have a large table with columns such as:
purchase_id | cust_id | date | address | description
-----------------------------------------------------------
1 | 5 | jan | address1 | desc1
2 | 6 | jan | address2 | desc2
3 | 5 | feb | address1 | desc3
4 | 6 | feb | address2 | desc4
5 | 5 | mar | address3 | desc5
6 | 5 | mar | address3 | desc6
7 | 5 | apr | address1 | desc7
8 | 6 | may | address4 | desc8
Note that customers can "move back" to a previous address as customer 5 did in row 7.
What I want to select (and as efficiently as possible as this is a quite large table) is the first row out of every 'block' wherein a customer had subsequent orders shipped to the same address. In this example that would be rows 1,2,5,7,and 8. In all the others, the customer has the same address as their previous order.
So effectively I want to first ORDER BY (cust_id, date)
, then SELECT purchase_id, cust_id, min(date), address, description
.
However I'm having trouble because SQL usualy requires GROUP BY
to be done before ORDER BY
. I can't therefore figure out how to adapt e.g. either of the top answers to this question (which I otherwise quite like.) It is necessary (conceptually, at least) to order by date before grouping or using aggregate functions like min()
, otherwise I would miss instances like row 7 in my example table, where a customer 'moved back' to a previous address.
Note also that two customers can share an address, so I need to effectively group by both cust_id
and address
after ordering by date.
I'm using snowflake which I believe has most of the same commands available as recent versions of PostgreSQL and SQL Server (although I'm fairly new to snowflake so not completely sure.)