0

I'm working with a Postgres table named orders that looks like this:

user_id   product       order_date
1         pants         7/1/2022
2         shirt         6/1/2022
1         socks         3/17/2023
3         pants         2/17/2023
4         shirt         3/13/2023
2         pants         8/15/2022
1         hat           4/15/2022
5         hat           3/14/2023
2         socks         12/3/2022
3         shirt         4/15/2023
4         socks         1/15/2023
4         pants         4/19/2023
5         shirt         5/2/2023
5         belt          5/15/2023


Here is a dB Fiddle with the data: https://www.db-fiddle.com/f/uNGjP7gpKwdPGrJ7XmT7k3/2

I output a table that shows the sequence of a customer's orders:

user_id   first_order   second_order    third_order
1         hat           pants           socks
2         shirt         pants           socks
3         pants         shirt           <null>
4         socks         shirt           pants
5         hat           shirt           belt            

So, for example, customer 1 first purchased a hat, then purchased pants, and finally purchased socks.

I'd like to set some sort of indicator at the row level that tells me whether a particular customer purchased one product before they purchased another product. For example, I'd like to indicate whether a customer purchased a shirt before they purchased pants.

The desired output would look like this:

user_id   first_order   second_order    third_order     shirt_before_pants
1         hat           pants           socks           false
2         shirt         pants           socks           true
3         pants         shirt           <null>          false
4         socks         shirt           pants           true
5         hat           shirt           belt            false

Is there a way to get the relative position of a given value at the row level?

Thanks for your help, -Rachel

crimson
  • 39
  • 3
  • 1
    is it guaranteed that the report will contain only 3 columns? – Salman A May 15 '23 at 20:23
  • No, there is no guarantee that it will be limited to 3 columns. It could be 10+ columns. I just used 3 for this minimally reproducible example. – crimson May 15 '23 at 20:34
  • What is the maximum number of order columns you want to display? Do you actually need to display separate columns per order? SQL is rather rigid with the ROW type of the output. Isn't a single array column for any number of orders the better choice? Also, do you maybe only need the ones that pass your filter (shirt before pants). You can have *that* a lot cheaper if only a fraction buys either shirt or pants. Also, please always disclose your version of Postgres. – Erwin Brandstetter May 15 '23 at 21:47
  • Also: Do you mean "immediately before", or any time? Can there be other orders in between? And do you have a separate table holding one row per relevant user, like `users`? That would help to make the query fast if you just want the first 3 orders while there can be lots for each user ... – Erwin Brandstetter May 15 '23 at 22:02

4 Answers4

2

We can enumerate the orders of each customer with row_number(), then use conditional aggregation to generate the new columns. To check if a product was bought before another, we can compare the minimum order date of both products:

select user_id,
    max(product) filter(where rn = 1) product_1,
    max(product) filter(where rn = 2) product_2,
    max(product) filter(where rn = 3) product_3,
    ( 
          min(order_date) filter(where product = 'shirt') 
        < min(order_date) filter(where product = 'pants')
    ) shirt_before_pants
from (
    select o.*, row_number() over(partition by user_id order by order_date) rn
    from orders o
) o
group by user_id
        
GMB
  • 216,147
  • 25
  • 84
  • 135
1

If ...

  • ... "before" is supposed to mean "immediately before", no other order in between
  • ... an array of products instead of separate columns for each product is acceptable
  • ... you have a separate "users" table
SELECT o.*
FROM   users u
CROSS  JOIN LATERAL (
   SELECT o.user_id
        , array_agg(o.product) AS products
        , bool_or(o.combo) AS shirt_before_pants
   FROM  (
      SELECT o.user_id, o.product::text
           , o.product = 'pants' AND lag(o.product) OVER (ORDER BY o.order_date) = 'shirt' AS combo
      FROM   orders o
      WHERE  o.user_id = u.user_id
      ORDER  BY o.order_date
      LIMIT  3  -- cutoff
      ) o
   GROUP  BY 1
   ) o
ORDER  BY u.user_id;

fiddle

The beauty of it: only change the LIMIT for a different number of orders in your request. And only change 'pants' ans 'shirt' in one place.

Products in the output array are sorted, because of the sort in the subquery. See:

The query performs well for a big table with many orders per user if you have an index on orders(user_id, order_date) or, even better, orders(user_id, order_date) INCLUDE (product).

If you don't have a users table (you should have one), create it like this:

CREATE TABLE users AS
SELECT DISTINCT user_id
FROM   orders
ORDER  BY user_id;  -- optional

Or read here for a faster way:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

This method uses the window function ROW_NUMBER (DENSE_RANK also can work), which assigns a row number to each row aggregated by the user_id. To determine whether the shirt was purchased before the pants, we can compare the generated row_ids of those products :

With cte as (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date) AS rn
  FROM orders
)
select user_id, max(case when rn = 1 then product end) as first_order,
                max(case when rn = 2 then product end) as second_order,
                max(case when rn = 3 then product end) as third_order,
                MAX(case when product = 'shirt' then rn end) 
                < MAX(case when product = 'pants' then rn end) as shirt_before_pants
from cte
GROUP BY user_id;
SelVazi
  • 10,028
  • 2
  • 13
  • 29
0

The array_position function may be helpful here:

WITH 

first_orders AS (
  SELECT "user_id", "product", MIN("order_date") AS "order_date"
  FROM "orders"
  GROUP BY "user_id", "product"),

product_arrays AS (
  SELECT "user_id", 
    array_agg(product ORDER BY order_date) AS "products"
  FROM first_orders
  GROUP BY "user_id")
  
SELECT * 
FROM product_arrays
WHERE array_position(products, 'shirt') 
         < array_position(products, 'pants')

Or perhaps the following would work just as well:

WITH 

product_arrays AS (
  SELECT "user_id", 
    array_agg(product ORDER BY order_date) AS "products"
  FROM orders
  GROUP BY "user_id")
  
SELECT * 
FROM product_arrays
WHERE array_position(products, 'shirt') 
         < array_position(products, 'pants')
Ian Gow
  • 3,098
  • 1
  • 25
  • 31