1

I am trying to find the first 3 timestamps for each customer.

Sample data for table customer_orders:

customer_id timestamp
6778 '2022-01-01'
6778 '2022-02-05'
5544 '2022-04-01'
6778 '2022-02-04'
5544 '2022-04-03'
5544 '2022-04-02'
5544 '2022-01-01'
6778 '2021-01-01'

Desired outcome:

customer_id timestamp
5544 '2022-01-01'
5544 '2022-04-01'
5544 '2022-04-02'
6778 '2021-01-01'
6778 '2022-01-01'
6778 '2022-04-02'

My query so far:

SELECT
    customer_id, 
    timestamp
FROM customer_orders
GROUP BY customer_id, timestamp
ORDER BY timestamp ASC
LIMIT 3

LIMIT 3 limits to 3 rows overall. But I want 3 rows per customer.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Dizz
  • 43
  • 6

2 Answers2

1

You can use the window function ROW_NUMBER() to numerate the columns of a given PARTITION (customer_id in your case) within a CTE and in the outer query just filter the n records from this generated column:

WITH j AS (
  SELECT customer_id, timestamp,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY timestamp
                       RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS n
  FROM customer_orders
) 
SELECT customer_id, timestamp FROM j
WHERE n <= 3 
ORDER BY customer_id, timestamp

Demo: db<>fiddle

Jim Jones
  • 18,404
  • 3
  • 35
  • 44
0

@Jim provided a valid solution. But there are (not so) subtle performance details. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the default window frame, spelling it out is just noise. The manual:

The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Not worth another answer, yet. But as row_number() operates on rows by definition, it's more efficient to use ROWS mode:

SELECT customer_id, timestamp
FROM  (
   SELECT row_number() OVER (PARTITION BY customer_id ORDER BY timestamp
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rn
        , customer_id, timestamp
   FROM   customer_orders
   ) sub
WHERE  rn <= 3 
ORDER  BY 1, 2;

Either query is much faster with an index(-only) scan on this index:

CREATE UNIQUE INDEX ON customer_orders (customer_id, timestamp);

I ran extensive tests on Postgres 13 and 14 with ROWS vs. RANGE mode, and ROWS is consistently ~ 20% faster with an index-only scan, and around 5 - 10 % without index. (Higher fixed cost lowers the percentage.) Quite a revelation for one of the most commonly used window functions! :)

I reported the issue, and the upcoming Postgres 16 got a fix! See:


That said, if your table is big and there are many rows per customer, a different query style is much faster, yet. I am talking orders of magnitude. We need the same index as above.

Ideally, you have a table customers with exactly one row per relevant customer_id. If you don't have it, create it. Then:

SELECT c.customer_id, o.timestamp
FROM   customers c
CROSS  JOIN LATERAL (
   SELECT timestamp
   FROM   customer_orders o
   WHERE  o.customer_id = c.customer_id
   ORDER  BY o.timestamp
   LIMIT  3
   ) o
ORDER  BY 1, 2;

db<>fiddle here

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228