This is a combination of a "greatest-per-group" problem, followed by "pivoting".
General assumptions:
All involved columns are defined NOT NULL
- or there are problem with sorting and concatenation.
(customer, fruidType, purchaseDate)
is unique - or you need to define rules how to break ties.
You want just two fruits in the result.
The basic problem in your query: you only want customer
in the GROUP BY
clause, since you want one row per customer. Not customer, fruitid
which produces one row per combination of customer
and fruitid
.
But there is no built-in aggregate function to retrieve, in a single step,
the fruitID
from the same row that also holds the earliest purchaseDate
for each (customer, fruidType)
.
You could make your query work by appending the fruitID
(instead of prepending) because the concatenated text
would still sort with the earliest date first, but that's very ugly and needlessly slow:
SELECT customer
, min(purchaseDate || '-' || fruitId) FILTER (WHERE fruitType = 'apple') AS appleKey
, min(purchaseDate) FILTER (WHERE fruitType = 'apple') AS apple
, min(purchaseDate || '-' || fruitId) FILTER (WHERE fruitType = 'orange') AS orangeKey
, min(purchaseDate) FILTER (WHERE fruitType = 'orange') AS orange
, customer
FROM fruitPurchases
GROUP BY customer;
I wouldn't be caught using that.
There are the related window function first_value()
and last_value()
, but those don't aggregate. And you cannot use the FILTER
clause, which is only for aggregate function. So you would need an extra query level, and it's simpler to just use the window function row_number()
instead to mark the first row for each group in a subquery or CTE ...
@Gordon made it work with reversed frame definitions for the window function. Consider this simplified, completed, and optimized version:
SELECT DISTINCT ON (customer)
customer
, first_value(fruitId || '-' || purchaseDate) OVER a AS appleKey
, first_value(purchaseDate) OVER a AS appleFirstPurchaseDate
, first_value(fruitId || '-' || purchaseDate) OVER o AS orangeKey
, first_value(purchaseDate) OVER o AS orangeFirstPurchaseDate
FROM (SELECT * FROM fruitPurchases WHERE fruitType IN ('apple', 'orange')) sub
WINDOW a AS (PARTITION BY customer ORDER BY fruittype ASC , purchaseDate)
, o AS (PARTITION BY customer ORDER BY fruittype DESC, purchaseDate);
But this should be faster:
WITH cte AS (
SELECT DISTINCT ON (customer, fruitType)
customer, fruitType, fruitId || '-' || purchaseDate AS key, purchaseDate
FROM fruitPurchases
WHERE fruitType IN ('apple', 'orange')
ORDER BY customer, fruitType, purchaseDate
)
SELECT customer
, a.key AS appleKey
, a.purchaseDate AS appleFirstPurchaseDate
, o.key AS orangeKey
, a.purchaseDate AS orangeFirstPurchaseDate
FROM cte a
JOIN cte o USING (customer)
WHERE a.fruitType = 'apple'
AND o.fruitType = 'orange';
All it needs to excel is an index on (customer, fruitType, purchaseDate)
- or on (customer, fruitType, purchaseDate, fruitId)
if your table allows index-only scans. Details depend on undisclosed information. Related:
The CTE computes the greatest per group with DISTINCT ON
:
Depending on actual data distribution there may be even faster techniques:
The outer SELECT
is a simplistic pivoting technique. Works for any number of fruits.
The same using said window function row_number()
in the CTE:
WITH cte AS (
SELECT customer, fruitType, fruitId || '-' || purchaseDate AS key, purchaseDate
, row_number() OVER (PARTITION BY customer, fruitType ORDER BY purchaseDate) AS rn
FROM fruitPurchases
WHERE fruitType IN ('apple', 'orange')
)
SELECT customer
, a.key AS appleKey
, a.purchaseDate AS appleFirstPurchaseDate
, o.key AS orangeKey
, a.purchaseDate AS orangeFirstPurchaseDate
FROM cte a
JOIN cte o USING (customer, rn)
WHERE a.rn = 1
AND a.fruitType = 'apple'
AND o.fruitType = 'orange';
dbfiddle here
But I would seriously consider normalizing your DB design first, which would make the task simpler and faster.