1

This question follows this one.

I now have this code:

select
    min(purchaseDate) filter (where fruitType = 'apple') as appleFirstPurchaseDate,
    min(purchaseDate) filter (where fruitType = 'orange') as orangeFirstPurchaseDate,
    customer
from fruitPurchases
group by customer

Which gives the following output:

appleFirstPurchaseDate  orangeFirstPurchaseDate
----------------------  -----------------------
       2017-05-03              2016-11-25

You should also know that each piece of fruit has an ID. With this ID, I'd like to create two keys: one that concatenates the fruit ID with the appleFirstPurchaseDate and the other one that concatenates the fruit ID with the orangeFirstPurchaseDate

Therefore I tried this:

select
    fruitId || '-' || min(purchaseDate) filter (where fruitType = 'apple') as appleKey,
    min(purchaseDate) filter (where fruitType = 'apple') as apple,
    fruitId || '-' || min(purchaseDate) filter (where fruitType = 'orange') as orangeKey,
    min(purchaseDate) filter (where fruitType = 'orange') as orange,
    customer
from fruitPurchases
group by customer, fruitId

But when a customer has already bought apples as well as oranges, it unfortunately gives me this:

  appleKey        appleFirstPurchaseDate    orangeKey    orangeFirstPurchaseDate
  --------        ----------------------    ---------    -----------------------
283-2017-05-03          2017-05-03           [NULL]             [NULL]
   [NULL]                 [NULL]          322-2016-11-25      2016-11-25

While I would like this:

  appleKey        appleFirstPurchaseDate    orangeKey    orangeFirstPurchaseDate
  --------        ----------------------    ---------    -----------------------
283-2017-05-03          2017-05-03       322-2016-11-25        2016-11-25

Last piece of information: an earlier FirstPurchaseDate does not imply a lower fruitID.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Guillaume
  • 168
  • 2
  • 14

2 Answers2

3

This is a combination of a "greatest-per-group" problem, followed by "pivoting".

General assumptions:

  • All involved columns are defined NOT NULL - or there are problem with sorting and concatenation.

  • (customer, fruidType, purchaseDate) is unique - or you need to define rules how to break ties.

  • You want just two fruits in the result.

The basic problem in your query: you only want customer in the GROUP BY clause, since you want one row per customer. Not customer, fruitid which produces one row per combination of customer and fruitid.

But there is no built-in aggregate function to retrieve, in a single step,
the fruitID from the same row that also holds the earliest purchaseDate for each (customer, fruidType).

You could make your query work by appending the fruitID (instead of prepending) because the concatenated text would still sort with the earliest date first, but that's very ugly and needlessly slow:

SELECT customer
     , min(purchaseDate || '-' || fruitId) FILTER (WHERE fruitType = 'apple')  AS appleKey
                       , min(purchaseDate) FILTER (WHERE fruitType = 'apple')  AS apple
     , min(purchaseDate || '-' || fruitId) FILTER (WHERE fruitType = 'orange') AS orangeKey
                       , min(purchaseDate) FILTER (WHERE fruitType = 'orange') AS orange
     , customer
FROM   fruitPurchases
GROUP  BY customer;

I wouldn't be caught using that.

There are the related window function first_value() and last_value(), but those don't aggregate. And you cannot use the FILTER clause, which is only for aggregate function. So you would need an extra query level, and it's simpler to just use the window function row_number() instead to mark the first row for each group in a subquery or CTE ...

@Gordon made it work with reversed frame definitions for the window function. Consider this simplified, completed, and optimized version:

SELECT DISTINCT ON (customer)
       customer
     , first_value(fruitId || '-' || purchaseDate) OVER a AS appleKey
     , first_value(purchaseDate)                   OVER a AS appleFirstPurchaseDate
     , first_value(fruitId || '-' || purchaseDate) OVER o AS orangeKey
     , first_value(purchaseDate)                   OVER o AS orangeFirstPurchaseDate
FROM  (SELECT * FROM fruitPurchases WHERE fruitType IN ('apple', 'orange')) sub
WINDOW a AS (PARTITION BY customer ORDER BY fruittype ASC , purchaseDate)
     , o AS (PARTITION BY customer ORDER BY fruittype DESC, purchaseDate);

But this should be faster:

WITH cte AS (
   SELECT DISTINCT ON (customer, fruitType)
          customer, fruitType, fruitId || '-' || purchaseDate AS key, purchaseDate
   FROM   fruitPurchases
   WHERE  fruitType IN ('apple', 'orange')
   ORDER  BY customer, fruitType, purchaseDate
   )
SELECT customer
     , a.key          AS appleKey
     , a.purchaseDate AS appleFirstPurchaseDate
     , o.key          AS orangeKey
     , a.purchaseDate AS orangeFirstPurchaseDate
FROM   cte a
JOIN   cte o USING (customer)
WHERE  a.fruitType = 'apple'
AND    o.fruitType = 'orange';

All it needs to excel is an index on (customer, fruitType, purchaseDate) - or on (customer, fruitType, purchaseDate, fruitId) if your table allows index-only scans. Details depend on undisclosed information. Related:

The CTE computes the greatest per group with DISTINCT ON:

Depending on actual data distribution there may be even faster techniques:

The outer SELECT is a simplistic pivoting technique. Works for any number of fruits.

The same using said window function row_number() in the CTE:

WITH cte AS (
   SELECT customer, fruitType, fruitId || '-' || purchaseDate AS key, purchaseDate
        , row_number() OVER (PARTITION BY customer, fruitType ORDER BY purchaseDate) AS rn
   FROM   fruitPurchases
   WHERE  fruitType IN ('apple', 'orange')
   )
SELECT customer
     , a.key          AS appleKey
     , a.purchaseDate AS appleFirstPurchaseDate
     , o.key          AS orangeKey
     , a.purchaseDate AS orangeFirstPurchaseDate
FROM   cte a
JOIN   cte o USING (customer, rn)
WHERE  a.rn = 1
AND    a.fruitType = 'apple'
AND    o.fruitType = 'orange';

dbfiddle here

But I would seriously consider normalizing your DB design first, which would make the task simpler and faster.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
1

You might find this easier with window functions:

select distinct customer,
       first_value(fruitId || '-' purchaseDate) filter (where fruittype = 'apple') over (partition by customer order by purchaseDate) as appleKey,
       first_value(fruitId || '-' purchaseDate) filter (where fruittype = 'orange') over (partition by customer order by purchaseDate) as orangeKey
from fruitPurchases;

You can also write this as:

select distinct customer,
       first_value(case when fruittype = 'apple' then fruitId || '-' || purchaseDate end) over (partition by customer order by (fruittype = 'apple')::int desc, purchaseDate) as appleKey,
       first_value(case when fruittype = 'orange' then fruitId || '-' || purchaseDate end) over (partition by customer order by (fruittype = 'orange')::int desc, purchaseDate) as orangeKey
from fruitPurchases;

Here is a SQL Fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I get the following error message: "FILTER is not implemented for non-aggregate window functions". Besides, how is this supposed to yield the minimum of the different purchaseDates? – Guillaume Nov 14 '17 at 21:45
  • Gordon, neither the syntax, nor the logic work out in the 2nd query. – Erwin Brandstetter Nov 15 '17 at 05:39
  • @ErwinBrandstetter . . . For the second, one, I just don't know what I was thinking. It is fixed now. – Gordon Linoff Nov 15 '17 at 13:07
  • Now it works! `CASE` is redundant with the new frame definition. And performance might be optimized. I added a variant to my answer. (And the 1st query is still illegal - no `FILTER` for genuine window functions.) – Erwin Brandstetter Nov 15 '17 at 15:47