Using LAG() and PARTITION BY to return rows IF within 10 days of date

Question

I have the following table called fruits:

id  fruit_bought   quantity   date
1 | orange       | 100      | 2018-01-10
2 | apple        | 50       | 2018-02-05
3 | orange       | 75       | 2018-03-07
4 | orange       | 200      | 2018-03-15
5 | apple        | 10       | 2018-03-17
6 | orange       | 20       | 2018-03-20

I want to return rows that have fruit_bought of orange if any time in the previous 10 days any oranges were bought, starting with the date 2018-03-20 (row with id 6).

For example:

Starting at 2018-03-20, there was a purchase of oranges on this date (row id 6)
Was there any orange purchases 10 days prior from this? Yes: on '2018-03-15' (row id 4)
Was there any orange purchases 10 days prior from this date? Yes: on '2018-03-07' (row id 3)
Was there any orange purchases 10 days prior from this date? No.

In the end, the query I'm trying to create would return the rows with the id of 3, 4, and 6 (but not 1).

The query I have so far is the following:

SELECT *, LAG(date, 1) OVER (PARTITION BY fruit_bought) FROM fruits
WHERE fruit_bought = 'orange';

This returns every row where fruit_bought is orange, and it adds an additional lag column.

unutbu · Accepted Answer · 2019-02-09T15:08:55.157

This answer is based on Gordon Linoff's idea, but with a few tweaks:

FILTER is not implemented for pure window functions like lead() or lag() in Postgresql 11 (yet). So use WHERE fruit_bought='orange' as a condition on the entire inner SELECT.
To guarantee selection of the row with the last date, use LEAD(date, 1, '-infinity'). This makes the default value for next_date equal to a -infinity timestamp. Thus date >= next_date - interval '10 day' will be TRUE for the last date.
Let's call rows within 10 days of each other a cluster. To select only rows from the last cluster, compute a cumulative sum which counts how many times cond is FALSE (since FALSE values separate clusters):
```
SUM(CASE WHEN cond IS TRUE THEN 0 ELSE 1 END) OVER (ORDER BY date DESC) AS cluster_num
```
and select only rows for which cluster_num equals 0. Since we ORDER BY date DESC, the 0th cluster is the last cluster.

SELECT *
FROM (
    SELECT *, SUM(CASE WHEN cond IS TRUE THEN 0 ELSE 1 END) OVER (ORDER BY date DESC) AS cluster_num
    FROM (
        SELECT *, date >= next_date - interval '10 day' AS cond
        FROM (
            SELECT id, fruit_bought, date, 
                LEAD(date, 1, '-infinity') 
                OVER (PARTITION BY fruit_bought ORDER BY date) AS next_date 
            FROM fruits 
            WHERE fruit_bought='orange'
            -- restrict date here to specify an "initial date"
            AND date <= '2018-04-01'  
        ) t1
    ) t2
) t3
WHERE cond AND cluster_num = 0
ORDER BY date ASC

yields

| id | fruit_bought |       date |  next_date | cond | cluster_num |
|----+--------------+------------+------------+------+-------------|
|  3 | orange       | 2018-03-07 | 2018-03-15 | t    |           0 |
|  4 | orange       | 2018-03-15 | 2018-03-20 | t    |           0 |
|  6 | orange       | 2018-03-20 |  -infinity | t    |           0 |

Setup:

CREATE TABLE fruits (
    fruitid INT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    id INT,
    fruit_bought TEXT,
    quantity INT,
    date DATE);

INSERT INTO fruits (id, fruit_bought, quantity, date)
VALUES (1,'orange',100,'2018-01-10')
, (2,'apple',50,'2018-02-05')
, (3,'orange',75,'2018-03-07')
, (4,'orange',200,'2018-03-15')
, (5,'apple',10,'2018-03-17')
, (6,'orange',20,'2018-03-20')
, (7,'orange',20,'2018-01-09');

Is there a way to provide an initial date to being the search from? For example: start from date `2018-03-20` and go backwards? The query in your example seems to search through my entire table. — MonkeyOnARock, Feb 09 '19 at 14:31
Sure -- you can use a `date <= initial_date` conditional in the innermost `SELECT` to restrict which dates are part of the overall query. I've edited the post to show what I mean. — unutbu, Feb 09 '19 at 15:10

score 0 · Answer 2 · answered Feb 09 '19 at 12:40

One method is lag() with filter . . . but used like this:

select f.*
from (select f.*,
             lag(date) filter (where fruit_bought = 'orange') over (order by date) as prev_orange_date
      from fruits f
     ) f
where prev_orange_date >= date - interval '10 day';

However, exists also comes to mind:

select f.*
from fruits f
where exists (select 1
              from fruits f2
              where f2.fruit_bought = 'orange' and
                    f2.date >= f.date - interval '10 day' and
                    f2.date < f.date
             );

Both of these queries assume that dates are unique, as in your example. If you have ties, then each can work. However, you would have to specify how to handle days when an orange is purchased.

score 0 · Answer 3 · edited May 16 '20 at 23:13

0

Try

select fruit_bought, min(date), max(date) group by fruit_bought
having (max(date) - min(date)) <= 10;

edited May 16 '20 at 23:13

Dharman

30,962
25
85
135

answered Feb 09 '19 at 12:54

Jim Macaulay

4,709
4
28
53

Using LAG() and PARTITION BY to return rows IF within 10 days of date

3 Answers3