How to label groups in postgresql when group belonging depends on the preceding line?

Question

I want, in a request, to fill all Null values by the last known value. When it's in a table and not in a request, it's easy:

If I define and fill my table as follows:

CREATE TABLE test_fill_null (
    date INTEGER,
    value INTEGER
);

INSERT INTO test_fill_null VALUES 
    (1,2),
    (2, NULL), 
    (3, 45), 
    (4,NULL), 
    (5, null);

SELECT * FROM test_fill_null ;
 date | value 
------+-------
    1 |     2
    2 |      
    3 |    45
    4 |      
    5 |

Then I just have to fill like that:

UPDATE test_fill_null t1
SET value = (
    SELECT t2.value 
    FROM test_fill_null t2 
    WHERE t2.date <= t1.date AND value IS NOT NULL 
    ORDER BY t2.date DESC 
    LIMIT 1
);

SELECT * FROM test_fill_null;
 date | value 
------+-------
    1 |     2
    2 |     2
    3 |    45
    4 |    45
    5 |    45

But now, I'm in a request, like this one:

WITH
    pre_table AS(
        SELECT 
            id1,
            id2,
            tms,
            CASE 
                WHEN tms - lag(tms) over w < interval '5 minutes' THEN NULL
                ELSE id2
            END as group_id
        FROM
            table0 
        window w as (partition by id1 order by tms)
    )

Where the group_id is set to id2 when the previous point is distant from more than 5 minutes, null otherwise. By doing so, I want to end up with group of points that follow each other by less than 5 minutes, and gaps of more than 5 minutes between each groups.

Then I don't know how to proceed. I tried:

    SELECT distinct on (id1, id2)
        t0.id1,
        t0.id2,
        t0.tms,
        t1.group_id
    FROM
        pre_table t0
        LEFT JOIN (
            select
                id1,
                tms,
                group_id
            from pre_table t2
            where t2.group_id is not null
            order by tms desc
        ) t1
        ON 
            t1.tms <= t0.tms AND
            t1.id1 = t0.id1  
    WHERE 
        t0.id1 IS NOT NULL
    ORDER BY
        id1,
        id2,
        t1.tms DESC

But in the final result I have some group with two consecutive points which are distant from more than 5 minutes. Their should be two different groups in this case.

So with one NULL value every 5 minutes, the same group can be perpetuated indefinitely, right? As always: your Postgres version please. And the nice test case at the outset turns out to be *not* applicable to your actual problem. It would make a lot more sense to provide the test case for your actual problem. — Erwin Brandstetter, Dec 16 '15 at 15:29
yes exactly. my version is 9.3.10. I can't give the data as they are. I will produce some fake data with the same model if you want. — Borbag, Dec 16 '15 at 16:19
The assumed underlying order of rows is also undefined. Do you order by `id1`, by `tms` or by `id1, id2, tms`? — Erwin Brandstetter, Dec 16 '15 at 16:24
@Erwin Brandstetter: The thing that helped me the most was the fact that I could use count over a window, and it will increment for each non null value. Should I edit the title of my question in order to reflect that ? — Borbag, Dec 18 '15 at 16:48
If you can think of a title that captures the essence of your question more closely, go ahead! — Erwin Brandstetter, Dec 19 '15 at 00:29

score 2 · Accepted Answer · edited May 23 '17 at 11:54

2

A "select within a select" is more commonly called "subselect" or "subquery" In your particular case it's a correlated subquery. LATERAL joins (new in postgres 9.3) can largely replace correlated subqueries with more flexible solutions:

What is the difference between LATERAL and a subquery in PostgreSQL?

I don't think you need either here.

For your first case this query is probably faster and simpler, though:

SELECT date, max(value) OVER (PARTITION BY grp) AS value
FROM  (
   SELECT *, count(value) OVER (ORDER BY date) AS grp
   FROM   test_fill_null
   ) sub;

count() only counts non-null values, so grp is incremented with every non-null value, thereby forming groups as desired. It's trivial to pick the one non-null value per grp in the outer SELECT.

For your second case, I'll assume the initial order of rows is determined by (id1, id2, tms) as indicated by one of your queries.

SELECT id1, id2, tms
     , count(step) OVER (ORDER BY id1, id2, tms) AS group_id
FROM  (
   SELECT *, CASE WHEN lag(tms, 1, '-infinity') OVER (PARTITION BY id1 ORDER BY id2, tms)
                       < tms - interval '5 min'
                  THEN true END AS step
   FROM   table0
   ) sub
ORDER  BY id1, id2, tms;

Adapt to your actual order. One of these might cover it:

PARTITION BY id1 ORDER BY id2  -- ignore tms
PARTITION BY id1 ORDER BY tms  -- ignore id2

SQL Fiddle with an extended example.

Select longest continuous sequence

edited May 23 '17 at 11:54

Community

1
1

answered Dec 16 '15 at 16:30

Erwin Brandstetter

605,456
145
1,078
1,228

I tried the SQL Fiddle, and I had to correct the data to reflect mine. By doing so I saw that id1 was not a good name, and I mislead you. id1 could be seen as a foreign key. I corrected as follow : INSERT INTO table0 VALUES (1,2, '2015-12-16 16:09:00+01'), (1,3, '2015-12-16 16:11:00+01'), (2,4, '2015-12-16 16:10:00+01'), (2,5, '2015-12-16 16:12:00+01'), (1,6, '2015-12-16 16:13:01+01'); Since the correction your solution doesn't seem to work, but I'll work on it. If I fully understand your answer I might be able to crrect it. – Borbag Dec 18 '15 at 08:57
Actually for my second case, you gave me another way to have my 'context request', not the 'labelling request'. (not so different, you just pass 'tms' to the other side of the inequation). With the help of your answer for the first case I completed it so it would do I want. [sql Fiddle)(http://sqlfiddle.com/#!15/af4e64/9) What do you think about it ? – Borbag Dec 18 '15 at 09:16
I wen from 6 minutes with my query to 5 seconds with yours, thanks ! Just edit your answer with the second part of the query In my fiddle and I will validate your answer – Borbag Dec 18 '15 at 09:29
It can even be simplified like this : [sql fiddle](http://sqlfiddle.com/#!15/af4e64/13) – Borbag Dec 18 '15 at 09:51
@RemiDelassus: I added an updated query, simplified some more. A subquery is typically faster than a CTE in Postgres, and we don't *need* a CTE here. – Erwin Brandstetter Dec 18 '15 at 11:19

Borbag · Answer 2 · 2015-12-18T08:50:22.160

While editing my question I found a solution. It's pretty low though, much lower than my example within a table. Any suggestion to improve it ?

    SELECT
        t2.id1,
        t2.id2,
        t2.tms,
        (
            SELECT t1.group_id 
            FROM pre_table t1 
            WHERE 
                t1.tms <= t2.tms 
                AND t1.group_id IS NOT NULL 
                AND t2.id1 = t2.id1
            ORDER BY t1.tms DESC 
            LIMIT 1
        ) as group_id
    FROM
        pre_table t2
    ORDER BY
        t2.id1
        t2.id2
        t2.tms

So as I said, a select within a select

How to label groups in postgresql when group belonging depends on the preceding line?

2 Answers2

Linked